Optimizing 3proxy for High Load
Precaution 1: 3proxy was not initially developed for high load and is positioned as a SOHO product. The main reason is the "one connection - one thread" model 3proxy uses. 3proxy is known to work with over 200,000 connections under proper configuration, but use it in a production environment under high loads at your own risk and do not expect too much.
Precaution 2: This documentation is incomplete and insufficient. High loads may require very specific system tuning including, but not limited to, specific or customized kernels, builds, settings, sysctls, options, etc. None of this is covered by this documentation.
Configuring 'maxconn'
The number of simultaneous connections per service is limited by the 'maxconn' option.
The default maxconn value since 3proxy 0.8 is 500. You may want to set 'maxconn'
to a higher value. Under this configuration:
maxconn 1000
proxy -p3129
proxy -p3128
socks
maxconn for every service is 1000, and there are 3 services running
(2 proxy and 1 socks), so for all services there can be up to 3000
simultaneous connections to 3proxy.
Avoid setting 'maxconn' to an arbitrarily high value; it should be carefully
chosen to protect the system and proxy from resource exhaustion. Setting maxconn
above available resources can lead to denial of service conditions.
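The arithmetic above can be sketched in Python. This is only an illustration: the list of service commands is partial, and the sketch assumes that a 'maxconn' line applies to the services started after it.

```python
# Service commands recognised by this sketch (a partial list, for illustration).
SERVICE_COMMANDS = {"proxy", "socks", "ftppr", "pop3p", "smtpp", "tcppm", "udppm"}
DEFAULT_MAXCONN = 500  # default since 3proxy 0.8, per the text above


def connection_ceiling(config_text):
    """Sum the maxconn value in effect at each service line."""
    maxconn = DEFAULT_MAXCONN
    total = 0
    for raw in config_text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        if words[0] == "maxconn" and len(words) > 1:
            maxconn = int(words[1])
        elif words[0] in SERVICE_COMMANDS:
            total += maxconn
    return total


example = """
maxconn 1000
proxy -p3129
proxy -p3128
socks
"""
print(connection_ceiling(example))  # 3000
```

Use the resulting ceiling as the input for the resource calculations that follow.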
Understanding Resource Requirements
Each running service requires:
- 1 thread (process)
- 1 socket (file descriptor)
- 1 stack memory segment + some heap memory, ~64K-128K depending on the system
Each connected client requires:
- 1 thread (process)
- 2 sockets (file descriptors). For FTP, 4 sockets are required.
  Under Linux since 0.9, splice() is used. It is much more efficient but requires
  2 sockets (file descriptors) + 2 pipes (file descriptors) = 4 file descriptors.
  For FTP with splice(), 4 sockets and 2 pipes are required.
- Up to 128K (up to 256K in the case of splice()) of kernel buffer memory. This is the theoretical maximum; actual numbers depend on connection quality and traffic amount.
- 1 additional socket (file descriptor) during name resolution for non-cached names
- 1 additional socket while RADIUS authentication or RADIUS logging is in progress
- 1 ephemeral port (3 ephemeral ports for FTP connections).
- 1 stack memory segment of ~32K-128K depending on the system + at least 16K and up to a few MB (for 'proxy' and 'ftppr') of heap memory. If you are short on memory, prefer 'socks' over 'proxy' and 'ftppr'.
- Many system buffers, especially in the case of slow network connections.
Also, additional resources like system buffers are required for network activity.
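As a rough planning aid, the per-service and per-client figures above can be combined into a lower-bound estimate. The function name and the 'transient_sockets' parameter are assumptions of this sketch; memory and kernel buffers are left out because they vary too much between systems.

```python
def estimate_resources(services, clients, splice=True, ftp=False, transient_sockets=1):
    """Lower-bound estimate built from the per-service and per-client lists above.

    transient_sockets reserves room for the extra socket used during name
    resolution or RADIUS requests (an assumption: at most one at a time).
    """
    sockets = 4 if ftp else 2
    pipes = 2 if splice else 0
    fds_per_client = sockets + pipes + transient_sockets
    return {
        "threads": services + clients,
        "file_descriptors": services + clients * fds_per_client,
        "ephemeral_ports": clients * (3 if ftp else 1),
    }


# 3 services with maxconn 1000 each, splice enabled, no FTP.
estimate = estimate_resources(services=3, clients=3000)
print(estimate["file_descriptors"])  # 15003
```

Treat the result as a minimum and leave headroom for log files, libraries, and other descriptors the process opens.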
Setting ulimits
Hard and soft ulimits must be set above calculated requirements. Under Linux, you can
check the limits of a running process with
cat /proc/PID/limits
where PID is the process ID.
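The same check can be scripted. The sketch below parses text in the /proc/PID/limits layout and tests whether a limit covers a required value; the function names and the sample text are illustrative, and the parser assumes columns are separated by at least two spaces.

```python
import re

SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files
Max processes             unlimited            unlimited            processes
"""


def parse_limits(text):
    """Return {limit name: (soft, hard)} from /proc/PID/limits style text."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def limit_covers(text, name, required):
    """True if both the soft and the hard limit are at least `required`."""
    def enough(value):
        return value == "unlimited" or int(value) >= required

    soft, hard = parse_limits(text)[name]
    return enough(soft) and enough(hard)


print(limit_covers(SAMPLE, "Max open files", 15003))  # False
print(limit_covers(SAMPLE, "Max processes", 3003))    # True
```

Against a live process, the text would come from reading /proc/PID/limits with the real process ID substituted.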
Validate that ulimits match your expectations, especially if you run 3proxy under a dedicated account
by adding, e.g.:
system "ulimit -Ha >>/tmp/3proxy.ulim.hard"
system "ulimit -Sa >>/tmp/3proxy.ulim.soft"
at the beginning (before the first service is started) and at the end of the config file.
Perform both a hard restart (i.e., kill and start the 3proxy process) and a soft restart
by sending SIGUSR1 to the 3proxy process; check that the ulimits recorded to files match your
expectations. In systemd-based distros (e.g., recent Debian/Ubuntu), changing limits.conf
is not enough; limits must be adjusted in the systemd configuration, e.g., by setting:
DefaultLimitDATA=infinity
DefaultLimitSTACK=infinity
DefaultLimitCORE=infinity
DefaultLimitRSS=infinity
DefaultLimitNOFILE=102400
DefaultLimitAS=infinity
DefaultLimitNPROC=10240
DefaultLimitMEMLOCK=infinity
in user.conf / system.conf (typically under /etc/systemd/).
Extending System Limitations
Check the manuals/documentation for your system's limitations, e.g., the system-wide limit for the number of open files
(fs.file-max in Linux). You may need to change sysctls or even rebuild the kernel from source.
To help with socket-based system-dependent settings, since 0.9-devel, 3proxy supports different
socket options which can be set via the -ol option for the listening socket, -oc for the proxy-to-client
socket, and -os for the proxy-to-server socket. Example:
proxy -olSO_REUSEADDR,SO_REUSEPORT -ocTCP_TIMESTAMPS,TCP_NODELAY -osTCP_NODELAY
Available options are system-dependent.
Using 3proxy in a Virtual Environment
If 3proxy is used in a VPS environment, there can be additional limitations.
For example, kernel resources, system CPU usage, and IOCTLs can be limited differently, and this can become a bottleneck.
Since 0.9-devel, 3proxy uses splice() by default on Linux. splice() prevents network traffic from being copied from
kernel space to the 3proxy process and generally increases throughput, especially in the case of high-volume traffic. This is especially
true for virtual environments (it can improve throughput up to 10 times) unless there are additional kernel limitations.
Since some work is moved to the kernel, it requires up to 2 times more kernel resources in terms of CPU, memory, and IOCTLs.
If your hosting additionally limits kernel resources (you can see this as nearly 100% CPU usage without any real CPU activity for
any application performing IOCTLs), use the -s0 option to disable splice() usage for a given service, e.g.:
socks -s0
Extending the Ephemeral Port Range
Check the ephemeral port range for your system and extend it to the number of
ports required.
The ephemeral range is always limited to the maximum number of ports (64K). To extend the
number of outgoing connections above this limit, extending the ephemeral port range
is not enough; you need additional actions:
- Configure multiple outgoing IPs
- Make sure 3proxy is configured to use a different outgoing IP by either setting
the external IP via RADIUS:
radius secret 1.2.3.4
auth radius
proxy
or by using multiple services with different external
interfaces, for example:
allow user1,user11,user111
proxy -p1111 -e1.1.1.1
flush
allow user2,user22,user222
proxy -p2222 -e2.2.2.2
flush
allow user3,user33,user333
proxy -p3333 -e3.3.3.3
flush
allow user4,user44,user444
proxy -p4444 -e4.4.4.4
flush
or via "parent extip" rotation,
e.g.:
allow user1,user11,user111
parent 1000 extip 1.1.1.1 0
allow user2,user22,user222
parent 1000 extip 2.2.2.2 0
allow user3,user33,user333
parent 1000 extip 3.3.3.3 0
allow user4,user44,user444
parent 1000 extip 4.4.4.4 0
proxy
or
allow *
parent 250 extip 1.1.1.1 0
parent 250 extip 2.2.2.2 0
parent 250 extip 3.3.3.3 0
parent 250 extip 4.4.4.4 0
socks
Under recent Linux versions, you can also start multiple services with different
external addresses on a single port with SO_REUSEPORT on the listening socket to
evenly distribute incoming connections between outgoing interfaces:
socks -olSO_REUSEPORT -p3128 -e1.1.1.1
socks -olSO_REUSEPORT -p3128 -e2.2.2.2
socks -olSO_REUSEPORT -p3128 -e3.3.3.3
socks -olSO_REUSEPORT -p3128 -e4.4.4.4
For web browsing, the last two examples are not recommended because the same client can get
a different external address for different requests; you should choose the external
interface with user-based rules instead.
- You may need additional system-dependent actions to use the same port on different IPs,
usually by adding the SO_REUSEADDR (SO_PORT_SCALABILITY for Windows) socket option to
the external socket. This option can be set (since 0.9-devel) with the -os option:
proxy -p3128 -e1.2.3.4 -osSO_REUSEADDR
The behavior for SO_REUSEADDR and SO_REUSEPORT is different between different systems,
even between different kernel versions, and can lead to unexpected results.
Consult your system's socket documentation for the specifics.
Use these options only if actually required and if you fully understand the possible
consequences. For example, SO_REUSEPORT can help establish more connections than the
number of client ports available, but it can also lead to situations where connections
randomly fail due to IP+port pair collisions if the remote or local system
doesn't support this trick.
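To get a feel for how many outgoing IPs a target load needs, the port arithmetic can be sketched as follows. It assumes, as the text above does, that every outgoing connection consumes one ephemeral port on its external IP; the function names are illustrative, and 32768-60999 is the usual Linux default range.

```python
import math


def ports_in_range(low, high):
    """Number of ports in an inclusive ephemeral range."""
    return high - low + 1


def outgoing_ips_needed(connections, low, high):
    """Minimum number of external IPs if each connection takes one port."""
    return math.ceil(connections / ports_in_range(low, high))


def read_linux_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the current range on Linux; the file holds two integers."""
    with open(path) as handle:
        low, high = handle.read().split()
    return int(low), int(high)


print(ports_in_range(32768, 60999))                # 28232
print(outgoing_ips_needed(200_000, 32768, 60999))  # 8
print(outgoing_ips_needed(200_000, 1024, 65535))   # 4
```

FTP connections use 3 ephemeral ports each, so multiply the connection count accordingly before calling the function.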
Setting Stack Size
'stacksize' is a value added to all stack allocations and can be either positive or
negative. The stack is required for function calls. 3proxy itself doesn't require a
large stack, but a large stack can be needed if poorly written libc, 3rd party
library, or system functions are called. Known offenders are Unix ODBC
implementations and built-in DNS resolvers, especially in the case of IPv6 and a
large number of interfaces. Under most 64-bit systems, extending stacksize leads
to additional address space usage but does not require actual committed memory,
so you can increase stacksize to a relatively large value (e.g., 1024000) without
adding physical memory. This is system/libc dependent and requires additional
testing under your installation. Don't forget about memory-related ulimits.
For 32-bit systems, address space can be a bottleneck you should consider. If
you're short on address space, you can try using a negative stack size.
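The address space pressure can be estimated with simple arithmetic. The sketch below assumes one stack per thread and a base stack of 128K (the upper figure quoted earlier for client threads); both are assumptions to adjust for your system.

```python
GIB = 1024 ** 3


def stack_address_space(threads, base_stack, stacksize):
    """Address space reserved for stacks: threads * (base stack + stacksize)."""
    return threads * (base_stack + stacksize)


# 10,000 threads with stacksize 1024000: fine on 64-bit, impossible on 32-bit.
large = stack_address_space(10_000, 128 * 1024, 1_024_000)
print(large)            # 11550720000
print(large > 4 * GIB)  # True

# A negative stacksize shrinks every stack and saves address space.
small = stack_address_space(10_000, 128 * 1024, -65_536)
print(small)            # 655360000
```

A 32-bit process has at most 4 GiB of address space in total, so the first configuration cannot fit there regardless of physical memory.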
Known System Issues
There are known race condition issues in the Linux/glibc resolver. The probability
of hitting a race condition increases in configurations with IPv6, or with a large
number of interfaces, IP addresses, or configured resolvers. In this case, install
a local recursor and use 3proxy's built-in resolver (nserver / nscache / nscache6).
Do Not Use Public Resolvers
Public resolvers like those from Google have rate limits. For a large number of
requests, install a local caching recursor (ISC bind named, PowerDNS recursor, etc).
Avoid Large Lists
Currently, 3proxy is not optimized to use large ACLs, user lists, etc. All lists
are processed linearly. In the devel version, you can use RADIUS authentication to avoid
user lists and ACLs in 3proxy itself. Also, RADIUS allows you to easily set an outgoing IP
on a per-user basis or implement more sophisticated logic.
RADIUS is a new beta feature; test it before using it in production.
Avoid Changing Configuration Too Often
Every configuration reload requires additional resources. Do not make frequent
changes, such as user addition/deletion via configuration; use alternative
authentication methods instead, like RADIUS.
Consider Using 'noforce'
The 'force' behavior (default) re-authenticates all connections after
configuration reload; it may be resource-consuming with a large number of
connections. Consider adding the 'noforce' command before services are started
to prevent connection re-authentication.
Do Not Monitor Configuration Files Directly
Using a configuration file directly in 'monitor' can lead to a race condition where
the configuration is reloaded while the file is being written.
To avoid race conditions:
- Update config files only if there is no lock file
- Create a lock file when the 3proxy configuration is updated, e.g., with
"touch /some/path/3proxy/3proxy.lck". If you generate config files
asynchronously, e.g., by a user's request via web, you should consider
implementing existence checking and file creation as an atomic operation.
- Add
system "rm /some/path/3proxy/3proxy.lck"
at the end of the config file to remove it after the configuration is successfully loaded
- Use a dedicated version file to monitor, e.g.:
monitor "/some/path/3proxy/3proxy.ver"
- After the config is updated, change the version file for 3proxy to reload the configuration,
e.g., with "touch /some/path/3proxy/3proxy.ver".
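The steps above can be sketched in Python for a config generator. Existence checking and lock creation are done in one atomic step with O_CREAT | O_EXCL; the function names and file names are illustrative, and the demo runs in a temporary directory rather than against a real 3proxy installation.

```python
import os
import tempfile


def try_create_lock(lock_path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def publish_config(config_path, version_path, lock_path, new_config):
    """Write a new config and touch the monitored version file.

    Returns False without writing if a previous update is still pending.
    """
    if not try_create_lock(lock_path):
        return False
    with open(config_path, "w") as cfg:
        cfg.write(new_config)
    with open(version_path, "a"):
        pass  # create the version file if it is missing
    os.utime(version_path, None)  # bump mtime, like "touch"
    return True


with tempfile.TemporaryDirectory() as workdir:
    cfg = os.path.join(workdir, "3proxy.cfg")
    ver = os.path.join(workdir, "3proxy.ver")
    lck = os.path.join(workdir, "3proxy.lck")
    first = publish_config(cfg, ver, lck, "socks\n")
    # The lock is still present because nothing has removed it yet.
    second = publish_config(cfg, ver, lck, "proxy\n")
    print(first, second)  # True False
```

In a real deployment the lock is removed by the 'system "rm ..."' line at the end of the loaded config, which the demo leaves out, so the second call is refused.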
Use TCP_NODELAY to Speed Up Connections with Small Amounts of Data
If most requests require an exchange with a small amount of data in both directions
without the need for bandwidth, e.g., messengers or small web requests,
you can eliminate Nagle's algorithm delay with the TCP_NODELAY flag. Usage example:
proxy -osTCP_NODELAY -ocTCP_NODELAY
sets TCP_NODELAY for client (oc) and server (os) connections.
Do not use TCP_NODELAY on slow connections with high delays when
connection bandwidth is a bottleneck.
Use Splice to Speed Up Large Data Amount Transfers
splice() allows copying data between connections without copying to the process
address space. It can speed up the proxy on high-bandwidth connections if most
connections require large data transfers. Splice is enabled by default on Linux
since 0.9; "-s0" disables splice usage. Example:
proxy -s0
Splice is only available on Linux. Splice requires more system buffers and file descriptors
and produces more IOCTLs but reduces process memory and overall CPU usage.
Disable splice if there are a lot of short-lived connections with no bandwidth
requirements.
Use splice only on high-speed connections (e.g., 10GbE) when the processor, memory speed, or
system bus are bottlenecks.
TCP_NODELAY and splice are not contrary to each other and should be combined on
high-speed connections.
Add Grace Delay to Reduce System Calls
proxy -g8000,3,10
The first parameter is the average read size we want to keep, the second parameter is
the minimal number of packets in the same direction to apply the algorithm,
and the last value is the delay added after polling and prior to reading data.
The example above adds a 10-millisecond delay before reading data if the average
polling size is below 8000 bytes and 3 read operations have been made in the same
direction. It's especially useful with splice. 'logdump 1 1' is useful to see how
grace delays work; choose a delay value to avoid filling the read pipe/buffer
(typically 64K) but keep the request sizes close to the chosen average on large
file uploads/downloads.
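The decision rule described above can be illustrated with a few lines of Python. This only restates the rule as the text gives it and is not 3proxy's implementation; the function and parameter names are hypothetical.

```python
def grace_delay_ms(avg_read_size, reads_same_direction,
                   target_avg=8000, min_reads=3, delay_ms=10):
    """Delay to apply before the next read, mirroring '-g8000,3,10'.

    A delay is added only when enough reads went in the same direction
    and their average size is below the target.
    """
    if reads_same_direction >= min_reads and avg_read_size < target_avg:
        return delay_ms
    return 0


print(grace_delay_ms(1500, 5))  # 10: small reads, enough of them
print(grace_delay_ms(1500, 2))  # 0: too few reads in the same direction
print(grace_delay_ms(9000, 5))  # 0: reads are already large enough
```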