Timeouts are one of the most underrated features among Java developers - they add some complexity to the code without paying off in business functionality. They also force us to ask ourselves unpleasant questions and consider uncommon situations. Failing to understand and use them correctly often results in major availability and performance issues in production systems. This is the first of a series of blog posts covering some of the more common timeouts in the Java world. Using a bottom-up approach, I will start with the low-level stream socket API and its connect timeout.
The connect timeout is the maximal amount of time that your program is willing to spend waiting for a TCP/IP connection to be established.
TCP establishes connections using a three-way handshake mechanism - the initiator of the connection sends a SYN packet, waits for a SYN/ACK packet from the other party and responds with an ACK. For the initiator, the connection is considered established as soon as it has received the SYN/ACK packet and has sent an ACK back (sending the ACK is, however, not really blocking).
The Linux network stack re-sends SYN packets if it does not receive the corresponding SYN/ACKs in time. The time it waits for a SYN/ACK is doubled after every attempt (aka exponential back-off: 1, 2, 4, 8, ... seconds). This process is repeated up to the number of retransmissions specified in /proc/sys/net/ipv4/tcp_syn_retries. For a value of 6, the network stack will send 7 SYN packets before finally giving up: the initial one right away, the first retransmission after 1 second, the next one two seconds later, and so on until it sends the sixth retransmission. After waiting for 64 more seconds, it will give up and return a code indicating a timeout. In this case it will take a total of 127 seconds (1+2+4+8+16+32+64 seconds).
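The arithmetic above can be checked with a small sketch. It assumes, as in the description above, an initial wait of 1 second that doubles after every SYN, and it ignores any upper bound the kernel may put on a single wait. The class and method names (SynBackoff, totalTimeoutSeconds) are made up for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SynBackoff {

    // Worst-case blocking time in seconds, assuming an initial wait of
    // 1 second that doubles after every SYN that goes unanswered.
    static long totalTimeoutSeconds(int synRetries) {
        long total = 0;
        long wait = 1;
        // One wait per SYN sent: the initial SYN plus synRetries retransmissions.
        for (int i = 0; i <= synRetries; i++) {
            total += wait;
            wait *= 2;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("/proc/sys/net/ipv4/tcp_syn_retries");
        int retries = 6; // fallback: the value used in this article
        if (Files.isReadable(path)) {
            retries = Integer.parseInt(new String(Files.readAllBytes(path)).trim());
        }
        System.out.println("tcp_syn_retries = " + retries);
        System.out.println("worst case, roughly " + totalTimeoutSeconds(retries) + " s");
    }
}
```

For a value of 6 the method returns 127, which matches the sum above.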
Java offers overloaded versions of the java.net.Socket#connect() call - one with a timeout parameter and one without:
Socket socket = new Socket();
InetSocketAddress addr = new InetSocketAddress(host, port);
socket.connect(addr); // without a timeout
socket.connect(addr, 3333); // with a timeout of 3333 milliseconds
The behavior without a timeout depends on the OS network stack. As we have seen above, on my Linux machine it took 127 seconds until the kernel indicated a timeout to the JVM, which in turn raised a java.net.ConnectException. During that time, the thread that called the connect method blocks and cannot be interrupted from within the JVM - Thread.interrupt() has no effect. For many applications that value is probably way too high and might cause your application to hold on to resources for a long time.
A naive approach to provoke this scenario would be to try to connect to the wrong port / IP address. In most cases it will not take long until you get a java.net.ConnectException: Connection refused. The reason for this is that the network stack on the other side is polite and responds to your SYN packet with a RST (reset) packet - this is the TCP/IP way of saying that there is no process listening on this port. In real life, however, there is a multitude of issues that might cause SYN packets to get dropped (a paranoid firewall somewhere on the way to the other party that just drops your packets without responding in any way, an overloaded switch / router / firewall, a routing misconfiguration, ...). In such cases your program will end up blocking for longer periods of time.
You might be able to reproduce this behavior using one of the IP addresses defined in RFC 5737, e.g. 192.0.2.1 - these addresses are reserved for documentation and are normally not routed, so no response will be received. Alternatively, you could configure iptables to drop SYN packets to the target port to test how your application reacts to this situation.
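To observe the difference between a refused and a dropped connection attempt, a small probe along the following lines might help. This is only a sketch: the class ConnectProbe, its Result type and the outcome labels are invented for this illustration, and it uses the connect overload with a timeout so that the probe itself never blocks for minutes.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ConnectProbe {

    // Outcome of one connection attempt and how long it took.
    static final class Result {
        final String outcome;
        final long elapsedMillis;

        Result(String outcome, long elapsedMillis) {
            this.outcome = outcome;
            this.elapsedMillis = elapsedMillis;
        }
    }

    // Tries to connect and reports how the attempt ended.
    static Result probe(String host, int port, int timeoutMillis) {
        long start = System.nanoTime();
        String outcome;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            outcome = "CONNECTED";
        } catch (SocketTimeoutException e) {
            // No answer within timeoutMillis, e.g. the SYN packets were dropped.
            outcome = "TIMED_OUT";
        } catch (ConnectException e) {
            // The attempt was actively rejected, e.g. "Connection refused".
            outcome = "CONNECT_FAILED (" + e.getMessage() + ")";
        } catch (IOException e) {
            // Anything else, e.g. an unreachable network.
            outcome = "ERROR (" + e + ")";
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(outcome, elapsedMillis);
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "192.0.2.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 1234;
        int timeoutMillis = args.length > 2 ? Integer.parseInt(args[2]) : 3333;
        Result result = probe(host, port, timeoutMillis);
        System.out.println(result.outcome + " after " + result.elapsedMillis + " ms");
    }
}
```

Running it against a closed local port should report a failed connection almost immediately, while an address that silently drops packets should report a timeout after roughly the configured number of milliseconds. What you actually see depends on your network setup.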
Now let's have a look at the overloaded version of the connect method with a timeout. With it, you make sure that your program does not wait much longer than the timeout parameter, regardless of the OS default. The network stack behaves as above, but the JVM is given the chance to act once the specified timeout has elapsed.
You can find out how that works on a Linux host by recording the interaction between the JVM and the kernel in terms of invoked system calls (aka syscalls). This can be achieved with strace. Consider the following trace snippets:
Without a timeout:
1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 <0.000156>
On line 2, the JVM invokes the connect syscall without first switching the socket to non-blocking mode. The call will block until the network stack establishes a connection or gives up with an error. In this case we see that it returns 0 (after just 156 microseconds) - meaning that the connection has been successfully established. A failure (the network stack gave up after 127 seconds) could look like this:
1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28 <unfinished ...>
3. <... connect resumed> ) = -1 ETIMEDOUT (Connection timed out) <127.215560>
Line 3 is the actual return from the connect syscall invoked on line 2 after 127.215560 seconds.
With a timeout:
1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000078>
2. fcntl(6, F_GETFL) = 0x2 (flags O_RDWR) <0.000090>
3. fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000078>
4. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress) <0.000170>
5. poll([{fd=6, events=POLLOUT}], 1, 3316) = 1 ([{fd=6, revents=POLLOUT}]) <0.000128>
On lines 2 and 3, the JVM switches the socket to non-blocking mode using the fcntl syscall. This causes the connect syscall on the next line to return immediately (well, after 170 microseconds) with -1, indicating that the operation is in progress and that the JVM has to check the status of the operation actively (using poll). On line 5, the JVM invokes the poll syscall, telling it to wait up to 3316 milliseconds. In our case this call returns 1 after 128 microseconds, indicating that the connection has in fact been established successfully. It might also return 0 to indicate a timeout (after roughly <timeout> milliseconds), in which case the JVM takes over again and reports the problem by raising a java.net.SocketTimeoutException and giving control back to the Java program.
BTW, I noticed that the timeout passed to the poll syscall is always around 10-20 ms less than the one passed to the Java connect method. My naive explanation is that the JVM starts counting down the connection duration sometime before the call to poll and passes the difference to poll (in this case I called the connect method with 3333 ms, so I guess it started the connection duration measurement 17 ms before invoking poll).
In case of a timeout the trace would look like this (poll returned after 3.317667 seconds):
5. poll([{fd=6, events=POLLOUT}], 1, 3314 <unfinished ...>
6. <... poll resumed> ) = 0 (Timeout) <3.317667>
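The guess about the slightly shorter poll timeout can be illustrated with a deadline-style countdown. To be clear, this is not the JVM's actual implementation, only a sketch of the pattern I suspect is behind the numbers. The class ConnectDeadline and its methods are made up for this illustration.

```java
import java.util.concurrent.TimeUnit;

class ConnectDeadline {

    private final long deadlineNanos;

    // Starts the countdown at startNanos (a System.nanoTime() reading).
    ConnectDeadline(long startNanos, int timeoutMillis) {
        this.deadlineNanos = startNanos + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    // Milliseconds left at nowNanos, never negative. This is the value a
    // poll-style wait would receive as its timeout argument.
    long remainingMillis(long nowNanos) {
        long left = deadlineNanos - nowNanos;
        return left <= 0 ? 0 : TimeUnit.NANOSECONDS.toMillis(left);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        ConnectDeadline deadline = new ConnectDeadline(start, 3333);
        // Pretend that 17 ms of setup work happened before the wait begins.
        long beforePoll = start + TimeUnit.MILLISECONDS.toNanos(17);
        System.out.println(deadline.remainingMillis(beforePoll)); // prints 3316
    }
}
```

With a budget of 3333 ms and 17 ms already spent, the remaining wait is 3316 ms, the same value that shows up in the poll call of the trace above.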
As you have hopefully seen in this article, failing to specify a connect timeout in your application can result in very long, non-interruptible blocks in case of network issues. If there is one thing that is guaranteed for network connections, it is that sooner or later connectivity problems will arise. It is best to prepare for them by specifying meaningful timeouts and not allowing network issues to bring your application down completely.
Notes
- You can trace the interaction between the JVM and the Linux kernel using strace. For example, the call strace -f -o strace.log -tt -T -s 1024 java -jar socket-timeouts*.jar 127.0.0.1 1234 0 0 will run the JVM, record all of the syscall activity of the JVM and all of its child processes, and write it in a human-readable format to strace.log
- You can use nc -l <port> to listen on a port
- You might be able to simulate a dropped SYN packet when connecting to 192.0.2.1
- I used this program available on GitHub to analyze the JVM socket timeouts
- For details you can check out the manual pages for the syscalls, e.g. man 2 connect
- The examples above are from JDK 1.8.0_40 running on 64-bit Ubuntu (kernel 3.16.0-30)