the critical 3%

Wednesday, November 22, 2017

Library Playground and Testable Contracts

In this post I will write about an idea we came up with while working on a hibernate upgrade of a medium sized monolithic project. I found it to be very useful when learning new libraries / technologies or upgrading them.

You may have already heard about learning tests. If you are familiar with them you can skip this paragraph. Imagine that you have a library that you need to use but are unfamiliar with. To learn it, you will probably read some documentation, try out examples in a trivial application, maybe do some research online what others are doing. You will repeat that until you are confident that you understand how the library works and are ready to use it in a real-world scenario. The central point of learning tests is to use a test framework to try stuff out instead of writing throw-away mini-applications. It requires a bit more work but you end up learning the library, getting better at writing unit tests and having an executable test suite that verifies certain aspects in an automated fashion. Great!

These learning tests can be helpful to other members of the team too. Developers can use them to learn the library by reading, executing and debugging the tests. The resulting test-suite will be fast to execute as it only has to setup whatever is required by this library and nothing else - learning a library for HTTP communication for example will require you to spin up a simple web server. It will still be probably much faster to do this instead of running your full blown application with a database and many more dependencies so that you can run a test that happens to use this library somewhere along the line.

You can promote the test-suite to an independent project. In a micro-service world you will probably be using this library in many micro-services and do not want to limit its usage to a single application. As it is only testing aspects of the library, it is agnostic to your application logic and can thus live independently. You can think of this independent test project as a playground where you can reproduce issues, try out new features and learn. The best part is that the cost for this has been pretty low so far - you just kept the by-product of your learning process that you would have otherwise deleted.

If you go one step further and improve the quality of those tests, carefully document and distill them, and really treat them as first class citizens, you will end up with a set of tests that verify your expectations of a library - something like a testable contract between your application and the library (well a real contract would require that both parties agree to the conditions - it is more like unwritten terms and conditions that the library has and which we have agreed to by using it). In this test-suite you will have tests for all major features that you use, probably in functionally relevant variations. You will also include tests for exotic features - anything that surprised you or you found non-trivial at the time. Any behaviour that has been discovered in production via a bug report should also be part of the test-suite. With the time it will grow and get refined.

Having such a testable contract will come in handy when upgrading. Now you can use your test-suite to get the upgrade process started. You change the library version and get an instant feedback from the compiler - all compilation problems are changes to the public API that you will need to understand and later rework in you actual project. What is more exciting are the test results. Failing tests indicate deviations from your contract - expectations that are not met by the library any more. You will have to understand each such change and later on identify the places in your application that use this feature and adapt them. You will be able to switch between versions, debug through the library code with the working and with the failing versions and search for an explanation for the change. There are always implicit expectations that you have but have not coded into tests - the tests do not give you a full-proof guarantee for a smooth upgrade but will get you very far very fast. Their major goal is to help you learn what has changed in the new version so that you can apply this knowledge to your applications afterwards. A red test is also much harder to miss than a sentence in a multi-page changelog of the library.

Back to the hibernate upgrade. Unfortunately there were no pre-existing learning tests that we could use, no playground to reproduce and debug stuff. In many cases we had to remotely debug a WebLogic server to find out what was going on. Soon we noticed that such a playground project would be very helpful to verify certain assumptions about how hibernate behaves. We created one that starts an embedded in-memory database and creates the schema on the fly. We defined dedicated entities with the required relationships for each test-case. Using a custom JDBC driver we wrapped the actual driver and were able to track the SQL statements executed by hibernate. We parsed those statements using the in-memory database facilities and wrote some assertion methods that could structurally compare SQL statements. Now we were in the position to execute some code and write expectations on what SQL statements get executed by hibernate behind the scene (ignoring the generated aliases).

Let me give you an example of one such test that turned out pretty important. As I am interested in non-functional aspects of systems, I wanted to know exactly what operations trigger loading of lazily loaded entities. One of the tests I wrote checked whether getting the primary key of a referenced ManyToOne entity (e.g. child.getParent().getId()) did trigger a load. It did and I was not expecting it because the entity on the many side already had a column containing the primary key of the referenced entity on the one-side, so there was no real need to actually load the whole referenced entity to just return the id. But hibernate still did it so I wrote a few assertions that the primary key getter does actually trigger a lazy load. We also noticed that we did use this side effect of the getter to make sure that lazily loaded entities got loaded before detaching them and passing them to other components. Well, that test paid off pretty fast as it showed us that the next minor hibernate upgrade (5.2.12) does not fulfil this expectation - with any future upgrade we have to be careful in places where we were calling the primary key getter to provoke a load (https://hibernate.atlassian.net/browse/HHH-11838)

After this particular experience I wrote some other playground / contract applications - so far they have had a pretty good return on investment and I hope that this post motivates you enough to try this out too. If you already have, please share your experience or best-practices as comment on this post at https://dimovelev.blogspot.com/2017/11/library-playground-and-testable.html.

Friday, November 17, 2017

A story about backwards compatibility

I have recently been involved in a massive WebLogic 12.2.1.3 and Hibernate upgrade. There are way too many things that went wrong and took days to figure out, but one I will remember for long time.

After all the tests were green, we fixed a few bugs that the tests missed and implemented workarounds for new hibernate bugs, it was time to deploy our work to a higher environment (pre-production). I did not have many concerns as we have deployed to a few environments without issues already, ran many tests and did all that was in our power to reduce the risk of anything going wrong.

Guess what - something went wrong. In the first possible instance - the deployment failed. We were not able to deploy our EAR files because they could not find the data source in JNDI. The root cause for this was an interruption / socket read timeout exception while opening connections in the data-pools. These problems were sporadic, hit different servers and different connection pools. Pretty weird because other pools were successfully initialised at the same time.

We knew that we have configured a 3 second connection timeout in the oracle JDBC URL connection string using the NETWORK_CONNECT_TIMEOUT=3 parameter. Three seconds for a data-center is a very long time for it to fail that often. Unfortunately, increasing the timeout to a much higher value was not a viable option because of other issues we have experienced in the past with RAC failover etc.

The team got bigger (because nine women can make a baby in one month) and we started looking in many directions - are the virtual machines overloaded, are the hypervisors overloaded, do we have issues on the database, is it network, we captured network traffic, looked through endless TCP streams (btw, now I know more about oracle ONS than before), restarted the whole environment, restarted database services, increased the entropy by running rngd, changed java to use urandom (by working around an age old bug and specifying the entropy source as /dev/./urandom). Nothing helped and we were not getting anywhere.

Then I used an old and trivial java program that just connects to the Oracle database via JDBC, executes a configurable SQL statement a given number of times with a configurable concurrency and shuts down. In this case I configured it to execute the classical health-check statement "select 1 from dual" with 1 thread. I started it in a loop and voila - after a while an invocation failed with a socket timeout exception. Excellent, the problem was not in WebLogic 12.2.1.3! We were relieved - it was an environment issue that must be addressed but is nothing that should stop us from proceeding to production. I turned off my computer.

After putting the baby to sleep I decided to look through the logs again because I did not want to miss something and risk issues in production. After a while I noticed something that looked suspicious - in the log where I was executing the java program in a loop I noticed that the timestamps of the two successful executions that surrounded the failure, had around 2 seconds of time difference. So how can we exceed a timeout of 3 seconds for less than 2 seconds wall clock time? It just isn't possible (ignoring anomalies like time synchronisation, leap-seconds and such). After eliminating some other possibilities (or some that we have already eliminated before that point, like ONS/FAN) I came to the conclusion that the time unit must be wrong. A search in the web lead me to a documentation on an Oracle site that, among million other things, had a tiny note that the unit of NETWORK_CONNECT_TIMEOUT changed from seconds to milliseconds (with ojdbc 12.2.x) and if you want to keep the old behaviour you have to apply a patch!

So a commercial library used by a ton of people changes the unit of a parameter and hides the information in a note somewhere.

Let that sink for a second! Or is it a millisecond...

3 things went wrong during the development of the ojdbc driver:

They defined a property which contains a duration without including the unit in the name of the property. That might look redundant (e.g. NETWORK_CONNECT_TIMEOUT_SEC) but many many people will thank you later. So the rule is always include the unit in the name of duration properties.
Initially they decided to go for a seconds unit - that is way to coarse for such a low-level timeout. I would always go for milliseconds unless it makes absolutely no sense and then probably still do it.
They changed the API without any reason. If you need to have a millisecond timeout, then just introduce a new property called NETWORK_CONNECT_TIMEOUT_MS. Keep the old one in the interface and map to milliseconds. Fail if both are defined. It is that easy. If you want to reuse the timeout, well, provide a patch that changes to milliseconds and keep the default version to be backwards compatible.

Note that if they had followed the first rule (and named the property NETWORK_CONNECT_TIMEOUT_SEC in the first place) they would have never decided to change the unit to milliseconds and would have been forced to introduce a new property thus causing much less WTFs / day.

I hope this helps the next victim of the NETWORK_CONNECT_TIMEOUT oracle jdbc8 12.2 issue.

Originally posted at https://dimovelev.blogspot.com/2017/11/a-story-about-backwards-compatibility.html.

Sunday, February 14, 2016

WebLogic Remote EJB Calls Causing Security Exceptions

I stumbled upon a strange behavior (SecurityException-s thrown) in WebLogic 12c (and probably older releases) when calling remote EJBs related to the usage of InitialContext. This article summarizes my findings.

While processing a request, the WebLogic application server manages a stack of users. The last user pushed on the stack is the user that is currently active - calls to further EJBs will automatically run as this user. When creating a new InitialContext with a principal, WebLogic pushes it on the stack and the principal provided becomes the active user. It remains active until the InitialContext is closed (or the request processing finishes). This behavior paves the way for very unpleasant situations that I will try to explain based on the following test setup.

We have a JAX-WS web-service packaged in an EJB3 application invoking local and remote EJB2 beans after looking them up in the respective JNDI trees. This is shown in the sequence diagram below:

1. A consumer invokes the web-service. We have secured the web-service using WebLogic security configuration which requires HTTP basic authentication and checks the user against the users available in its realm. If the authentication succeeds, the FacadeJaxWs is executed as the authenticated user. In this example I will call the user ws_user.
2. FacadeJaxWs uses JNDI (with URL, principal backend_user and their credentials) to lookup the remote home interface of the BackendEjb EJB2 bean (hosted in a different domain), create the EJB stub and invoke a method on it.
3. The BackendEjb processes the request. Depending on what exactly this EJB does we might run into problems as described later.
5. FacadeJaxWs uses JNDI (without URL and credentials) to lookup the remote home interface of a BackendEjb EJB2 bean, collocated on the same managed server in a different EAR. As in the Backend case it creates a remote stub and invokes a method on it
6. FrontendEjb processes the request. Depending on what exactly this EJB does, we might run into further problems as described later

Depending on when and if at all the FacadeJaxWs closes the initial context when invoking the BackendEjb, a security exception can be thrown in different places.

Variant 1 - FacadeJaxWs does not close the InitialContext at all

This seems to be the usual thing a developer does. This is depicted in the code below:

With this code we have laid a mine - if the processInFrontend method does not call any further EJBs, the code will work without problems. If, however, the processInFrontend method tries to invoke another (local) EJB2 using the standard IntialContext mechanism, we will get an exception like this:

java.lang.SecurityException: [Security:090398]Invalid Subject: principals=[backend_user]

As you can imagine, this is extremely unpleasant, as it will not cause any problems (the mine will lay dormant) until another developer comes in and extends the processInFrontend method to call another EJB. Then the mine will go off and will catch the developer by surprise - something that they have done many times (calling an EJB from another EJB using InitialContext lookup) suddenly does not work in that specific situation. With the WebLogic JNDI knowledge from above, there is a good explanation for that - line 13 pushes the backend_user on the user stack and it becomes the active user. Any new EJB invokations will run as this user which is not really known in this domain. Note that using @EJB to inject the FrontendEjbHome somehow worsens the situation, as the JNDI lookup is done much earlier (with the correct user) and we do not run into the problem in all cases.

Variant 2 - FacadeJaxWs closes the InitialContext before invoking the actual remote EJB method

This variant is depicted by the code below:

In this variant we try to close the InitialContext right after looking up the remote home interface of the EJB - we might have an utility class to lookup, narrow and return remote home interfaces and close the initial context right after that. Unfortunately, this will not work and will cause another security exception like this:

java.lang.SecurityException: [Security:090398]Invalid Subject: principals=[ws_user]

This exception is actually thrown by the Backend managed server when we try to create the remote stub using the remote home interface on line 16. This is a much better situation than the previous variant, as we will always run into the exception (note: it does not seem to happen in EJB2 code). The explanation also comes easy - closing the context on line 15, pops the backend_user from the stack and the active user again becomes the ws_user. This user, however, is not known in the backend domain and thus the exception is thrown.

Variant 3 - FacadeJaxWs closes the InitialContext in a finally block after invoking the remote EJB method

Finally we get to the correct usage of InitialContext - we close it after invoking the actual business method like this:

This time we call the business method right after creating the initial context - this allows for the method to run in the context of the backend_user. Only after we have finished our interactions with the backend EJB, do we close the context (on line 21), switching back to the ws_user and executing the FrontendEjb as a user that is known in its WebLogic domain.

Thursday, April 23, 2015

Performance Pitfalls: Inheriting JEE Singletons

In a previous blog post I wrote about the fact that JEE singleton EJBs have a very restrictive default lock mode - each method requires a WRITE lock which effectively allows for at most one method of the singleton to run at the same time.

Today I stumbled upon another gotcha: inherited methods are not affected by the default lock type set on the subclass. Consider the following code:

abstract class Parent {
public void methodA() {}
}

@Singleton
@Lock(LockType.READ)
class Subclass extends Parent {
public void methodB() {}
}

We have an abstract base class which implements some common functionality across multiple singletons (represented by the methodA()method). It is implemented as a POJO without any JEE annotations. Then we have the actual singleton which inherits from the base class and defines a default locking type READ. In our case, the developers were actually expecting that all methods, including those inherited from the Parent class, would have a READ lock type. Unfortunately, that was not the case. According to the EJB 3.1 specification, the methods inherited from the Parent class have a lock type as defined in the Parent class - if not specified then it falls back to the default WRITE lock type (page 113):

A concurrency locking attribute specified on a superclass S appplies to the business methods defined by S. If a class-level concurrency attribute is not specified on S, it is equivalent to specification of Lock(WRITE) on S.

In the example above we observed exactly that behavior. As a result, a thread running a method from the Parent class was effectively blocking any other thread from running any other method on the same singleton. In our case the situation was exacerbated by methods blocking on a slow external system.

The solution for us was easy as we did not really require any synchronization - switching to bean managed synchronization fixed the problem:

@Singleton
@ConcurrencyManagement(ConcurrencyManagementType.BEAN)
class Subclass extends Parent {
public void methodB() {}
}

This time again we would have saved us some work if we switched to bean managed synchronization right away. I hope that this blog may help you work around similar issues in your applications.

Monday, March 30, 2015

Performance Tuning: java socket timeout

In my previous post I wrote about the socket connect timeout. In this one I am going to write about the other low-level socket timeout - the SO_TIMEOUT. Although most java programmers are probably never going to use the socket API directly, these timeouts are often configurable for higher-level protocols and must be well understood.

The socket timeout (SO_TIMEOUT) is the maximal amount of time that the application is willing to wait for a single read call on the socket to complete.

This timeout only limits the duration of a single invocation of the read method. Consider a higher-level protocol in which your application needs to read a whole line from the TCP connection. For this you could wrap the socket's input stream into a java.io.BufferedReader. When you read a line from the buffered reader, it will in turn repeatedly call read on the socket and gather the received bytes until a new line has been detected and then return the whole line at once. For a line of 10 bytes (including the new-line) this could translate to up to 10 calls of the read method. Provided that you have configured the socket timeout, this would translate to an overall duration of 10 times the socket timeout - assuming that each read call returns 1 byte just before the timeout has been reached. As you see, the socket timeout is not well-suited to set timeouts for higher-level protocols. However, without that timeout you will not be able to react to other timeouts.

Let us first see what happens under the hood when no timeout is set - based on the following simple java code:

1. Socket socket = new Socket();
...
2. socket.getInputStream().read();

This translates to the following syscalls on a linux machine:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
...
2. recvfrom(6, <unfinished ...>
3. <... recvfrom resumed> "Hello back\n", 8192, 0, NULL, NULL) = 11 <4.823417>

On line 2 we see that the JVM invokes the recvfrom syscall on the socket. This is a blocking operation and will only return when data has been received or an error has been detected. In this case, we have received data after 4.823417 seconds. Unfortunately, this call could potentially block forever and in turn block the calling java thread until the JVM has been restarted (calling Thread.interrupt() from within the JVM will not help). The situation might sound like something very exotic but is in practice quite frequent when complex network infrastructure is involved. Imagine that your application has successfully established a connection and now wants to fetch data from the other side. If you forcefully restart a firewall on the way (not giving it a chance to send RST (reset) packets to all connected parties to allow them to gracefully close their connections) or physically terminate the connection and then restart the other party or do a similar thing that causes it to forget about the connection, you will end up in a state known as a half-open connection. A connection can remain in this state forever and so will your read method and in turn the thread that has called it. There will be no way to recover from this without a restart of the JVM.

In order to prevent such scenarios, it is always a good idea to set the socket timeout (e.g. 4444 milliseconds) like this:

1. Socket socket = new Socket();
2. socket.setSoTimeout(4444);
3. socket.getInputStream().read();

If you do that, the JVM will always first poll the socket for data with this timeout and only call the blocking recvfrom syscall if data is available (and the blocking recvfrom call will return very fast in this case). The trace for this code looks like this:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000078>
...
2. poll([{fd=6, events=POLLIN|POLLERR}], 1, 4444 <unfinished ...>
3. <... poll resumed> ) = 1 ([{fd=6, revents=POLLIN}]) <1.265994>
4. recvfrom(6, "Hi\n", 8192, 0, NULL, NULL) = 3 <0.000155>
We see that in this case the JVM first invokes the poll syscall with the SO_TIMEOUT (line 2) and calls the recvfrom syscall only after the poll returned that there is data to fetch (line 3) . In case of a half-open connection, the poll syscall would return a timeout after around 4444 ms and give the control back to the JVM. It in turn will raise a java.net.SocketTimoutException to inform your thread about the timeout:

1. poll([{fd=6, events=POLLIN|POLLERR}], 1, 4444 <unfinished ...>
2. <... poll resumed> ) = 0 (Timeout) <4.448941>

On line 2 we see that the poll syscall returns a timeout code after 4.448941 seconds. Your application can handle the timeout anyway it wants - either repeating the call (giving the other side some more time to respond) or closing the connection.

In this article we have seen that reading from a TCP socket without setting the socket timeout might lead to java threads being blocked forever. Although the SO_TIMEOUT cannot be used as a substitute for higher-level protocol timeouts (e.g. the whole HTTP response must be received within 4 seconds) it still allows the JVM to take back control from time to time and decide whether to repeat the read call or give up if a higher-level protocol timeout has been reached. If there is one thing that you should remember from this post, it is that you should always set the socket timeout.

Monday, March 23, 2015

Performance Tuning: java socket connect timeout

Timeouts are one of the most underrated features among java developers - they add some complexity to the code without paying off with business functionality. They also force us to ask ourselves unpleasant questions and consider uncommon situations. The failure to correctly understand and use them often results in major availability/performance issues in production systems. This is the first of a series of blog posts covering some of the more common timeouts in the java world. Using a bottom-up approach, I will start with the low-level stream socket API and its connect timeout.

The connect timeout is the maximal amount of time that your program is willing to spend waiting for a TCP/IP connection to be established.

TCP establishes connections using a three-way handshake mechanism - the initiator of the connection sends a SYN packet, waits for a SYN/ACK packet from the other party and responds with an ACK. For the initiator, the connection is considered established as soon as it has received the SYN/ACK packet and has sent an ACK back (sending the ACK is however not really blocking).
The linux network stack re-sends SYN packets if it does not receive corresponding SYN/ACKs on time. The time it waits for a SYN/ACK is doubled every time (aka exponential back-off: 1, 2, 4, 8, ...). This process is repeated up to a threshold value specified in /proc/sys/net/ipv4/tcp_syn_retries. For a value of 6, the network stack will send 6 SYN packets before finally giving up: the first one right away, the second after 1 second, the third packet two seconds later, and so on until it sends the sixth SYN packet. After waiting for 64 more seconds, it will give up and return a code indicating a timeout. In this case it will take a total of 127 seconds (1+2+4+8+16+32+64 seconds).

Java offers overloaded versions of the java.net.Socket#connect() call - one with a timeout parameter and one without:

Socket socket = new Socket();
InetSocketAddress addr = new InetSocketAddress(host, port)
socket.connect(addr); // without a timeout
socket.connect(addr, 3333); // with a timeout of 3333 milliseconds

The behavior without a timeout depends on the OS network stack. As we have seen above, in case of my linux it took 127 seconds until the kernel indicated a timeout to the JVM and it in turn raised a java.net.ConnectException. During that time, the thread that called the connect method would block and it would be impossible to interrupt it from within the JVM - Thread.interrupt() will have no effect. For many applications that value is probably way too high and might cause your application to reserve resources for long time.

A naive approach to provoke this scenario would be to try to connect to the wrong port / ip-address. In most cases it will not take long until you get a java.net.ConnectException: Connection refused. The reason for this is that the network stack on the other side is polite and responds to your SYN packet with a RST (reset) packet - this is the TCP/IP way of saying that there is no process listening on this port. In real life, however, there is a multitude of issues that might cause SYN packets to get dropped (it might be a paranoid firewall somewhere on the way to the other party that just drops your packets without responding in any way, an overloaded switch / router / firewall, misconfiguration in routing, ...). In such cases your program will end up blocking for longer periods of time.
You might be able to reproduce this behavior using one of the IP addresses defined in RFC-5737 - e.g. 192.0.2.1 - this address is not routed and thus no response will be received. Alternatively, you could configure iptables to drop SYN packets to the target port to test how your application reacts to this situation.

Now lets have a look at the overloaded version of the connect method with a timeout. In this case, you instruct the OS to override the default behavior and make sure that it does not wait much longer than your timeout parameter. In this case, the network stack behaves as above but the JVM is given the chance to act after the specified timeout is elapsed.

You can find out how that works on a linux host by recording the interaction between the JVM and the kernel in terms of invoked system calls (aka syscalls). This can be achieved with strace. Consider the following trace snippets:

Without a timeout
1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 <0.000156>

On line 2, the JVM invokes the connect syscall without first switching the socket to non-blocking mode. The call will block until the network stack establishes a connection or gives up with an error. In this case we see that it returns 0 (after just 156 microseconds) - meaning that the connection has been successfully established. A failure (the network stack gave up after 127 seconds) could look like this:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28 <unfinished ...>
3. <... connect resumed> ) = -1 ETIMEDOUT (Connection timed out) <127.215560>
Line 3 is the actual return from the connect syscall invoked on line 2 after 127.215560 seconds.

With a timeout:
1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000078>
2. fcntl(6, F_GETFL) = 0x2 (flags O_RDWR) <0.000090>
3. fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000078>
4. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress) <0.000170>
5. poll([{fd=6, events=POLLOUT}], 1, 3316) = 1 ([{fd=6, revents=POLLOUT}]) <0.000128>

On lines 2 and 3, the JVM switches the socket to non-blocking mode using the fcntl syscall. This causes the connect syscall on the next line to return immediately (well after 170 microseconds) with -1, indicating that the operation is in progress and that the JVM has to check the status of the operation (using poll) actively. On line 5, the JVM invokes the poll syscall telling it to wait up to 3316 milliseconds. In our case this call returns 1 after 128 microseconds, indicating that the connection has been in fact established successfully. It might also return 0 to indicate a timeout (after roughly <timeout> milliseconds), in which case the JVM takes over again and reports the problem by raising a java.net.SocketTimeoutException and giving the control back to the java program. BTW, I noticed that the timeout passed to the poll syscall is always around 10-20 ms less than the one passed to the java connect method. My naive explanation is that the JVM starts counting down the connection duration sometime before the call to poll and passes the difference to poll (in this case I called the connect method with 3333 ms, so I guess it started the connection duration measurement 17 ms before invoking poll). In case of a timeout the trace would look like this (poll returned after 3.317667 seconds):
5. poll([{fd=6, events=POLLOUT}], 1, 3314 <unfinished ...>
6. <... poll resumed> ) = 0 (Timeout) <3.317667>

As you have hopefully seen in this article, failing to specify a connect timeout in your application will result in very long non-interruptible blocks in case of network issues. If there is one thing that is guaranteed for network connections is that sooner or later connectivity problems will arise. It is best to prepare for them by specifying meaningful timeouts and not allowing network issues to completely bring your application down.

Notes

You can trace the interaction between the JVM and the linux kernel using strace. For example the call strace -f -o strace.log -tt -T -s 1024 java -jar socket-timeouts*.jar 127.0.0.1 1234 0 0 will run the JVM and record all of the syscall activity of the JVM and all of its child processes and write them in a human-readable format in strace.log
You can use nc -l <port> to listen on a port
You might be able to simulate a dropped SYN packet when connecting to 192.0.2.1
I used this program available on github to analyze the JVM socket timeouts
For details you can check out the manual pages for the syscalls, eg. man 2 connect
The examples above are from a JDK 1.8.0_40 running on a 64-bit Ubuntu 3.16.0-30

Sunday, March 15, 2015

Performance Pitfalls: commons-configuration

commons-configuration is a java library that offers a common API to access configuration data from a variety of sources like properties files, XML files, etc. It is a bit outdated but still a convenient way to make certain aspects of an application configurable. One of its great features is that it can be instructed (configured) to reload the configuration automatically. For file based configuration sources, it can check for changes in the underlying files and trigger a reload if necessary. This is pretty handy for test and development (maybe even production) environments as you can make configuration changes without restarting your application.

Unfortunately, all those nice features come at a great cost. Let's have a look at the base class for file-based configuration files - the AbstractFileConfiguration. Access to every configuration item goes through the getProperty method which looks like this:

public Object getProperty(String key) {
    synchronized (reloadLock) {
        reload();
        return super.getProperty(key);
    }
}

The implementation first acquires the reloadLock, then it calls the reload method to check whether a reload is necessary and perform it if it is. It then returns the actual property's value. This lock easily becomes a major bottleneck - each and every thread will try to acquire the same lock whenever any property (from the same configuration source) is being read. It is important to note that this behaviour does not depend on the actual reloading strategy and reloading intervals. See also this bug report describing the issue.

Apart from the performance aspects, this implementation might also result in inconsistent configuration being returned during a reload. Lets say your application calls an external service that requires authentication. Lets assume that you store the username and password in a configuration file and use commons-configuration to fetch them every time they are required. Lets say you have agreed on new credentials with the service provider and you are given a few hours to change them. In that period both the old and the new credentials are valid. Now imagine you change the credentials in the file. As we might potentially reload the configuration every time a property is read, you could end up reading the old username and just a moment later reading the new password within the same business operation. This would result in inconsistent credentials and in this case will cause the external service call to fail.

How could this have been done better? Well, to address both issues, one could have designed the library in such a way, that the user is forced to first get the current, read-only snapshot of the whole configuration and then query any number of configuration items from this snapshot. Obviously, reading from a read-only object does not require any synchronization and it would prevent inconsistent values being read as the snapshot is updated only as a whole. The user of this library would be advised to fetch a snapshot of the configuration when she starts processing a business operation and reuse it throughout the business operation. Additionally, one would reload the configuration in the background. If the background job decides that a new configuration is available, it would take all the time it needs to load the new configuration and store it in a new read-only snapshot. Then it would have to use some kind of synchronization (e.g. read-write locks, a volatile field at the very least) to switch the current snapshot with the new one. This operation would only happen if the configuration has actually changed and it would be a fast operation as it would only consist of changing a pointer.

Another issue arises when the same configuration is shared among multiple machines. It is not really possible to guarantee that the new configuration will be loaded at the same point in time. This could be addressed if one would store a validity period for each configuration source. The new configuration is then just appended (the old one has to stick around for a while longer in case servers are restarted before the new one kicks in) to the current configuration with a validity period in the future. This can be done safely in advance and when the time comes (provided that all machines have synchronized clocks), the changes will become visible at the same point in time.

To sum up, my advice is - do not use commons-configuration in any performance critical execution path - at least until version 2 is released and it has been verified that it does not have any major locking issues.