Thursday, April 23, 2015

Performance Pitfalls: Inheriting JEE Singletons

In a previous blog post I wrote about the fact that JEE singleton EJBs have a very restrictive default lock mode - each method requires a WRITE lock which effectively allows for at most one method of the singleton to run at the same time.

Today I stumbled upon another gotcha: inherited methods are not affected by the default lock type set on the subclass. Consider the following code:

abstract class Parent {
  public void methodA() {}

class Subclass extends Parent {
  public void methodB() {}

We have an abstract base class which implements some common functionality across multiple singletons (represented by the methodA()method). It is implemented as a POJO without any JEE annotations. Then we have the actual singleton which inherits from the base class and defines a default locking type READ. In our case, the developers were actually expecting that all methods, including those inherited from the Parent class, would have a READ lock type. Unfortunately, that was not the case. According to the EJB 3.1 specification, the methods inherited from the Parent class have a lock type as defined in the Parent class - if not specified then it falls back to the default WRITE lock type (page 113):
A concurrency locking attribute specified on a superclass S appplies to the business methods defined by S. If a class-level concurrency attribute is not specified on S, it is equivalent to specification of Lock(WRITE) on S.
In the example above we observed exactly that behavior. As a result, a thread running a method from the Parent class was effectively blocking any other thread from running any other method on the same singleton. In our case the situation was exacerbated by methods blocking on a slow external system.

The solution for us was easy as we did not really require any synchronization - switching to bean managed synchronization fixed the problem:

class Subclass extends Parent {
  public void methodB() {}

This time again we would have saved us some work if we switched to bean managed synchronization right away. I hope that this blog may help you work around similar issues in your applications.

Monday, March 30, 2015

Performance Tuning: java socket timeout

In my previous post I wrote about the socket connect timeout. In this one I am going to write about the other low-level socket timeout - the SO_TIMEOUT. Although most java programmers are probably never going to use the socket API directly, these timeouts are often configurable for higher-level protocols and must be well understood.

The socket timeout (SO_TIMEOUT) is the maximal amount of time that the application is willing to wait for a single read call on the socket to complete.

This timeout only limits the duration of a single invocation of the read method. Consider a higher-level protocol in which your application needs to read a whole line from the TCP connection. For this you could wrap the socket's input stream into a When you read a line from the buffered reader, it will in turn repeatedly call read on the socket and gather the received bytes until a new line has been detected and then return the whole line at once. For a line of 10 bytes (including the new-line) this could translate to up to 10 calls of the read method. Provided that you have configured the socket timeout, this would translate to an overall duration of 10 times the socket timeout - assuming that  each read call returns 1 byte just before the timeout has been reached. As you see, the socket timeout is not well-suited to set timeouts for higher-level protocols. However, without that timeout you will not be able to react to other timeouts.

Let us first see what happens under the hood when no timeout is set - based on the following simple java code:

1. Socket socket = new Socket();
2. socket.getInputStream().read();

This translates to the following syscalls on a linux machine:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. recvfrom(6,  <unfinished ...>
3.  <... recvfrom resumed> "Hello back\n", 8192, 0, NULL, NULL) = 11 <4.823417>

On line 2 we see that the JVM invokes the recvfrom syscall on the socket. This is a blocking operation and will only return when data has been received or an error has been detected. In this case, we have received data after 4.823417 seconds. Unfortunately, this call could potentially block forever and in turn block the calling java thread until the JVM has been restarted (calling Thread.interrupt() from within the JVM will not help). The situation might sound like something very exotic but is in practice quite frequent when complex network infrastructure is involved. Imagine that your application has successfully established a connection and now wants to fetch data from the other side. If you forcefully restart a firewall on the way (not giving it a chance to send RST (reset) packets to all connected parties to allow them to gracefully close their connections) or physically terminate the connection and then restart the other party or do a similar thing that causes it to forget about the connection, you will end up in a state known as a half-open connection. A connection can remain in this state forever and so will your read method and in turn the thread that has called it. There will be no way to recover from this without a restart of the JVM.

In order to prevent such scenarios, it is always a good idea to set the socket timeout (e.g. 4444 milliseconds) like this:

1. Socket socket = new Socket();
2. socket.setSoTimeout(4444);
3. socket.getInputStream().read();

If you do that, the JVM will always first poll the socket for data with this timeout and only call the blocking recvfrom syscall if data is available (and the blocking recvfrom call will return very fast in this case). The trace for this code looks like this:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000078>
2. poll([{fd=6, events=POLLIN|POLLERR}], 1, 4444 <unfinished ...>
3. <... poll resumed> ) = 1 ([{fd=6, revents=POLLIN}]) <1.265994>
4. recvfrom(6, "Hi\n", 8192, 0, NULL, NULL) = 3 <0.000155>

We see that in this case the JVM first invokes the poll syscall with the SO_TIMEOUT (line 2) and calls the recvfrom syscall only after the poll returned that there is data to fetch (line 3) . In case of a half-open connection, the poll syscall would return a timeout after around 4444 ms and give the control back to the JVM. It in turn will raise a to inform your thread about the timeout:

1. poll([{fd=6, events=POLLIN|POLLERR}], 1, 4444 <unfinished ...>
2. <... poll resumed> ) = 0 (Timeout) <4.448941>

On line 2 we see that the poll syscall returns a timeout code after 4.448941 seconds. Your application can handle the timeout anyway it wants - either repeating the call (giving the other side some more time to respond) or closing the connection.

In this article we have seen that reading from a TCP socket without setting the socket timeout might lead to java threads being blocked forever. Although the SO_TIMEOUT cannot be used as a substitute for higher-level protocol timeouts (e.g. the whole HTTP response must be received within 4 seconds) it still allows the JVM to take back control from time to time and decide whether to repeat the read call or give up if a higher-level protocol timeout has been reached. If there is one thing that you should remember from this post, it is that you should always set the socket timeout.

Monday, March 23, 2015

Performance Tuning: java socket connect timeout

Timeouts are one of the most underrated features among java developers - they add some complexity to the code without paying off with business functionality. They also force us to ask ourselves unpleasant questions and consider uncommon situations. The failure to correctly understand and use them often results in major availability/performance issues in production systems. This is the first of a series of blog posts covering some of the more common timeouts in the java world. Using a bottom-up approach, I will start with the low-level stream socket API and its connect timeout.

The connect timeout is the maximal amount of time that your program is willing to spend waiting for a TCP/IP connection to be established.

TCP establishes connections using a three-way handshake mechanism - the initiator of the connection sends a SYN packet, waits for a SYN/ACK packet from the other party and responds with an ACK. For the initiator, the connection is considered established as soon as it has received the SYN/ACK packet and has sent an ACK back (sending the ACK is however not really blocking).
The linux network stack re-sends SYN packets if it does not receive corresponding SYN/ACKs on time. The time it waits for a SYN/ACK is doubled every time (aka exponential back-off: 1, 2, 4, 8, ...).  This process is repeated up to a threshold value specified in /proc/sys/net/ipv4/tcp_syn_retries. For a value of 6, the network stack will send 6 SYN packets before finally giving up: the first one right away, the second after 1 second, the third packet two seconds later, and so on until it sends the sixth SYN packet. After waiting for 64 more seconds, it will give up and return a code indicating a timeout. In this case it will take a total of 127 seconds (1+2+4+8+16+32+64 seconds).

Java offers overloaded versions of the call - one with a timeout parameter and one without:

Socket socket = new Socket();
InetSocketAddress addr = new InetSocketAddress(host, port)
socket.connect(addr); // without a timeout
socket.connect(addr, 3333); // with a timeout of 3333 milliseconds

The behavior without a timeout depends on the OS network stack. As we have seen above, in case of my linux it took 127 seconds until the kernel indicated a timeout to the JVM and it in turn raised a During that time, the thread that called the connect method would block and it would be impossible to interrupt it from within the JVM - Thread.interrupt() will have no effect. For many applications that value is probably way too high and might cause your application to reserve resources for long time.

A naive approach to provoke this scenario would be to try to connect to the wrong port / ip-address. In most cases it will not take long until you get a Connection refused. The reason for this is that the network stack on the other side is polite and responds to your SYN packet with a RST (reset) packet - this is the TCP/IP way of saying that there is no process listening on this port. In real life, however, there is a multitude of issues that might cause SYN packets to get dropped (it might be a paranoid firewall somewhere on the way to the other party that just drops your packets without responding in any way, an overloaded switch / router / firewall, misconfiguration in routing, ...). In such cases your program will end up blocking for longer periods of time.
You might be able to reproduce this behavior using one of the IP addresses defined in RFC-5737 - e.g. - this address is not routed and thus no response will be received. Alternatively, you could configure iptables to drop SYN packets to the target port to test how your application reacts to this situation.

Now lets have a look at the overloaded version of the connect method with a timeout. In this case, you instruct the OS to override the default behavior and make sure that it does not wait much longer than your timeout parameter. In this case, the network stack behaves as above but the JVM is given the chance to act after the specified timeout is elapsed.

You can find out how that works on a linux host by recording the interaction between the JVM and the kernel in terms of invoked system calls (aka syscalls). This can be achieved with strace. Consider the following trace snippets:

Without a timeout
1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 <0.000156>

On line 2, the JVM invokes the connect syscall without first switching the socket to non-blocking mode. The call will block until the network stack establishes a connection or gives up with an error. In this case we see that it returns 0 (after just 156 microseconds) - meaning that the connection has been successfully established. A failure (the network stack gave up after 127 seconds) could look like this:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28 <unfinished ...>
3. <... connect resumed> ) = -1 ETIMEDOUT (Connection timed out) <127.215560>
Line 3 is the actual return from the connect syscall invoked on line 2 after 127.215560 seconds.

With a timeout:
1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000078>
2. fcntl(6, F_GETFL) = 0x2 (flags O_RDWR) <0.000090>
3. fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000078>
4. connect(6, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::ffff:", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress) <0.000170>
5. poll([{fd=6, events=POLLOUT}], 1, 3316) = 1 ([{fd=6, revents=POLLOUT}]) <0.000128>

On lines 2 and 3, the JVM switches the socket to non-blocking mode using the fcntl syscall. This causes the connect syscall on the next line to return immediately (well after 170 microseconds) with -1, indicating that the operation is in progress and that the JVM has to check the status of the operation (using poll) actively. On line 5, the JVM invokes the poll syscall telling it to wait up to 3316 milliseconds. In our case this call returns 1 after 128 microseconds, indicating that the connection has been in fact established successfully. It might also return 0 to indicate a timeout (after roughly <timeout> milliseconds), in which case the JVM takes over again and reports the problem by raising a and giving the control back to the java program. BTW, I noticed that the timeout passed to the poll syscall is always around 10-20 ms less than the one passed to the java connect method. My naive explanation is that the JVM starts counting down the connection duration sometime before the call to poll and passes the difference to poll (in this case I called the connect method with 3333 ms, so I guess it started the connection duration measurement 17 ms before invoking poll). In case of a timeout the trace would look like this (poll returned after 3.317667 seconds):
5. poll([{fd=6, events=POLLOUT}], 1, 3314 <unfinished ...>
6. <... poll resumed> ) = 0 (Timeout) <3.317667>

As you have hopefully seen in this article, failing to specify a connect timeout in your application will result in very long non-interruptible blocks in case of network issues. If there is one thing that is guaranteed for network connections is that sooner or later connectivity problems will arise. It is best to prepare for them by specifying meaningful timeouts and not allowing network issues to completely bring your application down.

  • You can trace the interaction between the JVM and the linux kernel using strace. For example the call strace -f -o strace.log -tt -T -s 1024 java -jar socket-timeouts*.jar 1234 0 0 will run the JVM and record all of the syscall activity of the JVM and all of its child processes and write them in a human-readable format in strace.log 
  • You can use nc -l <port> to listen on a port
  • You might be able to simulate a dropped SYN packet when connecting to
  • I used this program available on github to analyze the JVM socket timeouts
  • For details you can check out the manual pages for the syscalls, eg. man 2 connect
  • The examples above are from a JDK 1.8.0_40 running on a 64-bit Ubuntu 3.16.0-30

Sunday, March 15, 2015

Performance Pitfalls: commons-configuration

commons-configuration is a java library that offers a common API to access configuration data from a variety of sources like properties files, XML files, etc. It is a bit outdated but still a convenient way to make certain aspects of an application configurable. One of its great features is that it can be instructed (configured) to reload the configuration automatically. For file based configuration sources, it can check for changes in the underlying files and trigger a reload if necessary. This is pretty handy for test and development (maybe even production) environments as you can make configuration changes without restarting your application.

Unfortunately, all those nice features come at a great cost. Let's have a look at the base class for file-based configuration files - the AbstractFileConfiguration. Access to every configuration item goes through the getProperty method which looks like this:

public Object getProperty(String key) {
    synchronized (reloadLock) {
        return super.getProperty(key);

The implementation first acquires the reloadLock, then it calls the reload method to check whether a reload is necessary and perform it if it is. It then returns the actual property's value. This lock easily becomes a major bottleneck - each and every thread will try to acquire the same lock whenever any property (from the same configuration source) is being read. It is important to note that this behaviour does not depend on the actual reloading strategy and reloading intervals. See also this bug report describing the issue.

Apart from the performance aspects, this implementation might also result in inconsistent configuration being returned during a reload. Lets say your application calls an external service that requires authentication. Lets assume that you store the username and password in a configuration file and use commons-configuration to fetch them every time they are required. Lets say you have agreed on new credentials with the service provider and you are given a few hours to change them. In that period both the old and the new credentials are valid. Now imagine you change the credentials in the file. As we might potentially reload the configuration every time a property is read, you could end up reading the old username and just a moment later reading the new password within the same business operation. This would result in inconsistent credentials and in this case will cause the external service call to fail.

How could this have been done better? Well, to address both issues, one could have designed the library in such a way, that the user is forced to first get the current, read-only snapshot of the whole configuration and then query any number of configuration items from this snapshot. Obviously, reading from a read-only object does not require any synchronization and it would prevent inconsistent values being read as the snapshot is updated only as a whole. The user of this library would be advised to fetch a snapshot of the configuration when she starts processing a business operation and reuse it throughout the business operation. Additionally, one would reload the configuration in the background. If the background job decides that a new configuration is available, it would take all the time it needs to load the new configuration and store it in a new read-only snapshot. Then it would have to use some kind of synchronization (e.g. read-write locks, a volatile field at the very least) to switch the current snapshot with the new one. This operation would only happen if the configuration has actually changed and it would be a fast operation as it would only consist of changing a pointer.

Another issue arises when the same configuration is shared among multiple machines. It is not really possible to guarantee that the new configuration will be loaded at the same point in time. This could be addressed if one would store a validity period for each configuration source. The new configuration is then just appended (the old one has to stick around for a while longer in case servers are restarted before the new one kicks in) to the current configuration with a validity period in the future. This can be done safely in advance and when the time comes (provided that all machines have synchronized clocks), the changes will become visible at the same point in time.

To sum up, my advice is - do not use commons-configuration in any performance critical execution path - at least until version 2 is released and it has been verified that it does not have any major locking issues.

Monday, March 9, 2015

Performance Pitfalls: JEE Singletons

JEE singletons are session beans which are guaranteed to exist at most once per application/JVM combination. They were introduced with EJB 3.1 as a replacement for the "normal" java singletons in a JEE context.

The java pattern for singletons relies on a private static field holding the single instance of the class. The static field is accessible through a static method. The field is either eagerly initialized in a static block or lazily initialized upon first invocation of the method. In order to guarantee that there is only one single instance, the method is usually synchronized. In high concurrency scenario you might use double checked locking and not class level synchronization.

Due to the fact that static variables are stored in the context of a class, such singletons are only guaranteed to be unique within the same class-loader. If you have a JEE application consisting of multiple modules containing the same singleton class, you are in trouble. Each module can (and probably will) have its own class-loader and thus its own static instance of the singleton resulting in multiple instances within the same JVM.

One very important aspect of JEE singletons is the fact that all of their methods are per default synchronized! If not otherwise configured, each singleton gets one read-write lock (check out this for details on read-write locks) from the container. Each method acquires the WRITE lock upon invocation. This effectively means that you can only have only one method of the class being executed at any point in time. If that is what you want, you are fine.

In order to achieve the same behaviour as with a "normal" java singleton (no synchronization of the singleton's methods) you will need to  explicitly switch to bean managed concurrency using the @ConcurrencyManagement(ConcurrencyManagementType.BEAN) annotation. If you need the read-write lock semantics, e.g. you have methods that change the state of the singleton and others that only read it, you can set the default lock mode of all methods to READ (check out section 4.8.5 of the specification) and only annotate the state-changing methods with the WRITE lock mode.

If you just migrate from java to JEE singletons without considering synchronization, you risk getting contention on the singleton's instance methods which would result in poor performance.

Monday, February 23, 2015

Performance Pitfalls: hibernate query plan cache thrashing

Hibernate internally uses a cache that maps HQL statements (as strings) to query plans. The cache consists of a bounded map limited by default to 2048 elements (configurable). All HQL queries are loaded through this cache. In case of a miss, the entry is automatically added to the cache. This makes it very susceptible to thrashing - a scenario in which we constantly put new entries into the cache without ever reusing them and thus preventing the cache from bringing any performance gains (it even adds some cache management overhead). To make things worse, it is hard to detect this situation by chance - you have to explicitly profile the cache in order to notice that you have a problem there. I will say a few words on how this could be done later on.

So the cache thrashing results from new queries being generated at high rates. This can be caused by a multitude of issues. The two most common that I have seen are - bugs in hibernate which cause parameters to be rendered in the JPQL statement instead of being passed as parameters and the use of an "in" - clause. 

Due to some obscure bugs in hibernate, there are situations when parameters are not handled correctly and are rendered into the JPQL query (as an example check out HHH-6280). If you have a query that is affected by such defects and it is executed at high rates, it will thrash your query plan cache because each JPQL query generated is almost unique (containing IDs of your entities for example).

The second issue lays in the way that hibernate processes queries with an "in" clause (e.g. give me all person entities whose company id field is one of 1, 2, 10, 18). For each distinct number of parameters in the "in"-clause, hibernate will produce a different query - e.g. select x from Person x where in (:id0_) for 1 parameter, select x from Person x where in (:id0_, :id1_) for 2 parameters and so on. All these queries are considered different, as far as the query plan cache is concerned, resulting again in cache thrashing. You could probably work around this issue by writing a utility class to produce only certain number of parameters - e.g. 1, 10, 100, 200, 500, 1000. If you, for example, pass 22 parameters, it will return a list of 100 elements with the 22 parameters included in it and the remaining 78 parameters set to an impossible value (e.g. -1 for IDs used for foreign keys). I agree that this is an ugly hack but could get the job done. As a result you will only have at most 6 unique queries in your cache and thus reduce thrashing.

So how do you find out that you have the issue? You could write some additional code and expose metrics with the number of entries in the cache e.g. over JMX, tune logging and analyze the logs, etc. If you do not want to (or can not) modify the application, you could just dump the heap and run this OQL query against it (e.g. using mat):
SELECT l.query.toString() FROM INSTANCEOF org.hibernate.engine.query.spi.QueryPlanCache$HQLQueryPlanKey l It will output all queries currently located in any query plan cache on your heap. It should be pretty easy to spot whether you are affected by any of the aforementioned problems.

As far as the performance impact goes, it is hard to say as it depends on too many factors. I have seen a very trivial query causing 10-20 ms of overhead spent in creating a new HQL query plan. In general, if there is a cache somewhere, there must be a good reason for that - a miss is probably expensive so your should try to avoid misses as much as possible. Last but not least, your database will have to handle large amounts of unique SQL statements too - causing it to parse them and maybe create different execution plans for every one of them.

Sunday, February 22, 2015

Performance Pitfalls - Hibernate Criteria API

The criteria API is a way to programmatically create dynamic JPA queries in a type safe manner. It is supposed to be used when the JPA query cannot be statically defined up front because the actual query depends on input provided by the user - e.g. a search form with many optional filter conditions that can be flexibly combined in any possible way.

In hibernate the criteria API is used to create a graph of hibernate objects (implementations of the org.hibernate.Criteria interface and related objects) representing the required query. This graph of objects is then internally transformed to a JPQL query (basically a string). This string is then parsed and mapped to a SQL query. It is interesting to notice, that hibernate does not map the criteria object graph directly to SQL but rather goes through an intermediate JPQL query string. This process is repeated every time a criteria API query is executed.

Obviously, this process requires a lot of CPU cycles each time a query is executed. In cases when such queries are run very often this might already be a prohibitive factor. Unfortunately, it gets even worse.

The even more concerning issue is part of the process of transforming JPQL queries to SQL. Hibernate tries to load some of the tokens in the JPQL query as classes to determine whether they are actually classes. This results in hibernate calling ClassLoader.loadClass() with parameters like "a0", "b1" etc.

Class-loaders are organized in hierarchies. In an enterprise application a WAR class-loader would have a parent EAR class-loader with a parent the application server class-loader and so on. The default behaviour of each class-loader, when looking for a class, is to first try to load the class from the parent class-loader. If this fails, it will try to load the class itself. As a result of this strategy, trying to load a non-existent class is pretty much the worst-case scenario for a class-loader - it has to look for the class in the root class-loader and in each class-loader down the hierarchy to finally determine that the class is not known and throw a ClassNotFoundException. This has two issues. First of all, normally class-loaders do not cache such cases. When a class is found, some class-loaders might remember that (cache it) so that subsequent calls are much faster. Missing classes are usually not cached. This means that looking for a non-existent class (even the same one) over and over again always hits this worst-case scenario causing each class-loader to go through all of its classes. The second issue lays in the way the some class-loaders (e.g. the WebLogic application server class-loader) which assume that loading classes does not happen that often and use a very coarse-grained locking - loading a class basically locks the class-loader through the whole search process.

Lets get back to our criteria API query. One enterprise application obviously has one single class-loader instance (and the same class-loaders on the way up in the hierarchy). Each criteria API query tries to get the lock on the same class-loader while looking for JPQL tokens. This results in all criteria API queries contending for the same locks - a high lock contention in the class-loader and sync times are what follows.

In this blog post we saw that one needs to be willing to pay a very high price to use use criteria API with hibernate. Depending on the performance requirements it might be okay for some applications. For others, however, that is absolutely prohibitive and any high-load process must not use criteria API. My advice as a performance engineer is to avoid criteria APIs as much as possible. In many cases it is possible to use a few named queries (which, by the way, go through the same JPQL to SQL process but only once upon start-up) or even revert to native SQL or other technologies.