Friday, November 17, 2017

A story about backwards compatibility

I was recently involved in a massive WebLogic and Hibernate upgrade. Way too many things went wrong and took days to figure out, but there is one I will remember for a long time.

After all the tests were green, after we had fixed a few bugs the tests missed and implemented workarounds for new Hibernate bugs, it was time to deploy our work to a higher environment (pre-production). I did not have many concerns: we had already deployed to a few environments without issues, had run many tests, and had done everything in our power to reduce the risk of anything going wrong.

Guess what: something went wrong, at the first possible moment. The deployment failed. We were not able to deploy our EAR files because they could not find the data source in JNDI. The root cause was an interruption / socket read timeout exception while opening connections in the connection pools. These problems were sporadic and hit different servers and different connection pools, which was pretty weird because other pools were initialised successfully at the same time.

We knew that we had configured a 3 second connection timeout in the Oracle JDBC URL connection string using the NETWORK_CONNECT_TIMEOUT=3 parameter. Three seconds is a very long time inside a data center, so hitting that timeout this often was surprising. Unfortunately, increasing the timeout to a much higher value was not a viable option because of other issues we had experienced in the past with RAC failover etc.
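For reference, the parameter sat inside the descriptor-style JDBC URL, roughly like this (host and service name are placeholders, and the exact descriptor layout may differ from what we actually had):

```text
jdbc:oracle:thin:@(DESCRIPTION=
  (NETWORK_CONNECT_TIMEOUT=3)
  (ADDRESS=(PROTOCOL=TCP)(HOST=db-host)(PORT=1521))
  (CONNECT_DATA=(SERVICE_NAME=my-service)))
```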

The team got bigger (because nine women can make a baby in one month) and we started looking in many directions: were the virtual machines overloaded, were the hypervisors overloaded, did we have issues on the database, was it the network? We captured network traffic and looked through endless TCP streams (by the way, I now know more about Oracle ONS than before), restarted the whole environment, restarted database services, increased the available entropy by running rngd, and changed Java to use urandom (working around an age-old bug by specifying the entropy source as /dev/./urandom). Nothing helped and we were not getting anywhere.
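The urandom workaround mentioned above boils down to a single JVM startup flag; the extra dot in /dev/./urandom is what sidesteps the old JDK bug that special-cases the literal value file:/dev/urandom (app.jar is a placeholder here):

```shell
# The "/dev/./urandom" spelling defeats the JDK's special-casing of the
# literal string "file:/dev/urandom", so the non-blocking device is used.
java -Djava.security.egd=file:/dev/./urandom -jar app.jar
```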

Then I used an old and trivial Java program that just connects to the Oracle database via JDBC, executes a configurable SQL statement a given number of times with configurable concurrency, and shuts down. In this case I configured it to execute the classic health-check statement "select 1 from dual" with a single thread. I started it in a loop and voilà: after a while an invocation failed with a socket timeout exception. Excellent, the problem was not in WebLogic! We were relieved. It was an environment issue that had to be addressed, but nothing that should stop us from proceeding to production. I turned off my computer.
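The little program was nothing fancy; a minimal sketch of it could look like this (the JDBC URL and credentials are placeholders, not the real ones, and the harness is generic so the check can be swapped out):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Runs a health check a given number of times with a given concurrency
// and counts the failures, so sporadic connection problems show up fast.
public class JdbcHammer {

    public static int run(int threads, int iterations, Callable<Void> check)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        for (int i = 0; i < iterations; i++) {
            pool.submit(() -> {
                try {
                    check.call();
                } catch (Exception e) { // e.g. a socket read timeout
                    failures.incrementAndGet();
                    System.err.println(e);
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        return failures.get();
    }

    // The real check: open a fresh connection and run the health query.
    static Void oracleCheck() throws Exception {
        try (Connection c = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//db-host:1521/my-service", "user", "pass");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("select 1 from dual")) {
            rs.next();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Single thread, many iterations: exactly the loop described above.
        int failures = run(1, 100, JdbcHammer::oracleCheck);
        System.out.println("failures: " + failures);
    }
}
```

The important property is that every iteration opens a brand-new connection, so the connect timeout is exercised on every single run.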

After putting the baby to sleep I decided to look through the logs again, because I did not want to miss something and risk issues in production. After a while I noticed something suspicious: in the log where I was executing the Java program in a loop, the timestamps of the two successful executions surrounding the failure were only about 2 seconds apart. So how can a timeout of 3 seconds be exceeded in less than 2 seconds of wall-clock time? It just isn't possible (ignoring anomalies like time synchronisation, leap seconds and such). After eliminating some other possibilities (including some we had already eliminated before that point, like ONS/FAN), I came to the conclusion that the time unit must be wrong. A search on the web led me to documentation on an Oracle site that, among a million other things, had a tiny note saying that the unit of NETWORK_CONNECT_TIMEOUT had changed from seconds to milliseconds (with ojdbc 12.2.x), and that if you want to keep the old behaviour you have to apply a patch!

So a commercial library used by a ton of people changes the unit of a parameter and hides that information in a note somewhere.

Let that sink in for a second! Or is it a millisecond...

Three things went wrong during the development of the ojdbc driver:

  • They defined a property that holds a duration without including the unit in the property's name. Including it (e.g. NETWORK_CONNECT_TIMEOUT_SEC) might look redundant, but many, many people will thank you later. So the rule is: always include the unit in the name of duration properties.
  • Initially they chose seconds as the unit, which is way too coarse for such a low-level timeout. I would always go for milliseconds unless it makes absolutely no sense, and then probably still do it.
  • They changed the meaning of an existing property for no good reason. If you need a millisecond timeout, just introduce a new property called NETWORK_CONNECT_TIMEOUT_MS, keep the old one in the interface, and map it to milliseconds internally. Fail if both are defined. It is that easy. If you really want to reuse the existing property, then provide a patch that changes it to milliseconds and keep the default behaviour backwards compatible.
Note that if they had followed the first rule (and named the property NETWORK_CONNECT_TIMEOUT_SEC in the first place), they would never have decided to silently change the unit to milliseconds; they would have been forced to introduce a new property, causing far fewer WTFs per day.
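The backwards-compatible path described above fits in a few lines. This is a hypothetical sketch, not the actual ojdbc API; the property names and the helper are mine:

```java
import java.util.Properties;

// Hypothetical sketch of a backwards-compatible timeout lookup:
// keep the old seconds-based property, add a new _MS one, and
// refuse ambiguous configurations where both are set.
public class TimeoutCompat {
    static final String OLD = "NETWORK_CONNECT_TIMEOUT";    // legacy, seconds
    static final String NEW = "NETWORK_CONNECT_TIMEOUT_MS"; // new, milliseconds

    /** Returns the connect timeout in milliseconds, or -1 if unset. */
    public static long connectTimeoutMillis(Properties p) {
        String oldVal = p.getProperty(OLD);
        String newVal = p.getProperty(NEW);
        if (oldVal != null && newVal != null) {
            throw new IllegalArgumentException(
                    "Set either " + OLD + " or " + NEW + ", not both");
        }
        if (newVal != null) return Long.parseLong(newVal);
        if (oldVal != null) return Long.parseLong(oldVal) * 1000L; // seconds -> ms
        return -1;
    }
}
```

Old configurations keep working unchanged, new configurations get millisecond precision, and contradictory ones fail fast instead of silently picking a unit.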

I hope this helps the next victim of the NETWORK_CONNECT_TIMEOUT issue in the Oracle ojdbc8 12.2 driver.

Originally posted at