the critical 3%: 2017

In this post I will write about an idea we came up with while working on a Hibernate upgrade of a medium-sized monolithic project. I have found it very useful when learning new libraries and technologies or upgrading them.

You may have already heard about learning tests. If you are familiar with them you can skip this paragraph. Imagine that you need to use a library you are unfamiliar with. To learn it, you will probably read some documentation, try out examples in a trivial application, and maybe research online what others are doing. You will repeat that until you are confident that you understand how the library works and are ready to use it in a real-world scenario. The central point of learning tests is to use a test framework to try things out instead of writing throw-away mini-applications. It requires a bit more work, but you end up learning the library, getting better at writing unit tests and having an executable test suite that verifies certain aspects in an automated fashion. Great!
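As a minimal sketch of what such learning tests can look like, here are three small checks against java.time.Duration from the JDK. To keep the example free of dependencies it uses plain static methods and a hypothetical expectEquals helper; in a real project these would be test methods in whatever test framework you already use. The class and method names are made up for illustration.

```java
import java.time.Duration;

// Learning tests for java.time.Duration, written as plain static methods
// so that the sketch runs without a test framework on the classpath.
final class DurationLearningTest {

    // Hypothetical helper standing in for a framework's assertEquals.
    static void expectEquals(Object expected, Object actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    // The basic feature we rely on: converting between units.
    static void secondsConvertToMilliseconds() {
        expectEquals(3000L, Duration.ofSeconds(3).toMillis());
    }

    // ISO-8601 text and the factory method produce equal values.
    static void isoTextParsesToTheSameValue() {
        expectEquals(Duration.ofSeconds(3), Duration.parse("PT3S"));
    }

    // The kind of surprise worth writing down: the seconds part of a
    // negative duration is rounded down, not towards zero.
    static void negativeDurationsRoundSecondsDown() {
        expectEquals(-2L, Duration.ofMillis(-1500).getSeconds());
    }

    // Runs every learning test and returns how many were executed.
    static int runAll() {
        secondsConvertToMilliseconds();
        isoTextParsesToTheSameValue();
        negativeDurationsRoundSecondsDown();
        return 3;
    }

    public static void main(String[] args) {
        System.out.println(runAll() + " learning tests passed");
    }
}
```

Each method records one expectation about the library. The third one is the interesting kind: something that is easy to get wrong and that you want to be told about if it ever changes.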

These learning tests can be helpful to other members of the team too. Developers can use them to learn the library by reading, executing and debugging the tests. The resulting test-suite will be fast to execute as it only has to set up whatever is required by this library and nothing else - learning a library for HTTP communication, for example, will require you to spin up a simple web server. That will probably still be much faster than running your full-blown application with a database and many more dependencies just so that you can run a test that happens to use this library somewhere along the line.

You can promote the test-suite to an independent project. In a micro-service world you will probably be using this library in many micro-services and do not want to limit its usage to a single application. As it is only testing aspects of the library, it is agnostic to your application logic and can thus live independently. You can think of this independent test project as a playground where you can reproduce issues, try out new features and learn. The best part is that the cost for this has been pretty low so far - you just kept the by-product of your learning process that you would have otherwise deleted.

If you go one step further and improve the quality of those tests, carefully document and distill them, and really treat them as first-class citizens, you will end up with a set of tests that verify your expectations of a library - something like a testable contract between your application and the library. (A real contract would require both parties to agree to the conditions, so it is more like the unwritten terms and conditions of the library, which we have accepted by using it.) In this test-suite you will have tests for all major features that you use, probably in functionally relevant variations. You will also include tests for exotic features - anything that surprised you or that you found non-trivial at the time. Any behaviour that was discovered in production via a bug report should also be part of the test-suite. Over time it will grow and get refined.

Having such a testable contract will come in handy when upgrading. Now you can use your test-suite to get the upgrade process started. You change the library version and get instant feedback from the compiler - all compilation problems are changes to the public API that you will need to understand and later rework in your actual project. The test results are even more exciting. Failing tests indicate deviations from your contract - expectations that are no longer met by the library. You will have to understand each such change and later identify the places in your application that use this feature and adapt them. You will be able to switch between versions, debug through the library code with the working and with the failing version and search for an explanation for the change. There are always implicit expectations that you have but have not coded into tests - the tests do not give you a foolproof guarantee of a smooth upgrade, but they will get you very far very fast. Their major goal is to help you learn what has changed in the new version so that you can apply this knowledge to your applications afterwards. A red test is also much harder to miss than a sentence in a multi-page changelog of the library.

Back to the Hibernate upgrade. Unfortunately there were no pre-existing learning tests that we could use, and no playground to reproduce and debug things. In many cases we had to remotely debug a WebLogic server to find out what was going on. Soon we noticed that such a playground project would be very helpful to verify certain assumptions about how Hibernate behaves. We created one that starts an embedded in-memory database and creates the schema on the fly. We defined dedicated entities with the required relationships for each test case. Using a custom JDBC driver we wrapped the actual driver and were able to track the SQL statements executed by Hibernate. We parsed those statements using the facilities of the in-memory database and wrote some assertion methods that could structurally compare SQL statements. Now we were in a position to execute some code and write expectations on which SQL statements Hibernate executes behind the scenes (ignoring the generated aliases).
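The tracking idea can be sketched in a few lines. Note that this is a simplification and not the code from the project: instead of a full wrapping JDBC driver it wraps a single java.sql.Connection in a dynamic proxy and only records the SQL text passed to prepareStatement. The class name SqlRecorder and its methods are hypothetical.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Records the SQL text of every prepareStatement call made through a
// wrapped connection. Statements executed via createStatement() are not
// captured in this sketch.
final class SqlRecorder {

    private final List<String> statements = new CopyOnWriteArrayList<>();

    // Returns a connection that behaves like the delegate but remembers
    // the SQL passed to any prepareStatement overload.
    Connection wrap(Connection delegate) {
        return (Connection) Proxy.newProxyInstance(
                SqlRecorder.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                (proxy, method, args) -> {
                    boolean prepares = method.getName().equals("prepareStatement");
                    if (prepares && args != null && args.length > 0 && args[0] instanceof String) {
                        statements.add((String) args[0]);
                    }
                    try {
                        return method.invoke(delegate, args);
                    } catch (InvocationTargetException e) {
                        // Rethrow the original exception, e.g. an SQLException.
                        throw e.getCause();
                    }
                });
    }

    // Snapshot of the statements recorded so far, in call order.
    List<String> statements() {
        return List.copyOf(statements);
    }
}
```

A test would hand the wrapped connection (or a data source that produces wrapped connections) to the code under test and then assert on statements(). Comparing statements structurally while ignoring generated aliases, as described above, needs an SQL parser and is left out here.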

Let me give you an example of one such test that turned out to be pretty important. As I am interested in non-functional aspects of systems, I wanted to know exactly which operations trigger loading of lazily loaded entities. One of the tests I wrote checked whether getting the primary key of a referenced ManyToOne entity (e.g. child.getParent().getId()) triggered a load. It did, and I was not expecting that, because the entity on the many side already had a column containing the primary key of the referenced entity on the one side, so there was no real need to load the whole referenced entity just to return the id. But Hibernate still did it, so I wrote a few assertions that the primary key getter does actually trigger a lazy load. We also noticed that we relied on this side effect of the getter to make sure that lazily loaded entities got loaded before detaching them and passing them to other components. Well, that test paid off pretty fast, as it showed us that the next minor Hibernate upgrade (5.2.12) does not fulfil this expectation - with any future upgrade we have to be careful in places where we call the primary key getter to provoke a load (https://hibernate.atlassian.net/browse/HHH-11838).

After this particular experience I wrote some other playground / contract applications - so far they have had a pretty good return on investment and I hope that this post motivates you enough to try this out too. If you already have, please share your experience or best practices as a comment on this post at https://dimovelev.blogspot.com/2017/11/library-playground-and-testable.html.

I have recently been involved in a massive WebLogic 12.2.1.3 and Hibernate upgrade. Way too many things went wrong and took days to figure out, but there is one I will remember for a long time.

Once all the tests were green, and we had fixed a few bugs that the tests missed and implemented workarounds for new Hibernate bugs, it was time to deploy our work to a higher environment (pre-production). I did not have many concerns, as we had already deployed to a few environments without issues, had run many tests and had done all that was in our power to reduce the risk of anything going wrong.

Guess what - something went wrong. At the first possible moment - the deployment failed. We were not able to deploy our EAR files because they could not find the data source in JNDI. The immediate cause was an interruption / socket read timeout exception while opening connections in the connection pools. These problems were sporadic and hit different servers and different connection pools. Pretty weird, because other pools were successfully initialised at the same time.

We knew that we had configured a 3 second connection timeout in the Oracle JDBC URL connection string using the NETWORK_CONNECT_TIMEOUT=3 parameter. Within a data center, three seconds is a very long time, so connection attempts should not have been failing that often. Unfortunately, increasing the timeout to a much higher value was not a viable option because of other issues we had experienced in the past with RAC failover etc.

The team got bigger (because nine women can make a baby in one month) and we started looking in many directions - are the virtual machines overloaded, are the hypervisors overloaded, do we have issues on the database, is it the network? We captured network traffic, looked through endless TCP streams (btw, now I know more about Oracle ONS than before), restarted the whole environment, restarted database services, increased the entropy by running rngd, and changed Java to use urandom (by working around an age-old bug and specifying the entropy source as /dev/./urandom). Nothing helped and we were not getting anywhere.

Then I used an old and trivial Java program that just connects to the Oracle database via JDBC, executes a configurable SQL statement a given number of times with a configurable concurrency and shuts down. In this case I configured it to execute the classical health-check statement "select 1 from dual" with 1 thread. I started it in a loop and voila - after a while an invocation failed with a socket timeout exception. Excellent, the problem was not in WebLogic 12.2.1.3! We were relieved - it was an environment issue that had to be addressed, but nothing that should stop us from proceeding to production. I turned off my computer.
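The program itself is not part of this post, but a rough reconstruction of such a probe could look like the sketch below. All names (ProbeRunner, countFailures, jdbcProbe) are hypothetical, and the JDBC URL and credentials are supplied by the caller; a suitable JDBC driver has to be on the classpath for the database part to work. Failures are logged with a timestamp, which turns out to matter later in the story.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ProbeRunner {

    // Runs the probe the given number of times on a fixed pool of worker
    // threads and returns how many invocations threw an exception.
    static int countFailures(Callable<Void> probe, int repetitions, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> results = new ArrayList<>();
            for (int i = 0; i < repetitions; i++) {
                results.add(pool.submit(probe));
            }
            int failures = 0;
            for (Future<Void> result : results) {
                try {
                    result.get();
                } catch (ExecutionException e) {
                    failures++;
                    System.err.println(Instant.now() + " probe failed: " + e.getCause());
                }
            }
            return failures;
        } finally {
            pool.shutdown();
        }
    }

    // A probe that opens a fresh connection, executes one statement and
    // closes everything again, so every run exercises the connect path.
    static Callable<Void> jdbcProbe(String url, String user, String password, String sql) {
        return () -> {
            try (Connection connection = DriverManager.getConnection(url, user, password);
                 Statement statement = connection.createStatement()) {
                statement.execute(sql);
            }
            return null;
        };
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.err.println("usage: ProbeRunner <jdbc-url> <user> <password> <repetitions>");
            return;
        }
        Callable<Void> probe = jdbcProbe(args[0], args[1], args[2], "select 1 from dual");
        int failures = countFailures(probe, Integer.parseInt(args[3]), 1);
        System.out.println(Instant.now() + " done, failures: " + failures);
    }
}
```

Keeping the probe separate from the loop makes it easy to swap in another statement, or another check altogether.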

After putting the baby to sleep I decided to look through the logs again, because I did not want to miss something and risk issues in production. After a while I noticed something that looked suspicious - in the log of the Java program that I was executing in a loop, the timestamps of the two successful executions that surrounded the failure were only about 2 seconds apart. So how can we exceed a timeout of 3 seconds in less than 2 seconds of wall clock time? It just isn't possible (ignoring anomalies like time synchronisation, leap seconds and such). After eliminating some other possibilities (including some that we had already eliminated before that point, like ONS/FAN) I came to the conclusion that the time unit must be wrong. A web search led me to a document on an Oracle site that, among a million other things, had a tiny note that the unit of NETWORK_CONNECT_TIMEOUT changed from seconds to milliseconds (with ojdbc 12.2.x) and that if you want to keep the old behaviour you have to apply a patch!

So a commercial library used by a ton of people changes the unit of a parameter and hides the information in a note somewhere.

Let that sink in for a second! Or is it a millisecond...

Three things went wrong during the development of the ojdbc driver:

1. They defined a property which contains a duration without including the unit in the name of the property. Including it might look redundant (e.g. NETWORK_CONNECT_TIMEOUT_SEC) but many, many people will thank you later. So the rule is: always include the unit in the name of duration properties.
2. Initially they decided to go for seconds as the unit - that is way too coarse for such a low-level timeout. I would always go for milliseconds unless it makes absolutely no sense, and then probably still do it.
3. They changed the API without any good reason. If you need to have a millisecond timeout, then just introduce a new property called NETWORK_CONNECT_TIMEOUT_MS. Keep the old one in the interface and map it to milliseconds. Fail if both are defined. It is that easy. If you really want to reuse the existing name, well, provide a patch that changes it to milliseconds and keep the default version backwards compatible.

Note that if they had followed the first rule (and named the property NETWORK_CONNECT_TIMEOUT_SEC in the first place) they would never have decided to change the unit to milliseconds and would have been forced to introduce a new property, thus causing far fewer WTFs / day.
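The approach suggested above fits in a handful of lines. The following is only a sketch of that suggestion and has nothing to do with how ojdbc is implemented: NETWORK_CONNECT_TIMEOUT_MS is the hypothetical new property proposed in this post, and the class ConnectTimeouts and its resolve method are made up for illustration.

```java
import java.time.Duration;
import java.util.Properties;

final class ConnectTimeouts {

    // Existing property: keeps its original meaning (seconds) forever.
    static final String LEGACY_SECONDS = "NETWORK_CONNECT_TIMEOUT";

    // Hypothetical new property with the unit in its name.
    static final String MILLIS = "NETWORK_CONNECT_TIMEOUT_MS";

    // Resolves the connect timeout from the given properties.
    // Fails fast if both properties are set, because then the intent
    // of the configuration is ambiguous.
    static Duration resolve(Properties properties, Duration fallback) {
        String seconds = properties.getProperty(LEGACY_SECONDS);
        String millis = properties.getProperty(MILLIS);
        if (seconds != null && millis != null) {
            throw new IllegalArgumentException(
                    "Set only one of " + LEGACY_SECONDS + " and " + MILLIS);
        }
        if (millis != null) {
            return Duration.ofMillis(Long.parseLong(millis.trim()));
        }
        if (seconds != null) {
            // The old property is mapped to the new internal unit.
            return Duration.ofSeconds(Long.parseLong(seconds.trim()));
        }
        return fallback;
    }
}
```

With this in place an existing configuration that says NETWORK_CONNECT_TIMEOUT=3 still means three seconds after the upgrade, and anyone who needs a finer resolution opts in explicitly through the new name.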

I hope this helps the next victim of the NETWORK_CONNECT_TIMEOUT issue in Oracle ojdbc8 12.2.

Originally posted at https://dimovelev.blogspot.com/2017/11/a-story-about-backwards-compatibility.html.


Wednesday, November 22, 2017

Library Playground and Testable Contracts

Friday, November 17, 2017

A story about backwards compatibility
