Sunday, September 15, 2013

A brief history of a software bug

The month of September has a special significance in my professional career, to be precise, at my current work. In September of last year (2012) we encountered a production defect in one of our Java web applications: users couldn't use the application because of database locks on a table that plays the role of a gateway controller for that application. And finally, in September of this year (2013), I have officially closed that defect as resolved. It was the longest-running production bug of my entire professional career, taking exactly one year to resolve; the second longest took a little over a month. It was an amazing, if occasionally frustrating, experience, so I thought I would share the journey with you.

At first, let me give a brief description of this bug - the application had a central gateway table that was used to control access to the application's assets (I don't want to disclose the internal details), and each user needed to gain control of an asset (or, to be a little more technical, lock the asset) before they could continue working on it. The lock is released when the user completes the task or the lock goes stale. A lock is considered stale after a specified amount of time; after that, whenever any user comes into the system to acquire a lock on any asset (not only the stale one), the system does a round of housekeeping to clean up all existing stale locks. Yes, I know what you're thinking right now - it's not an efficient way of controlling access to anything in a piece of software, but it is what it is. We started seeing periodic (once or twice a month) database locks (Oracle database locks, not to be confused with the application's own locks that define ownership of an asset at a given point in time): a table lock would appear on that gateway controller table and prevent all other users from getting ownership of assets to start business transactions with. The quickest workaround was for the DBAs to kill the database locks, or to recycle the web application, which destroys all active connections and eventually releases the database locks. Now it was our job to debug it and find a fix for it.
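
To make the scheme above a little more concrete, here is a minimal sketch of how such a gateway-lock design typically looks in JDBC. The table and column names (ASSET_LOCK, ASSET_ID, LOCKED_BY, LOCKED_AT), the staleness threshold and the unique-constraint assumption are my own placeholders, not the application's actual code; the point is only that every acquisition first runs the housekeeping DELETE against the shared table, which is exactly the spot where a lingering database lock can freeze everybody:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.SQLIntegrityConstraintViolationException;
    import java.sql.Timestamp;

    // Hypothetical illustration of the original gateway-lock design (not the real code).
    public final class GatewayLockDao {

        private static final long STALE_AFTER_MS = 60L * 60L * 1000L; // placeholder staleness threshold

        /** Try to acquire the application-level lock on an asset for the given user. */
        public boolean acquireLock(Connection con, long assetId, String userId) throws SQLException {
            // 1. Housekeeping: the acquiring user first cleans up everyone's stale locks.
            try (PreparedStatement cleanup = con.prepareStatement(
                    "DELETE FROM ASSET_LOCK WHERE LOCKED_AT < ?")) {
                cleanup.setTimestamp(1, new Timestamp(System.currentTimeMillis() - STALE_AFTER_MS));
                cleanup.executeUpdate();
            }

            // 2. Claim the asset; a unique constraint on ASSET_ID rejects a second locker.
            try (PreparedStatement claim = con.prepareStatement(
                    "INSERT INTO ASSET_LOCK (ASSET_ID, LOCKED_BY, LOCKED_AT) VALUES (?, ?, ?)")) {
                claim.setLong(1, assetId);
                claim.setString(2, userId);
                claim.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
                claim.executeUpdate();
                con.commit();
                return true;
            } catch (SQLIntegrityConstraintViolationException alreadyLocked) {
                con.rollback();
                return false; // someone else already holds the lock
            }
        }
    }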

The first issue occurred in September 2012, immediately after the release in which we had migrated from WebSphere Application Server 6.1 Fix Pack 17 to 6.1 Fix Pack 31 and from Oracle 10g to Oracle 11g R2 on Exadata. I believe in a rule of thumb that says "if your working application is broken, look at what has just changed." But here the answer covered such a wide area that it was almost impossible to narrow down the change set. The changes touched the application server machine, the Unix version, the application server version, the database server machine, the database server version, the network configuration, and the data center - almost everything but the code.

I started taking the changes one at a time to narrow down the target. The first point of investigation was the application code, to see if there was any place where a database transaction wasn't properly committed. Even though the application had been working for years with no issue, my thought process was: maybe the code was written poorly, with a fundamental weakness that works in the sunny-day scenario, and the change in the underlying environment turned that weakness into a deadly bottleneck. I found a place where the transaction was getting committed/rolled back by WebSphere Application Server (WAS) through a WAS-specific configuration. My first thought was: could that be the source of this lock issue? But then the question arises: how come that same piece of code had worked for years until we migrated to this new infrastructure? Meanwhile the issue kept repeating once or twice a month, and the production server needed to be recycled, which required at least a two-hour blackout window. After a few occurrences the issue got escalated to senior management and I was pulled in to explain - I had taken over that application's management just a few months before the migration, so I was effectively pushed to prove myself by fixing the production issue. I started to look for whatever it would take to fix it sooner rather than later.

My firm hunch was that the issue was caused by the change in the infrastructure (as the code base had been unchanged for months), but I took a shot at that particular transaction handling where the WebSphere configuration was used. It was possible that the commit/rollback that had been handed over to WebSphere Application Server was not doing its job right in the newly changed infrastructure. I had no proof to support my hunch, but I decided to act rather than sit idle and hopelessly wait for another production blackout. Moreover, the usual bouncing game had started between the IBM and Oracle teams: each said the problem wasn't being caused by their system, and they blindly blamed the application code. I politely told them that this application code hadn't undergone any changes for months. Then the push-back game started between the two parties. Anyway, I asked the team to change the code so that instead of WAS handling the connection commit/rollback, it was handled within the application code itself. We waited another few weeks and the lock issue resurfaced - the change in the code didn't work.
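
For illustration, handling the transaction boundary explicitly in application code (instead of leaving it to the container) usually looks something like the minimal sketch below. The DataSource and doBusinessWork() are placeholders, not the application's actual classes:

    import java.sql.Connection;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // Sketch only: the application owns the commit/rollback instead of WAS.
    public class ExplicitTransactionExample {

        public void execute(DataSource dataSource) throws SQLException {
            Connection con = dataSource.getConnection();
            boolean committed = false;
            try {
                con.setAutoCommit(false);   // take ownership of the transaction boundary
                doBusinessWork(con);        // placeholder for the real unit of work
                con.commit();
                committed = true;
            } finally {
                try {
                    if (!committed) {
                        con.rollback();     // never leave an open transaction (and its row locks) behind
                    }
                } finally {
                    con.close();            // return the connection to the pool
                }
            }
        }

        private void doBusinessWork(Connection con) throws SQLException {
            // ... real SQL statements would go here ...
        }
    }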

As a next step I changed my strategy. Rather than looking for a silver bullet and waiting for months (the issue occurred infrequently but regularly, once or twice a month, and only in the production environment), I decided to change the underlying code that was bringing the application to its knees, i.e. change the definition of the application's locking, which required a periodic cleanup of data that caused that table to be locked and eventually froze the entire application. To give a backdrop - the cleanup of inserted records had been implemented to prevent a build-up of stale application locks left by users who close their browser abruptly without properly logging out of the application. This cleanup was embedded in the user's logout action as well as in the lock acquisition, i.e. a user first cleans up everyone else's mess (which I never liked as a design philosophy - one poor guy is responsible for cleaning up another guy's mess) and then continues using the application. Anyway, I decided to redefine a valid application item lock as anything that's not stale (with a timeout of 60 minutes); if a user doesn't log out properly from the application, the lock simply stays there and the next user can pick up that application item by re-locking it. That didn't help either; we kept getting the locking issue at the prior rate, i.e. once or twice a month. But it did do one thing - now the entire application wasn't freezing, only the single user who was trying to acquire the lock on that application item was impacted, and the production support team could kill the session holding the row lock as a regular incident without creating a big fuss about it.
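
The revised rule boils down to a single-row, conditional update, roughly like the hypothetical sketch below (again, the table and column names are mine, not the application's). Because only one row is touched, a stuck session can no longer block every other user of the gateway table:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    // Sketch of the new locking rule: a lock is honored only while it is younger
    // than 60 minutes; anyone may re-lock a stale item.
    public class RowLevelLockExample {

        private static final long STALE_AFTER_MS = 60L * 60L * 1000L; // 60-minute timeout

        /** Returns true if this user now owns the item, false if someone else holds a fresh lock. */
        public boolean relock(Connection con, long assetId, String userId) throws SQLException {
            String sql =
                "UPDATE ASSET_LOCK " +
                "   SET LOCKED_BY = ?, LOCKED_AT = ? " +
                " WHERE ASSET_ID = ? " +
                "   AND (LOCKED_BY IS NULL OR LOCKED_AT < ?)"; // free, or stale and re-lockable
            long now = System.currentTimeMillis();
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, userId);
                ps.setTimestamp(2, new Timestamp(now));
                ps.setLong(3, assetId);
                ps.setTimestamp(4, new Timestamp(now - STALE_AFTER_MS));
                boolean acquired = ps.executeUpdate() == 1;
                con.commit();
                return acquired;
            }
        }
    }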

This bought us enough time to look into the issue whenever it arose. Nobody even noticed that there was a production issue any more - it was now a normal-priority incident rather than a critical one. We continued working through every point of change, one by one. Rather than go into detail on each of them, let me describe them briefly -

A. We changed the application's transaction strategy so that any user session that times out is rolled back by the application code itself, taking that control back from WebSphere Application Server

B. Added indexes to all foreign keys of the database tables that have referential integrity constraints (parent-child relationships), as the table locking mechanism is supposedly improved in Oracle 11g (?)

C. Worked with the Oracle DBAs to investigate the database configuration for anything that would require a change in the application code - this bounced back to us with the verdict that the database configuration was PERFECT

D. Similarly, worked with the WebSphere Application Server admins to verify the configuration changes, looking for anything conspicuous that could be triggering the database locks

E. Verified the network communication with the corporate network administration team, checking whether any change in network topology might have increased latency or caused packets to drop. Also worked with the Unix server admin team to check whether the network interface cards had any packet-loss issues or the like, but no luck

F. Added JDBC parameters (oracle.jdbc.ReadTimeout, oracle.net.READ_TIMEOUT, etc.) to make sure that no wait on the network connection or the database could leave a connection hanging and thereby keep the table lock active (a rough sketch of this follows the list). No luck

G. As mentioned before, we changed the code to change the behavior of the application
    - got rid of the dependency on IBM WebSphere Application Server for rolling back uncommitted transactions
    - changed the application architecture from table-level locking to row-level locking - which hugely helped us buy precious time to continue the investigation
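
To illustrate item F, the kind of change involved is sketched below for a plain thin-driver connection. The URL, credentials and timeout values are invented, and how the properties get wired up in a real WAS deployment (data source custom properties versus code) will differ; the idea is simply that a read that hangs on the network or the database gets cut off instead of holding its locks indefinitely:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    // Hypothetical sketch: applying read/connect timeouts to an Oracle JDBC connection.
    public class TimeoutConnectionExample {

        public Connection open() throws SQLException {
            Properties props = new Properties();
            props.setProperty("user", "app_user");                    // placeholder
            props.setProperty("password", "app_password");            // placeholder
            props.setProperty("oracle.jdbc.ReadTimeout", "60000");    // socket read timeout, in ms
            props.setProperty("oracle.net.CONNECT_TIMEOUT", "10000"); // connect timeout, in ms
            return DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/appservice", props); // placeholder URL
        }
    }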

None of the above changes resolved the issue. And since the issue occurred roughly once a month and couldn't be reproduced in any non-production environment, each change took about a month to be proven worthless. In July I decided to change the Oracle RAC configuration from load balancing to failover. We had discussed this long before, but couldn't find any evidence to back up the idea that Oracle RAC could be the issue. One day in July I was on my way home and got an email from our application support team saying that they had found a database lock once again and asking for my confirmation to kill it (I had set up the process so that I'm kept in the loop for every such event). I told the person to kill the locks, but one thing struck me as fishy: the screenshot of the database lock showed it sitting on the 2nd node of the RAC. That shouldn't have been the case, because we had changed to failover mode, so an explicit change would be required to use the 2nd node. I asked the person to confirm which node the application was running on and got confirmation (this conversation was going back and forth on my BlackBerry) that the app was running on the first node. That gave me a clear indication that something fishy was going on at the RAC level. The next day I discovered that the RAC configuration change from load balancing to failover had been rolled back when the DBA rebooted the database server for maintenance, and load-balancing mode had been re-enabled. That was the moment of truth. I asked the DBA to change it back to failover mode, made sure the configuration change was persistent, and then waited two months, until this month, before declaring victory over this bug.
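
For reference, the load-balancing versus failover choice on the client side typically comes down to the connect descriptor. A hypothetical tnsnames.ora entry for failover-only access might look like the sketch below; the alias, host names and service name are invented, and our actual configuration was owned by the DBAs:

    APPDB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = OFF)
          (FAILOVER = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = racnode1-vip)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = racnode2-vip)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = appdb)
        )
      )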

Now here's my hypothesis about this application bug: the application was using distributed database transactions, and the RAC configuration probably wasn't done correctly, so that when the two-phase commit takes place, the second phase sometimes (around once or twice a month in our case) doesn't find the first-phase transaction and waits forever. Granted, that scenario should surface as a different application error (with an Oracle error code), but it's my only hypothesis so far. I'll keep my eyes open for concrete evidence to back it up.