Sunday, September 15, 2013

A brief history of a software bug

The month of September has a huge significance in my professional career, to be precise, at my current job. In September of last year (2012) we encountered a production defect in one of our Java web applications: users weren't able to use the application due to database locks on a table that plays the role of a gateway controller for that application. And finally, in September of this year (2013), I have officially closed that defect as resolved. It was the longest-running production bug of my entire professional career, taking exactly one year to resolve; the second longest took a little over a month. It was an amazing, though occasionally frustrating, experience, so I thought I would share the journey with you.

First, let me give a brief description of this bug. The application had a central gateway table that was used to control access to the application's assets (I don't want to disclose the internal details), and each user needed to gain control of an asset (or, to be a little more technical, lock the asset) before continuing to work on it. The lock is released when the user completes the task or the lock becomes stale. A lock is considered stale after a specified amount of time, and when any user subsequently tries to acquire a lock on any asset (not only the stale one), the system does housekeeping to clean up all existing stale locks. Yes, I know what you're thinking right now: it's not an efficient way of controlling access to anything in software, but it is what it is. We started seeing periodic (once or twice a month) database locks (Oracle database locks, not to be confused with the application's locks that define ownership of an asset at a given point in time), where a table lock on that gateway controller table prevented all other users from getting ownership of assets to start business transactions. The quickest workaround was for the DBAs to kill the database locks, or to recycle the web application, which destroys all active connections and eventually releases the database locks. Now it was our job to debug it and find a fix.
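To make that pattern concrete, here is a minimal sketch of what such an acquire-and-housekeep flow could look like in JDBC. The table, column, and class names (ASSET_LOCK, LOCKED_BY, and so on) are purely hypothetical, as are the 60-minute staleness window and the unique constraint on ASSET_ID; the real schema isn't something I can disclose.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Hypothetical sketch of the original gateway-table pattern described above.
    // Table and column names are made up; the real schema isn't disclosed here.
    public class GatewayLockExample {
        public static boolean acquire(Connection con, long assetId, String userId) throws SQLException {
            con.setAutoCommit(false);
            // Housekeeping first: every acquire attempt cleans up ALL stale locks,
            // not just the one for this asset (the design smell mentioned above).
            // The 60-minute staleness window is purely illustrative.
            try (PreparedStatement cleanup = con.prepareStatement(
                    "DELETE FROM ASSET_LOCK WHERE LOCK_TIME < SYSDATE - (60 / 1440)")) {
                cleanup.executeUpdate();
            }
            // Then try to take ownership of the requested asset
            // (assuming a unique constraint on ASSET_ID).
            try (PreparedStatement insert = con.prepareStatement(
                    "INSERT INTO ASSET_LOCK (ASSET_ID, LOCKED_BY, LOCK_TIME) VALUES (?, ?, SYSDATE)")) {
                insert.setLong(1, assetId);
                insert.setString(2, userId);
                insert.executeUpdate();
                con.commit();
                return true;
            } catch (SQLException alreadyLocked) {
                con.rollback();   // another user already owns this asset
                return false;
            }
        }
    }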

The first issue occurred in September 2012, immediately after the release in which we had migrated from WebSphere Application Server 6.1 Fix Pack 17 to 6.1 Fix Pack 31 and from Oracle 10g to Oracle 11g R2 on Exadata. I believe in the rule of thumb that says, "if your working application is broken, look for what has changed." But here the answer covered such a wide area that it was almost impossible to narrow down the change set. The changes included the application server machine, the Unix version, the application server version, the database server machine, the database server version, the network configuration, and the data center - that is, almost everything but the code.

I started to take on one change at a time to narrow down the target. The first point of investigation was the application code, to find whether there was any place where a database transaction wasn't properly committed. Even though the application had been working for years with no issue, my thought process was: maybe the code was written poorly, with a fundamental weakness that works in a sunny-day scenario, and the change in the underlying environment turned that weakness into a deadly bottleneck. I found a place where the transaction was getting committed/rolled back by WebSphere Application Server (WAS) through a WAS-specific configuration. My first thought was: could that be the source of this lock issue? But then the question arises: how come that same piece of code had worked for years until we migrated to this new infrastructure? Meanwhile, the issue was repeating once or twice a month, and the production server needed a recycle, which requires at least a two-hour blackout window. After a few occurrences the issue got escalated to senior management and I was pulled in to explain - I had taken over that application's management just a few months before the migration. So I was, in a way, pushed to prove myself by fixing the production issue, and I started to look for whatever it took to fix it sooner rather than later.

My firm hunch was that the issue occurred because of the change in the infrastructure (the code base had been unchanged for months), but I took a shot at that particular transaction handling where the WebSphere configuration was used. It was possible that the committing/rolling back of transactions that had been handed over to WebSphere Application Server was not doing its job right in the newly changed infrastructure. I had no proof to support my hunch, but I decided to act rather than sit idle, hopelessly waiting for another production blackout. Moreover, the usual bouncing game had started, with the IBM and Oracle teams each insisting the problem wasn't caused by their system and blindly blaming the application code. I politely told them that this application code hadn't undergone any changes for months. Then the push-back game started between the two parties. Anyway, I asked the team to change the code so that the connection commit/rollback was handled within the application code instead of by WAS. We waited another few weeks and the lock issue resurfaced - the change in the code didn't work.
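For illustration only, the change amounted to something like the following explicit transaction handling. The class, method, and table names here are made up, and the real data-access layer is of course more involved than this sketch.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // Sketch of taking commit/rollback back into the application code instead of
    // leaving it to the container configuration. Names are hypothetical.
    public class ExplicitTransactionExample {
        public static void updateAsset(DataSource ds, long assetId) throws SQLException {
            try (Connection con = ds.getConnection()) {
                con.setAutoCommit(false);
                try (PreparedStatement ps = con.prepareStatement(
                        "UPDATE ASSET SET STATUS = ? WHERE ASSET_ID = ?")) {
                    ps.setString(1, "IN_PROGRESS");
                    ps.setLong(2, assetId);
                    ps.executeUpdate();
                    con.commit();      // explicit commit in the application code
                } catch (SQLException e) {
                    con.rollback();    // explicit rollback so no lock is left dangling
                    throw e;
                }
            }
        }
    }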

As a next step I changed my strategy. Rather than looking for a silver bullet and waiting for months (the issue occurred infrequently but regularly, once or twice a month, and only in the production environment), I decided to change the underlying code that brought the application to its knees, i.e. I changed the definition of the application's locking, which had required a periodic cleanup of data - the very cleanup that caused the table to be locked and eventually froze the entire application. To give some backdrop: the cleanup of inserted records was implemented to prevent a buildup of stale application locks left by users who close the browser abruptly without properly logging out of the application. This cleanup was embedded in the user's logout action as well as in lock acquisition, i.e. a user first cleans up everyone else's mess (which I never liked as a design philosophy - one poor guy is responsible for cleaning up another guy's mess) and then continues using the application. Anyway, I redefined the application item's lock so that anything that is not stale (based on a 60-minute timeout) stays locked, and if a user does not log out properly, the lock simply sits there until the next user picks up that application item by re-locking it (see the sketch below). That didn't help either; we kept getting the locking issue at the prior rate, i.e. once or twice a month. But it did do one thing - the entire application no longer froze; only the single user trying to acquire the lock on that one application item was impacted, and the production support team could kill the session holding the row lock as a regular incident without making a big fuss of it.
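Here is a minimal sketch of that reworked, row-level locking, again with a hypothetical schema and assuming one pre-existing lock row per application item; only that item's row is touched, so a stuck session can no longer freeze everyone else.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Sketch of the reworked row-level locking (hypothetical schema, as above):
    // a lock row is taken over only if it is free, already ours, or older than
    // the 60-minute staleness window - no bulk cleanup of other users' locks.
    public class RowLevelLockExample {
        public static boolean acquire(Connection con, long assetId, String userId) throws SQLException {
            con.setAutoCommit(false);
            String sql =
                "UPDATE ASSET_LOCK " +
                "   SET LOCKED_BY = ?, LOCK_TIME = SYSDATE " +
                " WHERE ASSET_ID = ? " +
                "   AND (LOCKED_BY IS NULL " +
                "        OR LOCKED_BY = ? " +
                "        OR LOCK_TIME < SYSDATE - (60 / 1440))";   // 60 minutes in Oracle date arithmetic
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, userId);
                ps.setLong(2, assetId);
                ps.setString(3, userId);
                boolean acquired = ps.executeUpdate() == 1;   // only this asset's row is touched
                con.commit();
                return acquired;
            }
        }
    }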

This gave us enough breathing room to look into the issue whenever it arose. No one even noticed that there was a production issue - it was now a normal-priority incident rather than a critical one. We continued through each and every point of change, one by one. Rather than go into detail on each of them, let me describe them briefly -

A. We changed the application's transaction strategy so that any user session that times out is rolled back by the application code itself, taking that responsibility back from WebSphere Application Server

B. Added indexes to all foreign keys of the database tables that have referential integrity constraints (parent-child relationships), since the table locking behavior around foreign keys changed (improved?) in Oracle 11g

C. Worked with the Oracle DBAs to investigate the database configuration for anything that would require an application code change - which bounced back to us with the verdict that the database configuration is PERFECT

D. Similarly, worked with the WebSphere Application Server admins to verify the configuration changes, looking for anything conspicuous that could be triggering the database locks

E. Verified the network communication with the corporate network administration team, checking whether any change in network topology might have increased latency or caused network packets to drop. Also worked with the Unix server admin team to check whether the network interface cards had any packet-loss issues, but no luck

F. Added JDBC driver parameters (oracle.jdbc.ReadTimeout, oracle.net.READ_TIMEOUT, etc.) to make sure that no wait on the network or the database could leave a connection hanging and keep the table lock active (see the sketch after this list). No luck

G. As mentioned before, we changed the code to change the behavior of the application
    - got rid of the dependency on IBM WebSphere Application Server to roll back uncommitted transactions
    - changed the application architecture from table-level locking to row-level locking - which hugely helped us buy precious time to continue the investigation
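For item F, the JDBC driver settings were along these lines. The timeout values below are illustrative rather than the exact ones we used, and oracle.net.READ_TIMEOUT is the older name of the read-timeout property in earlier thin drivers.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    // Sketch of the driver-level timeouts added in item F. Values are illustrative.
    public class JdbcTimeoutExample {
        public static Connection connect(String url, String user, String password) throws SQLException {
            Properties props = new Properties();
            props.setProperty("user", user);
            props.setProperty("password", password);
            // Abort a socket read that hangs for more than 5 minutes (milliseconds).
            // Older thin drivers used the name oracle.net.READ_TIMEOUT for the same idea.
            props.setProperty("oracle.jdbc.ReadTimeout", "300000");
            // Give up on establishing a connection after 10 seconds (milliseconds).
            props.setProperty("oracle.net.CONNECT_TIMEOUT", "10000");
            return DriverManager.getConnection(url, props);
        }
    }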

None of the above changes resolved the issue. And since the issue occurred roughly once a month and couldn't be reproduced in a non-production environment, every change took about a month to be proven worthless. Anyway, in July I decided to change the Oracle configuration from load-balancing RAC to failover RAC. We had discussed this long before but couldn't find any evidence to back up the idea that Oracle RAC could be the issue. One day in July, I was on my way home when I got an email from our application support team saying that they had found a database lock once again and asking for my confirmation to kill it (I had set up the process so that I'm in the loop for every such event). I told them to kill the locks, but one thing I found fishy: the screenshot of the database lock showed it on the 2nd node of the RAC. That shouldn't have been the case, because we had changed to failover mode, so an explicit change would be required to use the 2nd node. I asked the person to confirm which node the application was running on and got confirmation (this communication was going back and forth on my Blackberry) that the app was running on the first node. This gave me a clear indication that there was something fishy at the RAC level. The next day, I discovered that the RAC configuration change from load balancing to failover had been rolled back when the DBA rebooted the database server for maintenance, and load-balancing mode had been re-enabled. That was the moment of truth. I asked the DBA to change it back to failover mode, made sure the configuration change was persistent, and then waited two months, until this month, before declaring victory over this bug.

Now here's my hypothesis about this application bug: the application was using distributed database transactions, and the RAC configuration probably wasn't done correctly, so that when the two-phase commit took place, the second phase sometimes (around once or twice a month in our case) couldn't find the first-phase transaction and waited forever. Even though that scenario should surface as a different application error (with an Oracle error code), it's my only hypothesis so far. I'll keep my eyes open for concrete evidence to back it up.

5 comments:

Shahidul Mahfuz said...

Congratulations on your KILL. This definitely was an interesting story. Nice job (and you certainly will not get promoted for this solution). You are now the expert/experienced guy of the development/deployment process.

I didn't clearly understand the word "experience" at the beginning of my career. Nowadays, I see things happen repeatedly and have good control over the task - which gives me the ability to identify/solve the problem....

Btw, was it a BUG? Or is it a defect or something else? It was not in the specs.... :-)

Mohammad Masud said...

Thanks. "...things happen repeatedly..." is very interesting, especially when you try to investigate a software malfunction and you've deployed your application on a commercial application server, database server, server hardware, and operating system. Almost always, the response to a ticket you raise with their support team comes back (I would say bounces back) immediately saying, "it's not our problem, please check with your developer, Oracle (if the response is from IBM), or IBM (if the response is from Oracle)," blah blah.

I'm convinced that this was a bug (defect, issue, etc. are synonymous with it), for two practical reasons -

1. The application is developed to work 24 x 7. All the use cases of the software, as given by the users, had success scenarios in which the request was supposed to execute when clicked, but in this case the application was breaking, showing an error message that it couldn't process the request due to a lock on the database table. That is a clear violation of the success scenario laid down in the use case

2. This application actually worked smoothly for years and started breaking only after the infrastructure migration. So even if the users hadn't explicitly asked (in our case they did) that the application software should work, years of user experience set the expectation (which is actually a requirement) that the application should continue behaving in the same manner and shouldn't break once or twice every month. You can compare this with civic rights in a society, where the right is not written down but you still expect certain things from others; likewise, several years of unbroken application behavior (endorsed by the users by accepting it) set the requirement that it should continue to behave that way

Anonymous said...

Nice story. Good to know that the issue is finally fixed. A valuable lesson learnt, I believe.

-Sayem

Shahidul Mahfuz said...

The topic was related to killing a bug, and since by definition it was "supposed to work 24 x 7," it can be termed a bug.

However, it's a vague statement that it should work 24 x 7. It would be interesting to see what test cases can cover this statement. And as a tester (from my experience), I'm pretty confident they never include all possible scenarios (and that's human nature - you can't have 100% of anything).

What I've understood so far from my experience is that specs, explicit, implicit - these words act/vary differently from person to person and role to role...

Going off topic here...
Recently, with the new iOS 7 release, users found an issue: from the lock screen you can make a phone call. Now, at first glance, it sounds like a (straightforward) bug. But if you look at the details, it says it takes 7 tries to make that call from the lock screen.

Now, my question is, did anybody think about writing something like this in the specs? Or in the test plan/cases? Even if someone wrote something like this, it would be an nth-try kind of thing that may happen (from a black-box testing perspective, I have no idea what intelligence the developers put in while writing the code - no offense, I don't know how else to express it).

Usually, when and if I write this kind of test case, most of the time my experience is that it gets the lowest priority, as people won't do it or it's not in the specs!

I have no intention of making it a debate whether anything found by a user is an issue/bug/defect or not. I'm open to calling it by any name as long as the developer agrees to fix it. I have used this technique with the writer as well :)).

Mohammad Masud said...

No disagreement that "24 x 7" is a vague requirement. Honestly speaking, I used it as an expression, not as a requirement that it should be tested to work at 12 o'clock at night :). Anyway, I kind of derived that expression from the use case step that says something like "the system should show the data entry form when the user clicks the START button." So if at any point in time the system shows anything but the expected data entry form after the user has clicked that button, I would interpret that as not working as described in the use case. I absolutely agree with you that it's not easy to detect this kind of issue; that's the reason it took us a year to identify and fix it.

You made a very good point about the iOS bug (or feature?) that actually allows the user to make a call while the phone is locked. No doubt it's pretty hard to detect the kind of issue that takes seven steps to reproduce, and it's more a question of how much testing is enough (refer to http://quest-to-achieve.blogspot.com/2013/07/how-much-testing-is-enough-for-software.html) for that product or software. I would argue that it's quite a big deal that the security of the phone is compromised by this loophole. So Apple needs to decide how much testing should be done on its critical features, like the phone's security, versus how much on fun features like customizing your music controls.

But at the end of the day, you know and I know that there's no such thing as absolutely bug-free software in this world, and the question is how much we're ready to invest to make software relatively bug-free.