Friday, November 8, 2013

How to add Facebook Share button on your blog

You can add Facebook's Share button to your blog site within a few seconds, and it will help you boost your blog posts' stats through sharing on this de facto social networking platform. Let's walk through step-by-step instructions on how to get it done on, for instance, a blogger.com site.

1. Go to the Template tab and click on the "Edit HTML" button

2. Paste the JavaScript code below immediately after the opening <body> tag:

<div id="fb-root"></div>
<script>(function(d, s, id) {
  var js, fjs = d.getElementsByTagName(s)[0];
  if (d.getElementById(id)) return;
  js = d.createElement(s); js.id = id;
  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
  fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>

3. Now you're ready to place the Facebook Share button anywhere on your blog site. You may add the Share button to each of your posts, or add a gadget for the entire blog site.

4. To add a gadget for the entire site, go to the Layout tab (in the left navigation shown in step 1) and click on the "Add a Gadget" link

5. Add the "HTML/JavaScript" gadget

6. Edit the newly added gadget by clicking on the "Edit" link, paste the code below into the content text area, and save it

<div class="fb-share-button" data-href="http://quest-to-achieve.blogspot.com" data-type="button_count"></div>

Don't forget to change the "data-href" value to your site's URL, otherwise it will always point to my blogger site. You can also do the same at the end of each of your posts by copying the snippet above and using the corresponding post URL as "data-href".

Enjoy your blog posts on the social networking site! And do share this post on your Facebook page.

Saturday, October 26, 2013

The great debacle of healthcare.gov

This may be the first time in history that the president of the United States of America, or probably any head of state in the world, has been blamed for the quality of a piece of software and has had to stand up to defend a mere software system. Though nobody should extract fun out of that fiasco, I do appreciate how it shows that technology has taken the epicenter of everyone's life in this country. Talk shows, news anchors, congressional hearings: everyone is talking about how software development works, how much testing was or wasn't done, how the integration of subsystems works within software, and so on. This is what I do for a living, so I couldn't resist the temptation to chip in with my two cents.

I want to make it clear that healthcare.gov is in no way a simple system. I would rather call it an integrated platform of very complex systems, in terms of the volume of work, the sheer amount of coordination needed among all the stakeholders, and the expectation of real-time communication among various existing government IT systems. Just to get an idea of how vast the volume of work was, consider this: it consists of 6 complex systems; 55 contractors worked in parallel; 5 government agencies were involved; it is used in 36 states; and it covers 4,500 insurance plans from 300 insurers. All of this tells us one thing: to deliver a successful system like that, the planning must be very precise, and the execution close to perfect.

Having said that, I will focus on three areas and try to reflect on them based on the experience I've gathered over my years in software engineering. Before I jump into my analysis of the issues, let's look at the architectural modules of the health care marketplace system. Among others, the system has these significant modules: the eligibility & enrollment module, the plan and financial management module, the data services module, the registration module, the income verification system, etc.

Now let's look at the details. The first area of failure, I believe, was planning. The system was designed as a real-time system and planned to go live in a big-bang fashion, which is, in my opinion, the single biggest mistake. I'll get to the real-time architecture later, but the big-bang release idea must have come from the political part of the decision-making process. This marketplace system connects to several disparate systems, so a phased approach, slowly ramping users into the system, would have been a better plan. The gradual ramp-up of load could have been achieved either by integrating with as few external systems as possible at first, or by rolling the system out to a few selected states and then gradually expanding the user base. The other planning failure was, in my opinion, belittling the role of testing in software development. The testing of each individual system might have been planned right, but integration testing evidently didn't receive the required level of priority. Any plan should focus on the highest-risk points and have a mitigation plan around those risks, and there shouldn't be any doubt that the integration of all these disparate systems was the highest risk to mitigate for this health care marketplace system.

The second area I would like to focus on is the architecture and design of the system. The marketplace platform was designed to work as a real-time system, which surprises me: the healthcare.gov site is meant to be accessed by millions of users while, on the back end, it connects to several of the federal government's legacy systems with the expectation that all of them will respond within a predefined time. How on earth did they think that every legacy government system (e.g. USCIS, IRS, Social Security), designed decades ago, would deliver the level of performance that a real-time system like healthcare.gov needs to function? A real-time design only makes sense when you have full control of all underlying systems, or at least a guarantee that each underlying system will respond to a request within a predefined period of time. healthcare.gov has neither full control over nor guaranteed response times from its dependent systems, so designing a real-time system on top of multiple legacy systems is nothing but a pipe dream. The real-time online marketplace should have been built bottom-up, i.e. first enable the downstream systems to deliver real-time responses, then open the grand door of healthcare.gov. Again, this may not have been acceptable given the politically set deadline of the Affordable Care Act, but, on the other hand, you can't always squeeze or compromise the software development process. On top of that, adding more people to a team isn't always a way to get software built faster (for reference, check out Brooks's law); in fact, it may add more delay and chaos to the project.
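For illustration, the asynchronous alternative suggested above can be sketched in a few lines of JavaScript. Everything here is hypothetical (the EnrollmentQueue name, the verifier callback, the statuses); it's a sketch of the idea, not of how healthcare.gov actually works: the front end accepts an application immediately, and the slow legacy-system verification happens later, off the user's critical path.

```javascript
// Hypothetical sketch: accept enrollments right away, verify asynchronously.
// Names here are invented for illustration only.
class EnrollmentQueue {
  constructor(verifier) {
    this.pending = [];       // applications awaiting legacy-system checks
    this.verifier = verifier; // slow check against downstream systems
  }

  // Synchronous front end: record the application and acknowledge at once,
  // instead of blocking the user on calls to IRS/USCIS-style back ends.
  submit(application) {
    this.pending.push({ application, status: "RECEIVED" });
    return { status: "RECEIVED", position: this.pending.length };
  }

  // Asynchronous back end: drain the queue at whatever pace the
  // legacy systems can sustain.
  processNext() {
    const item = this.pending.shift();
    if (!item) return null;
    item.status = this.verifier(item.application) ? "VERIFIED" : "REJECTED";
    return item;
  }
}
```

The point of the sketch is the decoupling: the user-facing acknowledgment no longer depends on every downstream system responding within a fixed window.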

The last area of my analysis is the test strategy of the software and its execution. In my opinion, the focus of the test strategy should have been on early integration of the system and early integration testing, because that was the biggest risk factor for a successful rollout. Though this could have been covered in the planning and in the architecture and design phases, a good, solid test strategy has the ability to influence the overall plan and could work as an early warning indicator of a disaster. If you aren't allowing the QA team to fully test the system after it's integrated, you may be living in la-la land, assuming it will all work fine as designed. I've mentioned in several older posts that the greatest mistake most software professionals make is making assumptions and never verifying them ahead of time. As a side note, I have another post (How much testing is enough for a software?) on the importance of testing and test strategy, where I explained how one should decide between a confidence-based and a risk-based testing approach, and where I closed by encouraging decision makers to boldly take risks they can afford. In this case, the Centers for Medicare & Medicaid Services (CMS), the managing organization of this insurance marketplace platform, did go down the path of taking risks, but in absolutely the wrong way. They decided to push integration testing (the biggest risk factor for this kind of system) to the very end of the release cycle and gave it just two weeks (according to CGI Federal), which, from my point of view, is a joke to anyone with a basic understanding of the software development process. Just compare that to the overall development period of the system, which was more than two years. If integration testing got only two weeks, I don't dare ask how much time was given to load testing and stress testing.
From my personal experience on one software project, I remember we failed to deliver on time, and we learned only a few weeks before go-live that we weren't going to meet the deadline. The single biggest mistake, among a few others, was made during the planning of that project: pushing the final integration to the very end of the release. The integration with the other system had been designed on paper, and anyone involved in software development knows that a design usually looks shinier on paper than it does when you do the real coding (that's why we have prototypes and proofs of concept in the software development process). Anyway, when the real integration started, we found that the paper design didn't work because some very basic assumptions had gone wrong (e.g. the primary keys of the two systems were named the same but used for completely different purposes), which led to a complete redesign and pushed the go-live out by months. I plan to write about that experience in detail in a separate post.

In the end, I would like to close on a positive note: after all, these problems are all fixable. I strongly believe these software glitches will be fixed, though probably not overnight, and the system might need to go through a thorough re-architecting. The best approach would have been to start with an asynchronous system in the first phase and then slowly move to synchronous real-time operation while upgrading the other federal government legacy systems, but I don't think the government can afford that at this moment, so they have to swallow this bitter pill and work it right.


Sunday, September 15, 2013

A brief history of a software bug

The month of September has huge significance in my professional career, to be precise, at my current job. Last year (2012), in September, we encountered a production defect in one of our Java web applications: users couldn't use the application due to database locks on a table that plays the role of a gateway controller for that application. And finally, in September of this year (2013), I have officially closed that defect as resolved. It was the longest-running production bug of my entire professional career, taking exactly one year to resolve; the second longest was a little over a month. It was an amazing, if occasionally frustrating, experience, so I thought I'd share the journey with you.

First, let me briefly describe the bug. The application has a central gateway table used to control access to the application's assets (I don't want to disclose the internal details), and each user needs to gain control of an asset (or, to be a little more technical, lock it) before continuing to work on it. The lock is released when the user completes the task or the lock goes stale. Staleness is determined by a specified amount of time; after that, whenever any user comes into the system to acquire a lock on any asset (not only the stale one), the system does housekeeping to clean up all existing stale locks. Yes, I know what you're thinking: it's not an efficient way to control access to anything in software, but it is what it is. We started seeing periodic (once or twice a month) database locks (these are Oracle database locks, not to be confused with the application's locks, which define ownership of an asset at a given point in time): a table lock on that gateway controller table that prevented all other users from acquiring ownership of assets to start business transactions. The quickest workaround was for the DBAs to kill the database locks, or to recycle the web application, which destroys all active connections and eventually releases the database locks. Now it was our job to debug it and find a fix.
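The locking scheme described above can be sketched roughly as follows. This is a simplified, hypothetical reconstruction (class and field names are invented; the real locks live in a database table), but it captures the shape of the design, including the housekeeping pass I mention:

```javascript
// Simplified sketch of the gateway-lock scheme described above.
// All names are illustrative; the real implementation is a database table.
const STALE_AFTER_MS = 60 * 60 * 1000; // locks older than this are stale

class GatewayLocks {
  constructor() { this.locks = new Map(); } // assetId -> { userId, takenAt }

  acquire(assetId, userId, now) {
    // Housekeeping on every acquire: one user cleans up everyone's
    // stale locks (the design smell called out above).
    for (const [id, lock] of this.locks) {
      if (now - lock.takenAt > STALE_AFTER_MS) this.locks.delete(id);
    }
    if (this.locks.has(assetId)) return false; // someone else owns it
    this.locks.set(assetId, { userId, takenAt: now });
    return true;
  }

  release(assetId, userId) {
    const lock = this.locks.get(assetId);
    if (lock && lock.userId === userId) this.locks.delete(assetId);
  }
}
```

Notice that the cleanup loop touches every lock in the system; in the database version, that global sweep is exactly what needed the table-wide lock.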

The first issue occurred in September 2012, immediately after the release in which we migrated from WebSphere Application Server 6.1 Fix Pack 17 to 6.1 Fix Pack 31 and from Oracle 10g to Oracle 11g R2 on Exadata. I believe in a rule of thumb that says "if your working application is broken, look at what has just changed." But here that answer covered such a wide area that it was almost impossible to narrow down the change set. Changes had been made to the application server machine, the Unix version, the application server version, the database server machine, the database server version, the network configuration, and even the data center; in other words, almost everything but the code.

I started taking one item at a time to narrow down the target. The first point of investigation was the application code, looking for any place where a database transaction wasn't properly committed. Even though the application had been working for years with no issue, my thought process was: maybe the code was poorly written, with a fundamental weakness that works in the sunny-day scenario, and the change in the underlying environment turned that weakness into a deadly bottleneck. I found a place where the transaction was being committed/rolled back by WebSphere Application Server (WAS) through a WAS-specific configuration. My first thought was: could that be the source of the lock issue? But then the question arises: how come that same piece of code worked for years until we migrated to this new infrastructure? Meanwhile, the issue kept repeating once or twice a month, and each time the production server needed a recycle, which requires at least a two-hour blackout window. After a few occurrences the issue was escalated to senior management, and I was pulled in to explain; I had taken over the application's management just a few months before the migration, so I was essentially pushed to prove myself by fixing the production issue. I started looking for whatever it would take to fix it sooner rather than later.

My firm hunch was that the issue was caused by the change in infrastructure (as the code base had been unchanged for months), but I took a shot at that particular transaction handling where the WebSphere configuration was used. It was possible that the committing/rolling back of transactions that had been handed over to WebSphere Application Server wasn't being done right in the newly changed infrastructure. I had no proof to support my hunch, but I decided to act rather than sit idle, hopelessly waiting for another production blackout. Moreover, the usual bouncing game had started, with the IBM and Oracle teams each insisting the problem wasn't caused by their system while blindly blaming the application code. I politely told them that the application code hadn't undergone any changes for months, and then the push-back game started between the two parties. Anyway, I asked the team to change the code so that connection commit/rollback was handled within the code instead of by WAS. We waited another few weeks, and the lock issue resurfaced: the change in the code didn't work.

As a next step, I changed my strategy. Rather than looking for a silver bullet and waiting months (the issue occurred infrequently but regularly, once or twice a month, and only in the production environment), I decided to change the underlying code that brought the application to its knees, i.e. I changed the definition of the application's locking, which had required a periodic cleanup of data that caused the table to be locked and eventually froze the entire application. To give a backdrop: the cleanup of inserted records was implemented to prevent a build-up of stale application locks left by users who close their browser abruptly without properly logging out. This cleanup was attached to the user's logout action and also to every attempt to acquire a lock, i.e. first clean up everyone else's mess (a design philosophy I never liked: one poor guy is responsible for cleaning up another guy's mess) and then continue using the application. Anyway, I redefined the application item's lock as anything that's not stale (via a timeout of 60 minutes); if a user doesn't log out properly, the lock simply stays there until the next user picks up that application item by re-locking it. That didn't help either: we kept getting the locking issue at the prior rate of once or twice a month. But it did accomplish one thing: the entire application no longer froze; only the single user trying to acquire the lock on that application item was impacted, and the production support team could kill the session holding the row lock as a regular incident without making a big fuss about it.
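The revised per-item scheme can be sketched roughly as follows (again, names are invented for illustration; the real locks live in a database table). Instead of a global cleanup pass, an acquire now touches only the one row it wants and simply takes over a lock that has gone stale:

```javascript
// Sketch of the redesigned lock: no global cleanup sweep; a stale lock
// is taken over by the next user who wants that one item, so contention
// is confined to a single row instead of the whole gateway table.
const STALE_AFTER_MS = 60 * 60 * 1000; // 60-minute staleness timeout

function tryAcquire(locks, assetId, userId, now) {
  const lock = locks.get(assetId);
  // Free, or stale (the owner closed the browser over an hour ago):
  // re-lock the item for the new user.
  if (!lock || now - lock.takenAt > STALE_AFTER_MS) {
    locks.set(assetId, { userId, takenAt: now });
    return true;
  }
  return false; // actively held by someone else
}
```

This is why the failure mode shrank from "whole application frozen" to "one user blocked on one row."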

Now we had enough time to look into the issue whenever it arose. No one even noticed there was a production issue, since it was now a normal priority rather than a critical one. We continued working through every point of change, one by one. Rather than going into detail on each, let me describe them briefly:

A. We changed the application's transaction strategy to roll back any user session that times out, taking that responsibility back from WebSphere Application Server into the application code

B. Added indexes to all foreign keys of database tables that have referential integrity constraints (parent-child relationships), since the table locking mechanism around them reportedly changed in Oracle 11g

C. Worked with the Oracle DBAs to investigate the database configuration for anything that should make us change the application code; this bounced back to us with the verdict that the database configuration was PERFECT

D. Similarly, worked with the WebSphere Application Server admins to verify whether any configuration change was conspicuously triggering the database locks

E. Verified network communication with the corporate network administration team, checking for any change in network topology that might have increased latency or caused packets to drop. Also worked with the Unix server admin team to see whether the network interface card had any packet-loss issue. No luck

F. Added JDBC parameters (oracle.jdbc.ReadTimeout, oracle.net.READ_TIMEOUT, etc.) to make sure no wait on the network or the database could leave a connection hanging and keep the table lock active. No luck

G. As mentioned before, we changed the code to change the behavior of the application:
    - got rid of the dependency on IBM WebSphere Application Server to roll back uncommitted transactions
    - changed the application architecture from table-level locking to row-level locking, which hugely helped us buy precious time to continue the investigation

None of the above changes resolved the issue. And since the issue occurred about once a month and couldn't be reproduced in a non-production environment, each change took us around a month to prove worthless. Then, in July, I decided to change the Oracle configuration from a load-balancing RAC to a failover RAC. We had discussed this long before but couldn't find any evidence to back up the idea that Oracle RAC could be the issue. One day in July, I was on my way home when I got an email from our application support team: they had found a database lock once again and were asking for my confirmation to kill it (I had set up the process to make sure I was in the loop for every event). I told them to kill the locks, but one thing struck me as fishy: the screenshot showed the database lock on the 2nd node of the RAC. That shouldn't have been the case, because we had changed to failover mode, so an explicit change would be required to use the 2nd node. I asked the person to confirm which node the application was running on and got confirmation (this conversation was going back and forth on my BlackBerry) that the app was running on the first node. That clearly indicated something fishy at the RAC level. The next day, I discovered that the RAC configuration change from load balancing to failover had been rolled back when the DBAs rebooted the database server for maintenance, re-enabling load-balancing mode. That was the moment of truth. I asked the DBAs to change it back to failover mode, made sure the configuration change was persistent, and then waited two months, until this month, before declaring victory over this bug.

Here's my hypothesis about this application bug: the application uses distributed database transactions, and the RAC configuration probably wasn't done correctly, so that when the two-phase commit takes place, the second phase sometimes (around once or twice a month in our case) doesn't find the first-phase transaction and waits forever. Granted, that scenario should produce a different application error (with an Oracle error code), but it's my only hypothesis so far. I'll keep my eyes open for concrete evidence to back it up.

Tuesday, July 30, 2013

Design software and IT systems in the way we're designed: Organic Software Design

When we are given a problem to solve in IT (software or hardware), we have a laundry list of qualities to align for the developed system. To name a few: scalable, extendable, reliable, secure, fault tolerant, no single point of failure (SPoF), etc. They're absolutely required. But I feel that we narrow in so much on building a good system that most of the time we miss the big picture: we try to build systems that are unnatural within the natural universe we live in.

For decades, we have accepted the notion that whatever we build in IT is artificial and meant to remain artificial, which is true, but no one is stopping us from being influenced by nature. In fact, nature is the humongous system that has survived millions of years, proving it is reliable, scalable, fault tolerant, with very few SPoFs, you name it. Other science and technology disciplines have always adopted influences from nature, but in IT the influence seems negligible.

It's time to think in a brand new way when we design an IT system: keep the design organic. One very important design factor that is abundant in nature is distributed processing. To be more specific, nature has built this world on the essence of distributing both processing and risk while keeping centrally managed control and monitoring. Apache Hadoop is one system that became tremendously successful by keeping distributed processing and distributed risk at the core of its design while keeping a centrally managed control point: in Hadoop, the DataNodes (highly distributed in processing and risk) and the NameNode (central management) are examples of that.

Another philosophy of natural design is multilevel security. Think about when we eat something, or something as simple as breathing: first-level defenses are always in place at the interfaces, though they're not very strong and are definitely not intended to replace the real antibody system inside.

The checklist below can be run against a software or IT system to determine whether its design is organic:
1. The system doesn't consume or produce more data than it absolutely needs. There should be a quality ratio between how much data a system uses and how much it absolutely needs to perform a process. Deriving data, rather than duplicating it, should be a must.

2. It recycles its own garbage data, i.e. recycling is part of the system's design. When we design a system, we mostly design it to produce data but rarely design how to recycle the information it uses. For example, when a program or process no longer needs some specific data, there has to be a way to recycle that information.

3. It collaborates to perform at its best, i.e. it performs better or can do more when it's able to collaborate with other systems.

4. Multilevel security is in place. Traditional password-based security is unnatural: once you break it, you're in full control. Natural security would look like this: at the entrance of the system there's a first pass of security (biometric authentication would be the first choice), and within the system every movement is monitored, with an army of security processes (just like our antibodies, RBCs, WBCs, etc.) ready to stop any suspicious move. Pattern recognition, profiling of user activity, etc. would be used to differentiate good moves from bad ones.

5. The system should be intelligent in nature. This is required to build most of the features mentioned above, but it should go beyond security, data recycling, etc. The intelligent system may or may not be an Artificial Intelligence (AI) system, but it ought to be intelligent enough to interact with its environment (network quality, data quality, availability of dependent systems) and, based on those, upgrade or downgrade the availability of its features.

I've been thinking about this topic for quite some time, so I thought I'd start putting it on my blog to see if it can be taken any further. I would also like to go a bit more in depth on some of the checklist items in future posts.

How much testing is enough for a software?

This is a million dollar question, especially when you're in release crunch time and find out that you don't have automated test suites for your, technically speaking, System Under Test (SUT). So how do you define the testing boundary of a piece of software? Like most other questions in this world, the answer is: it depends. It depends on what you're testing, why you're testing, and, most importantly, how you're testing it: manually or in an automated fashion. For the remainder of this post, let me clarify what I believe is the goal of software testing: to certify the behaviors of a software by labeling whether they work or not; releasing it to the end user is a whole different ball game.

Let's first take the most important factor, i.e. how you're testing. If you happen to have fully automated test cases and test suites, my immediate response is: run 'em all. This is the safest and most cost-efficient way to certify the behavior of a software. Take the Microsoft way: for MS Windows, they build and execute the entire test suite every night. If you can afford it, do it; why take the chance? In the statistical world, we take samples because getting the entire population of data is unrealistic. Similarly, if you don't have that luxury, you pick and choose based on the reality and the expectations set for the software. I explain this in the later part of this post.

Now the next factor: when you're testing without automated test coverage (or at least the time frame doesn't allow you to run the full automated test suites) and you have to test the software manually to certify it. If you're under pressure to complete the testing within a specified time frame, my suggestion is: go for targeted testing. To determine what and how much to test, follow the Goldilocks principle: don't over-test it, don't under-test it, test what is JUST RIGHT. You'll always find that you can cover 80% of the software's features by executing just around 20% of the test cases you have, while you would spend the remaining 80% of your resources to cover the rest; check the famous 80-20 rule if you don't take my word for it: http://en.wikipedia.org/wiki/Pareto_principle. So identify that 20% of test cases and run them, and if you're asked to release the software before you can run the remaining ones, go ahead and release it. One important thing to remember: release it with a confidence factor attached. So, for instance, when you've run 80% of its test cases, label it "QA certified with 80% confidence." I know you can't release it to the external world with that tag, but you should nonetheless have that number attached to communicate the risk factor to management. And, most importantly, to cover your back.
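The "vital 20% first" idea above can be sketched in a few lines. This is a hypothetical illustration, not a standard API: each test case is tagged with the percentage of features it covers (assumed non-overlapping for simplicity), and we greedily run the highest-value tests until the target confidence level is reached.

```javascript
// Hypothetical sketch of targeted test selection under time pressure.
// tests: [{ name, coverage }] where coverage is an integer percent of
// features exercised, assumed non-overlapping across tests.
function selectTests(tests, targetPercent) {
  // Rank by coverage so the vital few come first (the 80-20 idea).
  const ranked = [...tests].sort((a, b) => b.coverage - a.coverage);
  const selected = [];
  let covered = 0;
  for (const t of ranked) {
    if (covered >= targetPercent) break; // target confidence reached
    selected.push(t.name);
    covered += t.coverage;
  }
  // The attained percentage becomes the "QA certified with N% confidence" tag.
  return { selected, confidence: Math.min(covered, 100) };
}
```

The returned confidence number is exactly the risk-communication label suggested above: run the selected subset, release, and report the rest as uncovered risk.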

The last but definitely not least factor is what you're testing. If you're testing NASA's space flight module, you have no choice but to test it in FULL, and in that case you're certainly not told to finish testing by a certain date; rather, the release, i.e. the mission launch date, is determined by when you're able to complete 100% of the testing. The same is true for medical equipment software or life-support systems. But when you're testing a non-mission-critical system, where missing a bug won't take the company down, and you're given a deadline to meet, go ahead boldly with confidence-based testing. (I remember once logging in to Facebook and being taken to someone else's profile, within which I could see certain photo albums of mine; that happened around 2010, and now, in 2013, Facebook has turned out to be the de facto platform for social networking. That bug was surely missed by some test engineer, but who cared?) One more thing about test coverage: there are cases where you want both positive and negative test cases for a feature, and there are cases where you're fine running only positive test cases (applicable to software built for in-house use that will never leave the corporate intranet boundary).

The bottom line is: you always want to create every possible test case for every feature and run them all, but you need to act based on the reality on the ground and not hesitate to take calculated risks.