This is the first time in history that the president of the United States of America, or probably any head of state around the world, has been blamed for the quality of a piece of software and has had to stand up to defend it. Nobody should take pleasure in that fiasco, but I do appreciate how technology has moved to the center of everyone's life in this country. Talk shows, news anchors, and congressional hearings are all discussing how software development works, how much testing was or was not done, how subsystems are integrated within a piece of software, and so on. This is what I do for a living, so I couldn't resist the temptation to chip in with my two cents.
I want to make it clear that healthcare.gov is in no way a simple system. I would rather call it an integrated platform of very complex systems, given the volume of work, the sheer amount of coordination needed among all the stakeholders, and the expectation of real-time communication among various existing government IT systems. To get an idea of how vast the work was, consider these numbers: it consists of 6 complex systems, 55 contractors worked on it in parallel, 5 government agencies were involved, it is used in 36 states, and it covers 4,500 insurance plans from 300 insurers. All of this tells us one thing: to deliver a system like that successfully, the planning had to be very precise and the execution close to perfect.
Having said that, I will focus on three areas and reflect on them based on the experience I have gathered over years of working in software engineering. Before I jump into my analysis of the issues, let's look at the architectural modules of the health care marketplace system. Among others, the system has these significant modules: the eligibility and enrollment module, the plan and financial management module, the data services module, the registration module, and the income verification system.
Now let's look at the details. The first area of failure, I believe, was planning. The system was designed as a real-time system and planned to go live in a big-bang approach, which is, in my opinion, the single biggest mistake. I will explain the real-time architecture later, but the big-bang release idea must have come from the political side of the decision-making process. This marketplace connects to several disparate systems, so it would have been a better plan to roll it out in phases and slowly ramp up users. The gradual ramp-up of load could have been achieved either by integrating with as few external systems as possible at first or by rolling out to a few selected states and then gradually expanding the user base. The other planning failure was, in my opinion, belittling the role of testing in software development. The testing of each individual system might have been planned well, but integration testing might not have received the required priority. Any plan should focus on the highest-risk points and have a mitigation plan around them. There shouldn't be any doubt that the integration of all these disparate systems was the highest risk to mitigate for this health care marketplace.
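To make the phased idea concrete, here is a minimal sketch of a rollout gate in Python. It is only an illustration of the state-by-state and percentage-based ramp-up described above, not how healthcare.gov was or should have been built. The phase table, the state codes, and the function names are all hypothetical.

```python
import hashlib

# Hypothetical rollout phases: which states are open and what share of
# their users is admitted. The states and percentages are made up.
PHASES = [
    {"states": {"VA"}, "percent": 10},
    {"states": {"VA", "TX"}, "percent": 50},
    {"states": {"VA", "TX", "FL"}, "percent": 100},
]


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_admitted(user_id: str, state: str, phase: int) -> bool:
    """Admit a user only if their state is open and their bucket is in range."""
    config = PHASES[phase]
    return state in config["states"] and bucket(user_id) < config["percent"]
```

Because the bucket is derived from a hash of the user id, a user admitted in an early phase stays admitted as the percentage grows, so load increases gradually instead of arriving all at once.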
The second area I would like to focus on is the architecture and design of the system. The marketplace platform was designed to work as a real-time system, which is a bit surprising to me: healthcare.gov is meant to be accessed by millions of users, while in the back end it connects to several legacy systems of the federal government with the expectation that all of them will respond within a predefined time. How on earth did they think that every legacy government system (USCIS, IRS, Social Security, and so on), designed decades ago, would deliver the level of performance that a real-time system like healthcare.gov needs in order to function? A real-time system only makes sense when you have full control of all the underlying systems, or at least a guarantee from them that they will respond to a request within a predefined period of time. In this case, healthcare.gov has neither full control over the systems it depends on nor a guaranteed response time from them, so designing a real-time system on top of multiple legacy systems is nothing but a pipe dream. The real-time online marketplace should have been built bottom-up, i.e. first enable the downstream systems to deliver real-time responses and then open the grand door of healthcare.gov. Again, this may not have been acceptable given the politically set deadline of the Affordable Care Act but, on the other hand, you can't always squeeze or compromise the software development process. On top of that, adding more people to the team isn't always a way to get software faster (for reference, check out Brooks's law); in fact, it may end up adding more delay and chaos to the project.
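A little arithmetic shows why a synchronous real-time design is so fragile when it sits on top of several systems it does not control. The sketch below assumes, purely for illustration, that each dependency answers within the deadline independently of the others. The on-time rates are invented numbers, not measurements of any government system.

```python
from math import prod


def chance_all_on_time(on_time_rates):
    """Probability that every dependency answers within the deadline.

    Assumes the dependencies behave independently, which is a
    simplification made only for this illustration.
    """
    return prod(on_time_rates)


# Hypothetical on-time rates for three back-end systems.
rates = [0.99, 0.95, 0.90]
print(round(chance_all_on_time(rates), 3))  # about 0.846
```

Even when every individual system looks reasonably reliable, a request that must wait for all of them succeeds noticeably less often than the weakest one alone, and every extra dependency makes it worse.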
The last area of my analysis is the test strategy of the software and its execution. In my opinion, the test strategy should have focused on early integration of the system and on testing that integration, because that was the biggest risk to a successful rollout. This could have been covered in the planning, architecture, and design phases, but a solid test strategy also has the ability to influence the overall planning and could work as an early warning indicator of any disaster. If you don't allow the QA team to fully test the system after it is integrated, you may be living in la-la land, assuming that it will all work as designed. I have mentioned in several older posts that the greatest mistake most software professionals make is to make assumptions and never verify them ahead of time. As a side note, I have another post (How much testing is enough) on the importance of testing and its strategy, where I explained how one should decide between a confidence-based and a risk-based testing approach, and which I ended by encouraging decision makers to boldly take the risks they can afford. In this case, the Centers for Medicare & Medicaid Services (CMS), the organization managing this insurance marketplace platform, did go down the path of taking risks, but in absolutely the wrong way. They decided to push integration testing (the biggest risk factor for this kind of system) to the very end of the release cycle and allowed just two weeks for it (according to CGI Federal), which is, from my point of view, a joke to anyone with a basic understanding of the software development process. Just compare this to the overall development period of the system, which was more than two years. If integration testing got only two weeks, I don't dare ask how much time was given to load testing and stress testing.
From my own experience: on one software project we failed to deliver on time, and we only found out a few weeks before go-live that we were not going to meet the deadline. The single biggest mistake, along with a few others, was made during planning: the final integration was pushed to the very end of the release. The integration with the other system was designed on paper, and anyone involved in software development knows that a design usually looks shinier on paper than it does once you do the real coding (that's why we have prototypes and proofs of concept in the software development process). Later, when the real integration started, we found that the paper design didn't work because some very basic assumptions turned out to be wrong (e.g. the primary keys of the two systems had the same name but were used for completely different purposes). This led to a complete redesign, which pushed the go-live out by months. I plan to write about that experience in detail in a separate post.
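An assumption like "the same key means the same thing in both systems" is cheap to check early. Here is a minimal sketch of such a check, under the assumption that a small sample of records can be exported from each system as a dictionary keyed by the shared id. The system names, the field name, and the sample data are hypothetical and are not taken from the project I described.

```python
def mismatched_keys(system_a, system_b, compare_field):
    """Return shared keys whose records disagree on compare_field."""
    shared = system_a.keys() & system_b.keys()
    return sorted(
        key for key in shared
        if system_a[key][compare_field] != system_b[key][compare_field]
    )


# Hypothetical samples: both systems expose an "id", but id 101 clearly
# does not refer to the same person in each of them.
billing = {101: {"name": "Alice"}, 102: {"name": "Bob"}}
crm = {101: {"name": "Carol"}, 102: {"name": "Bob"}}

print(mismatched_keys(billing, crm, "name"))  # [101]
```

Running something like this against real sample data in the first weeks of a project would have surfaced the wrong assumption long before the final integration.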
I would like to close on a positive note: after all, these problems are all fixable. I strongly believe that all these software glitches will be fixed, but probably not overnight, and the system might need a thorough re-architecting. The best option would have been to start with an asynchronous real-time system in the first phase and then slowly move to a synchronous real-time one while upgrading the other federal legacy systems, but I don't think the government can afford this at the moment, so they have to swallow this bitter pill and get it right.
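For readers wondering what the asynchronous first phase might look like, here is a minimal in-memory sketch. It assumes the slow legacy verification can be run later by a background job, so the user gets an immediate "pending" answer instead of waiting on every back-end system. The class, method, and status names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Application:
    applicant_id: str
    status: str = "PENDING"


class EnrollmentInbox:
    """Accepts applications right away and verifies them later."""

    def __init__(self, verify):
        self._verify = verify  # callable standing in for a slow legacy check
        self._pending = deque()
        self._applications = {}

    def submit(self, applicant_id):
        """Record the application and return immediately."""
        application = Application(applicant_id)
        self._applications[applicant_id] = application
        self._pending.append(application)
        return application.status

    def process_pending(self):
        """Run by a background job whenever the legacy systems are reachable."""
        while self._pending:
            application = self._pending.popleft()
            verified = self._verify(application.applicant_id)
            application.status = "VERIFIED" if verified else "REJECTED"

    def status(self, applicant_id):
        return self._applications[applicant_id].status
```

The user-facing call to `submit` never waits on a legacy system. Only `process_pending` does, and it can be retried or throttled without the site appearing to be down.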