Amod Malviya, CTO, Flipkart, says the outage was not due to an infrastructure failure. He says changes in the nature of traffic affect Flipkart's complex algorithms, and throw up hidden choke points that don't show up in routine stress tests.


On July 22, the Xiaomi Mi3 smartphone went on sale on the website of Flipkart, India's largest online retailer. This was an exclusive launch introducing Xiaomi's phones to India. Drawn by the marketing blitz surrounding the launch, a flood of online buyers logged on to the site, and buyers started seeing an HTTP Error 503 Service Unavailable response.

Amod Malviya, CTO, Flipkart, says the outage wasn't due to an infrastructure failure. "In fact, we have almost never gone down in the past three years because of that sort of an issue," he says.

Explaining what led to the outage, he says, "We have SOA-based systems in place. When a customer lands on [the site], a large number of complex algorithms run in the background in a distributed services environment."

These algorithms assess the nature of traffic based on the transactions being executed on the website. They monitor several dynamic behaviors and attributes, such as which products are being viewed more often, which payment gateways are being used more, which allied accessories are being researched, and which combo deals are being chosen.

"When you have a change in the nature of traffic, it changes the interaction dynamics of those services, and this typically throws up hidden choke points which do not show up in a routine stress test. This is the kind of issue we discover in moments like these," Malviya says.

Malviya says that despite conducting a stress test to assess their preparedness for such a situation, the company could not avoid the outage. "A stress test doesn't necessarily capture a different pattern of traffic. When incoming traffic is of a different nature than what we usually see, the way the system experiences stress is different from the way you have been testing it out. That's when such a crisis occurs," he says.
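Malviya's point about traffic patterns can be illustrated with a small calculation. The sketch below is not Flipkart's system: the service names, capacities, and traffic shares are all hypothetical, chosen only to show how the same total request rate can pass a routine stress test and still saturate individual services when the mix of requests changes.

```python
# Hypothetical per-service capacities in requests per second (not real Flipkart figures).
CAPACITY = {"catalog": 10000, "cart": 4000, "payment": 2000}

# Hypothetical fraction of incoming requests that reaches each service.
USUAL_MIX = {"catalog": 1.0, "cart": 0.10, "payment": 0.05}       # mostly browsing
FLASH_SALE_MIX = {"catalog": 1.0, "cart": 0.60, "payment": 0.50}  # mostly buying


def saturated_services(total_rps, mix, capacity=CAPACITY):
    """Return the services whose offered load exceeds their capacity."""
    return sorted(
        name for name, share in mix.items()
        if total_rps * share > capacity[name]
    )


# Same total load, different traffic mix.
print(saturated_services(8000, USUAL_MIX))
print(saturated_services(8000, FLASH_SALE_MIX))
```

With these assumed numbers, a stress test that replays the usual browsing-heavy mix at 8,000 requests per second saturates nothing, while a launch-day mix at the same rate overloads the cart and payment services.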

"We were expecting a spike in traffic, but, as is obvious, the spike went way beyond what we had anticipated. As a result, a large number of customers were impacted," he says.

A similar situation arose when Flipkart launched the Moto G mobile phone earlier this year. "During the earlier crash, the issue that we identified is not the issue that led to the Xiaomi outage. With every such instance, we reduce the probability of future failures," Malviya says.

Malviya says Flipkart's response to the crash was two-pronged. "We had a contingency plan which enabled us to run in a degraded service mode. Besides, Flipkart has a very deep instrumentation that allows it to analyze what went wrong. Having the former helped us recover quickly from the outage, while the latter is helping us carry out RCA (root cause analysis) and find a relevant fix," he says.
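The article does not describe how Flipkart's degraded service mode works. As a generic illustration of the idea, the sketch below assumes a hypothetical product page that depends on a non-essential recommendation service. When that service fails, the page is still served, only without recommendations. All function names here are invented for the example.

```python
def fetch_recommendations(product_id):
    """Hypothetical non-essential service call; always fails here to simulate overload."""
    raise TimeoutError("recommendation service overloaded")


def render_product_page(product_id, recommender=fetch_recommendations):
    """Build the page, falling back to a reduced version if the recommender fails."""
    page = {"product_id": product_id, "degraded": False}
    try:
        page["recommendations"] = recommender(product_id)
    except (TimeoutError, ConnectionError):
        # Degraded mode: keep serving the core page without the optional feature.
        page["recommendations"] = []
        page["degraded"] = True
    return page


page = render_product_page("mi3")
print(page["degraded"], page["recommendations"])
```

The design choice in this sketch is to treat optional features as expendable under load, so that a failure in one service does not take down the request path that customers need to complete a purchase.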

Flipkart is now carrying out a root cause analysis of the outage. "RCA is an integral part of every technical glitch that we find at Flipkart. Though we have reached some conclusions, our RCA for this outage is still ongoing," he says.