At a Computerworld Hong Kong event years ago, I heard a senior executive explain how online retailer Amazon re-engineered their e-commerce process for the Christmas gift-giving season.
Like many retail businesses, Amazon experiences an extraordinary surge in business during this time. But unlike bricks-and-mortar shops, Amazon's business--from order-taking to payment-processing--is data-centric.
The executive said that Amazon's "Christmas rush," a flood of orders requiring a supply-chain that guarantees delivery by December 25, couldn't be replicated. So the e-commerce giant built a test-system which ran in parallel with their online system, and ran EVERY order through it simultaneously to see if it would break.
And they did it two years in a row, he said, before migrating to the new system. Amazon couldn't afford a misstep with their mission-critical business engine. Not during the short timeframe when the data pours in furiously and every last byte needs to be processed, because shops that fail to deliver on Santa's schedule make kids cry, and parents have long memories.
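For readers who want to picture that kind of parallel run, here is a minimal Python sketch. It is not Amazon's code, and nothing in it comes from the executive's talk: the class name `ParallelRun`, the two stand-in systems and the "large orders fail" weakness are all hypothetical. The idea is only that every order goes to the live system, a copy goes to the candidate system, and any candidate failure is logged rather than shown to the customer.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ParallelRun:
    """Send every order to the live system and a copy to the candidate system."""
    live: Callable[[dict], str]
    candidate: Callable[[dict], str]
    mismatches: list = field(default_factory=list)
    candidate_failures: list = field(default_factory=list)

    def handle(self, order: dict) -> str:
        result = self.live(order)  # the customer only ever sees this result
        try:
            # Pass a copy so the candidate cannot alter the live order.
            shadow = self.candidate(dict(order))
            if shadow != result:
                self.mismatches.append(order["id"])
        except Exception as exc:
            # A candidate failure is recorded, never shown to the customer.
            self.candidate_failures.append((order["id"], repr(exc)))
        return result


def live_system(order: dict) -> str:
    # Stand-in for the existing production system (hypothetical).
    return f"confirmed:{order['id']}"


def candidate_system(order: dict) -> str:
    # Stand-in for the new system, with an invented weakness:
    # it falls over on large orders.
    if order["quantity"] > 100:
        raise RuntimeError("candidate overloaded")
    return f"confirmed:{order['id']}"


run = ParallelRun(live_system, candidate_system)
results = [run.handle({"id": i, "quantity": q}) for i, q in enumerate([1, 500, 3])]
```

In this toy run all three orders are confirmed by the live system, while `run.candidate_failures` holds one entry for the large order: the sort of break you want to find before the migration, not after.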
Carat Cheung Ming-nga was shedding tears last week--tears of joy at being named Miss Hong Kong. But Netizens who expected to have a say in the voting shredded their handkerchiefs instead, as TVB's much-anticipated online voting scheme went haywire. The aftermath: apologies from TVB executives, finger-pointing, press-conferences, Internet-forum rants, and hundreds of complaints to the Communications Authority.
What went wrong? The "H-word" (hackers) was immediately trotted out. TVB's general manager for broadcasting Cheong Sin-keung said: "There was an unusual high volume of traffic, and we don't rule out the possibility that hacking activities were involved."
But was it? Websites in Hong Kong like Urbtix and online events like the POPvote straw-poll earlier this year have experienced traffic-spikes that blitzed their servers. The problem is compounded when Netizens, eager for tickets to a Joey Yung concert or the Rugby Sevens, repeatedly hit the "refresh" button, further overwhelming the system. No hackers involved, just a data-stampede.
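A back-of-the-envelope simulation shows how quickly the "refresh" habit compounds. This is a toy model under loudly stated assumptions, not a reconstruction of any real system: the function `simulate`, the user count, the capacity figure and the refresh rate are all invented for illustration, and it assumes the worst case in which every duplicate request eats the same server capacity as a fresh one.

```python
def simulate(users: int, capacity: int, refreshes: int, ticks: int) -> list:
    """Return (requests_offered, users_served, users_waiting) for each tick.

    Assumptions (all hypothetical):
      - every user wants to get through exactly once;
      - the server handles at most `capacity` requests per tick;
      - on the first tick each user sends one request; after that, each
        user still waiting sends `refreshes` requests per tick;
      - duplicate requests consume capacity like any other request.
    """
    waiting = users
    history = []
    for tick in range(ticks):
        per_user = 1 if tick == 0 else refreshes
        offered = waiting * per_user
        handled = min(offered, capacity)
        # Duplicates waste capacity: only handled // per_user distinct users get in.
        served = min(waiting, handled // per_user)
        waiting -= served
        history.append((offered, served, waiting))
    return history


# 10,000 users, room for 2,000 requests per tick, five ticks.
calm = simulate(users=10_000, capacity=2_000, refreshes=1, ticks=5)
storm = simulate(users=10_000, capacity=2_000, refreshes=5, ticks=5)

peak_storm_traffic = max(offered for offered, _, _ in storm)
```

With patient users the queue drains by the fifth tick. With everyone hitting refresh five times per tick, 6,400 of the 10,000 users are still waiting at the end, and peak traffic reaches 40,000 requests per tick, four times the number of people actually involved.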
The problem is scalability, and TVB's strategy was to leverage a cloud-based platform to handle the traffic: Microsoft's Azure. That's a good plan, but as ever, the devil is in the details.
"People seem to assume that cloud gives you unlimited resources in every possible way, and it just doesn't," said Richard Stagg, managing consultant at Hong Kong-based security consultancy Handshake Networking. "More bandwidth, maybe, and more flexibility, but in the end a Web application has as much resources as it has been allocated and those resources can be consumed just like any other Web server."
Also, cloud-based computing resources are first and foremost computers: when they're overloaded they may require rebooting, and reboots are never instant. While cloud is a more scalable platform, as Stagg noted, it requires allocation of resources and cost-benefit analysis--all subject to TCO and ROI scrutiny. Sometimes it's worth it: Amazon valued their e-commerce operation so highly they ate two years of operating costs for a test-system to ensure it would handle peak loads under time-critical pressure.
The Miss Hong Kong telecast provided a 10-minute window not only for viewers to vote for their favorite ingenue, but also to enter a lucky draw for a HK$480,000 Mini Cooper automobile. Voting for TVB's Miss Hong Kong 2012 could be done via a Web application and a mobile application--the mobile app is called "TVB fun."
The results were catastrophic: "...our engineers identified some unusual data traffic targeting the TVB fun application in the operating records of the related voting system," said Microsoft Hong Kong in a statement issued just after the incident. "Within the voting period, the system already recorded unusual data traffic which was many times higher than the original expectation. This was substantially higher than the total number of people watching the program and also the total population of Hong Kong."
MS HK posited at that time that this was "grounds to suspect that the application was attacked by malicious hackers, causing abnormal disruption in the operation of the application and thus the subsequent uploading of related data to the cloud system for further processing." But our sister technology site Asia Cloud Forum followed up on the story--ACF editor Carol Ko conducted an interview with Chin-Tang Chin, director of Microsoft Hong Kong's developer and platform evangelism group, which shed more light on the situation.
"Assumptions are made in any application systems design, such as that on the expected number of voters," said Chin. "In reality, when there's unusual traffic pattern that breaks the original assumptions, the system'd behave differently from how it's supposed to behave...Systems fall apart when traffic pattern is [un]usual and behave widely different from the assumptions."
Chin said that "the Miss Hong Kong voting application [TVB fun] was jointly developed by TVB and Cherrypicks while Microsoft provided technical support with the underlying cloud platform [Windows Azure]...However, Microsoft may not have complete information on the project because the company wasn't involved in developing the voting application. Technology is part of the entire app design process during which a company has to take many business-related issues into consideration."
The Microsoft Hong Kong director also said: "I'd like to clarify that there's an app on top of the voting system while Windows Azure is underneath the same system. Throughout the entire process of Miss Hong Kong voting, Windows Azure did not suffer any impact, and Azure was functioning exactly the way it was supposed to do."
So what happened? TVB has said they plan to hire a third-party consulting firm to investigate the incident. So, I'm going to speculate. I'm a tech journalist, this is a blog post and I'm going to examine filtered bits of data from the past few days and give it my best guess.
The TVB fun mobile app was developed at least in part by Cherrypicks, a Hong Kong firm with an excellent track record. The firm issued a statement that reads, in part: "Cherrypick[s] has been providing [the] TVB Fun mobile apps service for Television Broadcast Limited since 2011. The service includes providing entertainment content and voting features. Generally, it has been effective and well-functioning since [the start of] operation." And TVB has issued a statement with its version of events.
The app is available for both the iOS and Android platforms, and Android malware has been on the rise lately (the number of new malicious programs targeting the Android platform almost trebled in the second quarter of the year, according to figures from Kaspersky Lab, and security firm Sophos offers a free Sophos Mobile Security app for Android devices). But I have yet to hear of any malware aimed at hijacking the TVB fun app, and consider this possibility extremely unlikely.
The general consensus is that the traffic which crashed TVB's Miss Hong Kong online party came via the mobile app. Those of us who live in Hong Kong know that the Miss Hong Kong pageant is a high-profile event (when the local film industry was thriving, MHK-winners were invariably offered movie-contracts). We also know that Hong Kongers are glued to their mobile devices, and keyed into local pop culture.
So let's freeze time at the T-minus 15-minute point for the online votefest of Sunday, August 26, 2012. Flat-screen televisions across the HKSAR announce that voting for your favorite MHK candidate (and entry in the lucky draw) will open soon. Some press the "enter" button on their computer, smartphone or iPad. Some press it on all three devices. Microsoft's engineers "identify some unusual data traffic."
At some point, likely before the stated voting-period, the system reaches critical mass. As MS HK's Chin pointed out: "When an unusual traffic pattern occurs, which breaks the original assumptions made for the application, this will cause the system to behave differently from what it is supposed to behave."
Votes were kicked back due to overload; viewers screamed in frustration and hit the buttons again and again on their devices. Servers howled in anguish. The dreaded "infinite-loop ping" circled from Aberdeen to Sheung Shui (and across the border?) as server-requests were recycled on a Möbius strip of discontent.
That's my best guess. A highly popular event produces great interest in a short time-window, and a massive overload of data-requests shuts down the allocated resources. It's happened before, and it'll happen again (US President Barack Obama's unexpected August 29 "Ask Me Anything" appearance on reddit.com broke the Web site's servers).
It's the simplest explanation.
As Richard Stagg noted, cloud resources are not unlimited. That's why I began this post with the Amazon example. I know of no more torturous testing of any IT system prior to deployment. But for Amazon, the stakes were critical. TVB won't vanish after this incident--they'll lick their wounds and (hopefully) learn from it.
And by the way, congratulations to the wonderfully named Carat Cheung.