Sunday, May 12, 2013

A Story of "Design for Failure"


In the era of cloud computing, what is the most important factor to get right? You may think of scaling, and it could be: scaling matters a lot once your business gets bigger and bigger. You may think of backups, which should always be there. You may also think of programmable computing resources, a really important concept from AWS. Machines are programmable: you can add or remove a machine within seconds instead of purchasing hardware from a vendor and deploying it to a data center, and you can allocate a new, reliable database without depending on an operations team. However, as a startup, my business started from scratch and I do everything myself. In my practice, "Design for Failure" has been the top priority from the very beginning.

With AWS providing EC2 and other vendors providing VPS, it is common sense to use a VPS instead of building your own data center when you are not so big. Scaling was not so important for me because I was still very small: a few machines were enough to support the current number of users, although I did design for future scaling. Design for failure? Yes, I had considered it, but not so seriously. My VPS provider, Linode, claimed 99.95% availability, and Linode has a very good reputation in this industry. I trusted them.

Some background on my online service. I released a new version of my desktop application, PomodoroApp, at the end of 2012, with support for data synchronization across computers. Users rely on my server to sync their data. It was yet another new service on the Internet that nobody knew about; I had no idea whether the next day would bring 1 new user or 1,000. Although I had designed a reliable and scalable server architecture, I deployed a minimum viable setup to reduce cost, since perhaps nobody would use the service in the next week: 2 web servers, one hosting my website and another hosting a node.js server for data synchronization (it provides only REST services, so I'll call it the sync server), plus 1 MongoDB database server instance. Each one could be a single point of failure, which is acceptable if I get 99.95% availability. The sync server was under very low load, so I also configured it as the secondary of a MongoDB replica set, and the server code supports reading data from the replica set.
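For illustration, a two-member replica set like the one described above could be initiated from the mongo shell roughly like this (MongoDB 2.x era syntax; the set name and hostnames are made up for the example, not my actual configuration):

    // Run once on the dedicated database server.
    // "db1" is the database server, "sync1" is the sync server that also
    // hosts a secondary. Hostnames and set name are hypothetical.
    rs.initiate({
      _id: "pomodoro",
      members: [
        { _id: 0, host: "db1.example.com:27017",   priority: 2 }, // preferred primary
        { _id: 1, host: "sync1.example.com:27017", priority: 1 }  // secondary on the sync server
      ]
    });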



Everything ran very well over the next 2 months. I kept improving the server and adding new features. Users found my service through Google, blogs, Facebook and Twitter, and grew at a steady rate. When I had new code, it took only about 1 second to restart the service. On February 17th, 2013, for an unknown reason, the database server went out of service. Nobody knew why; Linode technical support managed to fix the issue. When the database server went down, the secondary database on the sync server was elected primary, and all reads and writes switched to it automatically. This may take about a minute, depending on the timeout settings, so the outage of the database server had no impact on my sync service.
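As a sketch, a node.js client can ride out such a failover by connecting through the replica set rather than a single host, roughly like this (hostnames, database name and timeout values are assumptions; this follows the node-mongodb-native driver style of that era, not my exact server code):

    // Listing both members lets the driver rediscover whichever one is
    // primary after an election instead of pinning to a single dead host.
    var MongoClient = require('mongodb').MongoClient;

    var uri = 'mongodb://db1.example.com:27017,sync1.example.com:27017/pomodoro'
            + '?replicaSet=pomodoro';

    MongoClient.connect(uri, {
      server: { socketOptions: { connectTimeoutMS: 5000, socketTimeoutMS: 30000 } }
    }, function (err, db) {
      if (err) { return console.error('connect failed:', err); }
      // Writes always go to the current primary; during the election window
      // a write fails and the application has to retry it.
      db.collection('tasks').insert({ title: 'example task' }, function (err2) {
        if (err2) { console.error('write failed, retry after failover:', err2); }
      });
    });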

However, I was just lucky with the incident of Feb 17. Just 3 days later, my sync server went down, and I could not even restart it from the Linode management console. The outage took 55 minutes. I got alerts from the monitoring service Pingdom, and also from customer reports. This was the first lesson: single points of failure do happen. I decided to add more sync servers, and consequently a load balancer became necessary in front of the 2 sync servers. In addition, I added a 3rd replica set member that lags 1 hour behind the primary, so that if any data gets corrupted I can recover it from this delayed backup. You may ask why a 1 hour delay instead of 24 hours; ideally there should be multiple delayed replica set members. In my production environment the user count is still small, and there is no need for sharding so far. But my new features and changes to existing code are only tested in the dev environment, and when I deploy them to the server they may damage the data. I need a backup plan for that case. Even though there are still SPOFs, it's much better :)
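A delayed member like that could be added from the mongo shell roughly as follows (MongoDB 2.x syntax; the hostname is hypothetical):

    // A hidden, priority-0 member that applies the oplog 1 hour late.
    // It can never become primary and clients never read from it, but it
    // always holds a 1-hour-old copy of the data for recovery.
    rs.add({
      _id: 2,
      host: "backup1.example.com:27017",
      priority: 0,
      hidden: true,
      slaveDelay: 3600   // seconds
    });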

The real disaster happened on May 11, when I was about to deploy a new version that resolved some database issues. The new version handled index creation on the database. I use a web based admin tool to manage my MongoDB instances, and when I connected to the production database for final release testing, I happened to find a duplicated index on a collection. I was not sure why this had happened, so I deleted one of the two in the admin tool. The tool reported that both indexes had been deleted. Later, when I continued my testing and tried to sync data to the server, I got an error that the commit to the database had failed. This had never happened before. I then used the MongoDB console to check the collection. To my surprise, the whole collection was lost, and it could not be created again. I shut down the MongoDB server and tried to restart it. Failed! The database log showed "exception: BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO". Googling the exception did not help much. Oh my, I finally had to recover the database. Fortunately I had one replica set member with a real-time mirror of the database, and another member with a 1 hour delay. I spent about 2 hours fixing the issue, but my sync service stayed online and functioned well, because I had "stepDown" the damaged primary and a secondary was now working as the primary. The troubleshooting did not hurt my online service. MongoDB really did an excellent job with the replica set pattern.
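For reference, the same kind of index inspection and failover can be done from the mongo shell, roughly as below (collection and index names are made up; this is an illustration, not the exact admin-tool operation that caused the damage):

    // Inspect the indexes on a collection and drop a duplicate by name.
    db.tasks.getIndexes();
    db.tasks.dropIndex("userId_1_dup");

    // Ask the damaged primary to step down so a healthy secondary takes over.
    // The argument is how long, in seconds, it must not seek election again.
    rs.stepDown(300);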

Initially I decided to recover the database from the replica set member with the 1 hour delay. But it sits in another datacenter: copying the data files with scp ran at only 1.7 MB/s, and I have about 9 GB of data in total, so that would have taken a long time. Then I checked the new primary and fortunately found that it (the old secondary) was in good shape; its data files were not broken. So I stopped the new primary and spent about 2 minutes copying all its files at around 29 MB/s within the same datacenter. Again, it is still a very small business, and a 2 minute outage is acceptable because my client software supports offline mode: it has a local database and can work without Internet access, then syncs to the server when the network becomes available. Some users even disable the sync feature because they don't want to upload any data to the server. After all the files were copied, I restarted MongoDB. It took several seconds to replay the uncommitted data from the oplog and start replicating from the primary again. Everything works well now. MongoDB rocks!
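After a member is rebuilt like this, its state can be verified from the mongo shell, for example (an illustration of the checks, not a transcript of what I actually ran):

    // On the rebuilt member: confirm every member's state, e.g. that this
    // node has rejoined the set as SECONDARY.
    rs.status().members.forEach(function (m) {
      print(m.name + " " + m.stateStr);
    });

    // On the primary: see how far each secondary lags behind the oplog.
    db.printSlaveReplicationInfo();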

Even though I have the ultimate backup plan designed and tested in my client software, the incident still made me very tense. My backup plan is that even if the whole database is lost, I can still recover all the data: the client software supports offline mode and keeps a full copy of each user's data, and automatic data recovery from the user's machine back to the server is already in place.
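Conceptually, that last-resort recovery path looks something like the sketch below (a hypothetical illustration only; the real client is a desktop application, and the function and field names here are invented):

    // Hypothetical client-side sketch: if the server no longer knows about an
    // item that exists in the local database, push the local copy back up.
    function recoverToServer(localItems, serverIds, upload) {
      localItems.forEach(function (item) {
        if (serverIds.indexOf(item.id) === -1) {
          upload(item);   // re-create the missing record on the sync server
        }
      });
    }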

This story is my first real disaster so far. I respect my VPS provider Linode, and I respect the people behind the Linux server, node.js and MongoDB. But it is really a must to keep "design for failure" as the top priority even when you are very small. The hardware may go down, the software may have bugs, the I/O or the memory may get corrupted, and hackers may target your server. People say the only thing that never changes is change. My lesson is that the only thing that never fails is failure. Without these lessons, "Design for Failure" would never have had such a tremendous impact on my future designs.
