TypePad architecture: Problems and solutions

TypePad was, and probably still is, one of the first and largest paid blogging services in the world. In a presentation at OSCON 2007, Lisa Phillips and Garth Webb spoke about the problems TypePad ran into in 2005. Since these are problems common to any successful company, I found the talk interesting enough to research a little more.

TypePad was, like many other services, initially designed in the traditional way, with Linux, Postgres, Apache and mod_perl/Perl as the front end and NFS storage for images on a filer. At that time they were pushing close to 250 Mbps (4TB per day) through multiple pipes, and with a growing user base, activity and data, they were growing at a rate of 10 to 20% per month.

Just before a planned move to a newer, better data center, sometime in October 2005, TypePad started experiencing all kinds of problems due to its unexpected growth. The unprecedented stress on the system caused multiple failures over the next two months, ranging from hardware and software to storage and networking issues. At times reading or publishing was completely unavailable; at other times there were sporadic performance issues with statistics calculations.

One of the most visible failures came in December 2005. During routine maintenance, in the middle of adding redundant storage, something caused the complete storage cluster to go offline, which took down the entire bank of web servers serving the web pages. Because the backend database had its own separate storage cluster, it wasn’t directly affected by the outage.

It’s at times like these that most companies fail to communicate with their users. Six Apart, fortunately, understood this early and did its job well.

Today TypePad’s architecture is similar to LiveJournal’s, with users distributed over multiple master-master MySQL replication setups. They have partitioned the database by UserID and keep a global database that maps UserIDs to partitions. They use MySQL 5.0 with InnoDB, and Linux Heartbeat for HA.
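To make the partitioning idea concrete, here is a minimal sketch of the lookup step, assuming the global database can be modelled as a simple UserID-to-partition map. The class names, partition names and hostnames are all hypothetical, and plain dictionaries stand in for the real MySQL tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One user partition, backed by a master-master MySQL setup."""
    name: str
    masters: tuple  # (host_a, host_b); hypothetical hostnames


class GlobalDirectory:
    """Stand-in for the global database that maps UserIDs to partitions."""

    def __init__(self, partitions):
        self._partitions = {p.name: p for p in partitions}
        self._user_to_partition = {}

    def assign(self, user_id, partition_name):
        """Record which partition a user lives on."""
        if partition_name not in self._partitions:
            raise KeyError(f"unknown partition: {partition_name}")
        self._user_to_partition[user_id] = partition_name

    def lookup(self, user_id):
        """Return the partition holding this user's data."""
        return self._partitions[self._user_to_partition[user_id]]


directory = GlobalDirectory([
    Partition("p1", ("db1a.example", "db1b.example")),
    Partition("p2", ("db2a.example", "db2b.example")),
])
directory.assign(1001, "p1")
directory.assign(1002, "p2")

# Every request first asks the directory where the user lives,
# then sends its queries to that partition only.
home = directory.lookup(1001)
```

The point of the extra lookup is that a user can be moved to another partition by changing one row in the global map, without touching application code.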

For images, they decided to switch from NFS storage to Perlbal (a Perl-based reverse proxy load balancer and web server) plus MogileFS (an open source distributed file system), which scales much better, with lower overhead, on commodity hardware. The image on the right shows how TypePad served images during the transition from NFS to MogileFS. Follow the numbered arrows to see how a request travels through the network. For an image stored on MogileFS (Mogstored), the app server first talks to MogileDB through mod_perl2 (steps 3, 4). MogileDB/mod_perl2 then sends a Perlbal internal redirect (steps 5, 6, 7) to the actual image resource, which is located on Mogstored (steps 8, 9).
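The internal redirect can be sketched roughly as follows. Perlbal lets a backend answer with an X-REPROXY-URL or X-REPROXY-FILE header instead of a body, and then fetches the content itself. Everything else here is an assumption for illustration: the function name, the dictionary standing in for the MogileDB lookup, the NFS mount path, and the idea that not-yet-migrated images fall back to the filer this way.

```python
def image_response_headers(key, mogile_paths, nfs_root="/mnt/filer/images"):
    """Return the headers the app server hands back to Perlbal for one image.

    mogile_paths is a stand-in for the MogileDB lookup: it maps an image
    key to the list of storage-node URLs that hold a copy of the file.
    """
    urls = mogile_paths.get(key)
    if urls:
        # Migrated image: Perlbal fetches it from one of these URLs.
        return {"X-REPROXY-URL": " ".join(urls)}
    # Not migrated yet (assumed behaviour): serve from the NFS mount.
    return {"X-REPROXY-FILE": f"{nfs_root}/{key}"}


# Hypothetical lookup table with one migrated image.
paths = {
    "a.jpg": [
        "http://mogstored1.example:7500/dev1/a.fid",
        "http://mogstored2.example:7500/dev2/a.fid",
    ],
}
migrated = image_response_headers("a.jpg", paths)
legacy = image_response_headers("b.jpg", paths)
```

The app server only decides where the image lives; the bytes themselves never pass through mod_perl, which keeps the heavyweight application processes free for dynamic requests.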

Since most of the activity on the blogs is read-only, it made sense to add memcached early in the process to ease the load on many components.

memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
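A common way to use memcached for read-heavy traffic is the cache-aside pattern: check the cache first, and only query the database on a miss. The sketch below illustrates that pattern and is not TypePad’s actual code. FakeCache is an in-process stand-in for a memcached client, and CountingDB, load_post and the key format are hypothetical.

```python
class FakeCache:
    """In-process stand-in for a memcached client (get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class CountingDB:
    """Stand-in for the database; counts how many queries reach it."""

    def __init__(self, rows):
        self._rows = rows
        self.queries = 0

    def fetch_post(self, post_id):
        self.queries += 1
        return self._rows[post_id]


def load_post(post_id, cache, db):
    """Cache-aside read: try the cache, fall back to the database."""
    key = f"post:{post_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    post = db.fetch_post(post_id)  # the expensive path
    cache.set(key, post)
    return post


db = CountingDB({42: "Hello, world"})
cache = FakeCache()
first = load_post(42, cache, db)   # miss: goes to the database
second = load_post(42, cache, db)  # hit: served from the cache
```

After the first read, repeated views of the same post never reach the database, which is where the load relief for a read-mostly workload comes from. A real deployment also needs to expire or delete cached entries when a post is edited.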

In another interesting approach to scalable architecture, they recognized that one of the most write-intensive operations was the commenting system, which led them to experiment with “The Schwartz”. This technology gave them a queuing mechanism that could reliably defer write-intensive operations to the database, effectively allowing it to scale further.

The Schwartz is taglined “a reliable job queue system” and was originally developed as a generic job processing system for Six Apart’s hosted services. It is used in production today on TypePad, LiveJournal and Vox for managing tasks that can be performed by the system without user interaction.
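The general idea of deferring writes through a job queue can be sketched as below. This is not The Schwartz’s API: JobQueue, enqueue and work_once are hypothetical names, and the sketch keeps jobs in memory, whereas a reliable job queue has to persist them so they survive a crash.

```python
from collections import deque


class JobQueue:
    """Toy job queue: FIFO order, with a bounded number of retries."""

    def __init__(self, max_attempts=3):
        self._jobs = deque()
        self.max_attempts = max_attempts
        self.failed = []  # payloads that exhausted their retries

    def enqueue(self, payload):
        """Called by the web request; returns immediately."""
        self._jobs.append({"payload": payload, "attempts": 0})

    def work_once(self, handler):
        """Run one queued job; return False when the queue is empty."""
        if not self._jobs:
            return False
        job = self._jobs.popleft()
        job["attempts"] += 1
        try:
            handler(job["payload"])
        except Exception:
            if job["attempts"] < self.max_attempts:
                self._jobs.append(job)  # retry later
            else:
                self.failed.append(job["payload"])
        return True

    def __len__(self):
        return len(self._jobs)


# The list stands in for the comments table in the database.
comments_table = []
queue = JobQueue()
queue.enqueue({"post_id": 7, "body": "Nice post"})
queue.enqueue({"post_id": 7, "body": "Agreed"})

# A background worker drains the queue at its own pace.
while queue.work_once(comments_table.append):
    pass
```

The request that submits a comment only pays for the enqueue; the database write happens later, at whatever rate the workers and the database can sustain.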

References

http://www.sixapart.com/typepad/news/2005/10/to_our_customers.html

http://www.niallkennedy.com/blog/archives/2005/12/typepad-outage-details.html

http://www.movabletype.org/documentation/administrator/publishing/publish-queue.html