TypePad architecture: Problems and solutions

TypePad was, and probably still is, one of the first and largest paid blogging services in the world. In a presentation at OSCON 2007, Lisa Phillips and Garth Webb spoke about the problems TypePad faced in 2005. Since rapid growth is a common problem for any successful company, I found it interesting enough to research a little more.

Like many services of its time, TypePad was initially built on a traditional stack: Linux, Postgres, Apache, mod_perl and Perl on the front end, with NFS storage for images on a filer. At that time they were pushing close to 250 Mbps (4 TB per day) through multiple pipes, and with a growing user base, activity and data, they were growing at a rate of 10 to 20% per month.

Just before a planned move to a newer, better data center in October 2005, TypePad started experiencing all kinds of problems due to its unexpected growth. The unprecedented stress on the system caused multiple failures over the next two months, ranging from hardware and software to storage and networking issues. At times reading or publishing was completely unavailable, and there were also sporadic performance issues with statistics calculations.

One of the most visible failures came in December 2005. During routine maintenance, in the middle of adding redundant storage, something caused the complete storage cluster to go offline, which took down the entire bank of web servers serving pages. Because the backend database ran on a separate storage cluster, it wasn’t directly affected by the outage.

It’s at times like these that most companies fail to communicate with their users. Six Apart, fortunately, understood this early and did its job well.

Today TypePad’s architecture is similar to LiveJournal’s, with users distributed across multiple MySQL clusters that use master-master replication. They have partitioned the database by user ID and keep a global database that maps each user ID to its partition. They use MySQL 5.0 with InnoDB, and Linux Heartbeat for HA.
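To make the partitioning idea concrete, here is a minimal sketch of a global directory that maps user IDs to partitions. It is not TypePad’s actual code: the class name, the connection strings and the "fewest users" placement rule are all assumptions made for illustration, and an in-memory dict stands in for the global database.

```python
# Hypothetical sketch: class name, DSNs and placement rule are illustrative only.
from dataclasses import dataclass, field


@dataclass
class GlobalDirectory:
    """Stands in for the global database that maps user IDs to partitions."""
    partitions: list[str]
    user_to_partition: dict[int, int] = field(default_factory=dict)

    def assign(self, user_id: int) -> int:
        """Place a new user on the partition that currently holds the fewest users."""
        if user_id in self.user_to_partition:
            return self.user_to_partition[user_id]
        counts = [0] * len(self.partitions)
        for pid in self.user_to_partition.values():
            counts[pid] += 1
        pid = counts.index(min(counts))
        self.user_to_partition[user_id] = pid
        return pid

    def dsn_for(self, user_id: int) -> str:
        """Return the connection string of the partition holding this user."""
        return self.partitions[self.user_to_partition[user_id]]


# Two made-up partitions; each would be a master-master pair in practice.
directory = GlobalDirectory(partitions=["mysql://user-db-a", "mysql://user-db-b"])
for uid in (101, 102, 103):
    directory.assign(uid)
```

Every request for a user’s data first asks the directory which partition to use, so a user can later be moved to another partition by updating a single mapping row.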

For images, they decided to switch from NFS storage to Perlbal (a Perl-based reverse proxy load balancer and web server) plus MogileFS (an open source distributed file system), which scales much better with lower overhead on commodity hardware. The diagram on the right shows how TypePad served images during the transition from NFS to MogileFS. Follow the numbered arrows to see how requests travel through the network. For an image stored on MogileFS (Mogstored), the app server first talks to MogileDB through mod_perl2 (steps 3, 4). MogileDB/mod_perl2 then sends a Perlbal internal redirect (steps 5, 6, 7) to the actual image resource, which is located on Mogstored (steps 8, 9).
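The internal redirect step can be sketched roughly as follows. This assumes the redirect uses Perlbal’s reproxy feature, where the backend answers with an `X-REPROXY-URL` header and Perlbal fetches the file itself; the presentation does not say which mechanism was used. The function names, the image key and the storage URLs are hypothetical, and a plain dict stands in for the MogileFS tracker lookup.

```python
# Hypothetical sketch of an app-server handler sitting behind Perlbal.
# FAKE_TRACKER stands in for the MogileFS lookup; its keys and URLs are made up.
FAKE_TRACKER = {
    "user42/photo.jpg": [
        "http://mogstored-1:7500/dev3/0/000/000/0000000123.fid",
        "http://mogstored-2:7500/dev7/0/000/000/0000000123.fid",
    ],
}


def lookup_paths(image_key):
    """Return the storage URLs holding copies of the image (empty if unknown)."""
    return FAKE_TRACKER.get(image_key, [])


def image_response(image_key):
    """Build (status, headers, body) for an image request.

    The app server never streams the image bytes. It only tells the proxy
    where the file lives, and the proxy fetches it from a storage node.
    """
    urls = lookup_paths(image_key)
    if not urls:
        return 404, {"Content-Type": "text/plain"}, b"not found"
    # Space-separated list of candidate URLs for the proxy to try.
    return 200, {"X-REPROXY-URL": " ".join(urls)}, b""
```

The point of the design is that the heavyweight mod_perl process is released as soon as it has answered with the header, while the lightweight proxy does the actual file transfer.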

Since most of the activity on the blogs is read-only, it made sense to add memcached early in the process to ease the load on many components.

memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
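A read-heavy workload benefits from the usual cache-aside pattern, sketched below. To keep the example runnable without a memcached server, a small in-memory class stands in for a memcached client; `FakeCache`, `load_post_from_db` and the key format are assumptions for illustration, not TypePad’s code.

```python
# Hypothetical cache-aside sketch. FakeCache stands in for a memcached client.
class FakeCache:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


DB_READS = {"count": 0}  # counts how often the "database" is actually hit


def load_post_from_db(post_id):
    """Pretend database read; a real one would run a SQL query."""
    DB_READS["count"] += 1
    return {"id": post_id, "title": f"Post {post_id}"}


def get_post(cache, post_id):
    """Serve from cache when possible, fall back to the database on a miss."""
    key = f"post:{post_id}"
    post = cache.get(key)
    if post is None:
        post = load_post_from_db(post_id)
        cache.set(key, post)
    return post


def update_post(cache, post_id):
    """After a write, drop the cached copy so the next read reloads it."""
    cache.delete(f"post:{post_id}")
```

Repeated reads of the same post touch the database only once; a write invalidates the entry rather than trying to keep the cached copy in sync.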

In another interesting approach to scalability, they recognized that one of the most write-intensive operations was the commenting system, which led them to experiment with “The Schwartz”. This gave them a queuing mechanism that could reliably defer write-intensive operations to the database, effectively allowing it to scale further.

The Schwartz is taglined “a reliable job queue system” and was originally developed as a generic job processing system for Six Apart’s hosted services. It is used in production today on TypePad, LiveJournal and Vox for managing tasks that can be performed by the system without user interaction.
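The idea of deferring writes through a reliable queue can be sketched as follows. This is not The Schwartz’s API (that is a Perl library); it is a minimal Python illustration of the same pattern, with a made-up table layout and retry rule, using an in-memory SQLite database so it runs anywhere.

```python
# Hypothetical sketch of a database-backed job queue with retries.
# Table layout, method names and the retry policy are illustrative assumptions.
import sqlite3


class JobQueue:
    def __init__(self, conn, max_retries=3):
        self.conn = conn
        self.max_retries = max_retries
        conn.execute(
            "CREATE TABLE IF NOT EXISTS job ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "func TEXT NOT NULL, arg TEXT NOT NULL, "
            "failures INTEGER NOT NULL DEFAULT 0)"
        )

    def insert(self, func, arg):
        """Record the job; the web request can return immediately afterwards."""
        self.conn.execute("INSERT INTO job (func, arg) VALUES (?, ?)", (func, arg))
        self.conn.commit()

    def work_once(self, handlers):
        """Run the oldest job. Return False when the queue is empty."""
        row = self.conn.execute(
            "SELECT id, func, arg, failures FROM job ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return False
        job_id, func, arg, failures = row
        try:
            handlers[func](arg)
        except Exception:
            if failures + 1 >= self.max_retries:
                # Give up after too many failures.
                self.conn.execute("DELETE FROM job WHERE id = ?", (job_id,))
            else:
                # Keep the job so a later call retries it.
                self.conn.execute(
                    "UPDATE job SET failures = failures + 1 WHERE id = ?", (job_id,)
                )
        else:
            # Only a successful run removes the job.
            self.conn.execute("DELETE FROM job WHERE id = ?", (job_id,))
        self.conn.commit()
        return True


queue = JobQueue(sqlite3.connect(":memory:"))
saved_comments = []
queue.insert("save_comment", "Nice post!")
while queue.work_once({"save_comment": saved_comments.append}):
    pass
```

Because a job is deleted only after its handler succeeds, a crash in the middle of processing leaves the job in the table to be picked up again, which is what makes the deferred write reliable.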

References

http://www.sixapart.com/typepad/news/2005/10/to_our_customers.html

http://www.niallkennedy.com/blog/archives/2005/12/typepad-outage-details.html

http://www.movabletype.org/documentation/administrator/publishing/publish-queue.html

Facebook internals

The code leaked through a Facebook bug was posted online by an anonymous user. Though the source itself didn’t look very damaging, it did damage the Facebook brand. I won’t go into that in this post; instead I would like to discuss the Facebook internals that alex.moskalyuk touched upon.

Alex pointed out that this is not the only Facebook code we have seen. In fact, we already know a lot more about how Facebook works internally than most of us would learn from the source of the index.php published yesterday.

  1. PHP – This is no surprise. Though PHP is not developed at Facebook, Alex points out that Facebook developers are involved at least at some level in the development of PHP.
  2. Apache – Neither should this be.
  3. MySQL – Same here.
  4. Valgrind – This is a suite of tools for debugging and profiling Linux programs. With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting and making your programs more stable. You can also perform detailed profiling to speed up your programs and reduce their memory use. Other related tools they use are Callgrind/Calltree, KCachegrind and OProfile.
  5. APC – Facebook developers have talked about using Alternative PHP Cache in some presentations they have given in the past.
  6. Facebook Thrift – Thrift is a software framework for scalable cross-language services development. It combines a powerful software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, and Ruby. Thrift was developed at Facebook, and it has been released as open source. More information can be found in this whitepaper.
  7. Memcached – This is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Its use shouldn’t come as a surprise, since most of the new Web 2.0 companies, especially the ones using PHP and Python, have experimented with it or implemented it at some level.
  8. phpsh – This is another interesting tool Facebook developers use internally. It is an interactive shell for PHP that features readline history, tab completion and quick access to documentation. Ironically, it is written mostly in Python.
  9. Facebook platform code – Facebook has released a lot of code to support the Facebook platform and to get users to develop for it.
  10. Facebook Firefox plugin – This is the last one I’d like to mention here. It is again open source (in the sense that you can see the code once you open up the plugin yourself).

References