Archive for August, 2007

Scalability Stories for Aug 30

Thursday, August 30th, 2007
  1. I found a very interesting story on how memcached was created. Its an old story titled “Distributed caching with memcached“. I also found an interesting FAQ on memcached which some of you might like.
  2. Inside Myspace is another old story which follows Myspace’s growth over time. Its a very long and interesting read which shouldn’t be ignored.
  3. Measuring Scalability tries to put numbers to the problem of scalability. If you have to justify the cost of scalability to anyone in your organization, you should atleast skim through this page
  4. I found a wonderful story on the humble architecture of Mailinator and how it grew over time on just one webserver. It receives approx 5 million emails a day and runs the whole operation pretty much in memory with no logs or database to leave traces behind. And here is another page from the creator of Mailinator abouts its stats from Feb.
  5. Finally another very interesting presentation/slide on the topic of “Scalable web architecture” which focuses primary on LAMP architecture.

Popularity: 20%

Thoughts on scalability

Tuesday, August 28th, 2007

Here is an interesting contribution on the topic from Preston Elder 

I’ve worked in multiple extremely super-scaled applications (including ones sustaining 70,000 connections at any one time, 10,000 new connections each minute, and 15,000 concurrent throttled file transfers at any one time - all in one application instance on one machine).

The biggest problem I have seen is people don’t know how to properly define their thread’s purpose and requirements, and don’t know how to decouple tasks that have in-built latency or avoid thread blocking (and locking).

For example, often in a high-performance network app, you will have some kind of multiplexor (or more than one) for your connections, so you don’t have a thread per connection. But people often make the mistake of doing too much in the multiplexor’s thread. The multiplexor should ideally only exist to be able to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool to do actual processing. Anything more and your multiplexor can’t get back to retrieving the next bit of data fast enough.

Similarly, when moving data from a multiplexor to a thread pool, you should be a) moving in bulk (lock the queue once, not once per message), AND you should be using the Least Loaded pattern - where each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is least loaded, and next time the multiplexor has another batch, it will move it to a different thread because IT is least loaded. Assuming your processing takes longer than the data takes to be split into packets (IT SHOULD!), then all your threads will still be busy, but there will be no lock contention between them, and occasional lock contention ONCE when they get a new batch of messages to process.

Finally, decouple your I/O-bound processes. Make your I/O bound things (eg. reporting via. socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren’t waiting to give the I/O bound thread data - in this case, a similar pattern to the above in reverse works well - where each thread PUSHING to the I/O bound thread has its own queue, and your I/O bound thread has its own queue, and when it is empty, it just collects the swaps from all the worker queues (or just the next one in a round-robin fashion), so the workers can put data onto those queues at its leisure again, without lock contention with each other.

Never underestimate the value of your memory - if you are doing something like reporting to a stats/reporting server via. socket, you should implement some kind of Store and Forward system. This is both for integrity (if your app crashes, you still have the data to send), and so you don’t blow your memory. This is also true if you are doing SQL inserts to an off-system database server - spool it out to local disk (local solid-state is even better!) and then just have a thread continually reading from disk and doing the inserts - in a thread not touched by anything else. And make sure your SAF uses *CYCLING FILES* that cycle on max size AND time - you don’t want to keep appending to a file that can never be erased - and preferably, make that file a memory mapped file. Similarly, when sending data to your end-users, make sure you can overflow the data to disk so you don’t have 3mb data sitting in memory for a single client, who happens to be too slow to take it fast enough.

And last thing, make sure you have architected things in a way that you can simply start up a new instance on another machine, and both machines can work IN TANDEM, allowing you to just throw hardware at the problem once you reach your hardware’s limit. I’ve personally scaled up an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections - and even making sure I could run multiple collectors side-by-side for when the data is too much for one collector to crunch.

I don’t know of any papers on this, but this is my experience writing extremely high performance network apps

:)

[ This post was originally written by Preston on slashdot. Reproduced here with his permission. If you have an interesting comment or a writeup you want to share on this blog, please forward it to me or submit a comment to this post ]

Tags:

Popularity: 30%

Loadbalancer for horizontal web scaling: What questions to ask before implementing one.

Tuesday, August 28th, 2007

A single server, today, can handle an amazing amount of traffic. But sooner or later most organizations figure out that they need more and talk about choosing between horizontal and vertical scaling. If you work for such an organization and also happen to manage networking devices, you might find a couple of loadbalancers on your desk one day along with a yellow sticky note with a deadline on it.

Loadbalancers, by definition, are supposed to solve performance bottlenecks by distributing or balancing load between different components its managing. Though you would normally find loadbalancers in front of a webserver, a lot of different individuals have found other interesting ways of using it. For example I know some organizations have so much network traffic that they can’t use just one sniffer or firewall to do their job. They ended up using some high end loadbalancers to intelligently loadbalance network traffic through multiple sniffers/firewalls attached to them. These are rare, so don’t worry about it.

But in general, if you are having a web scalability issue, you would probably start looking at hardware solutions first. And if you do, here are my recommendation on questions you need to ask before you investigate how to set it up.

  • Is the loadbalancer really needed ?
    • Identify the bottlenecks in the system. Adding webservers/appservers will not solve the problem if database is the bottleneck.
      • BTW, If you can replicate read only database to multiple servers, you might be able to use a loadbalancer to balance that traffic…
    • If most of the traffic is due to static content, investigate apache’s ability to set better cachable headers. Or use a CDN(Content distribution network) like akamai/limelight to offload cachable objects.
    • DNS and SMTP are examples of protocols which are designed to be automatically loadbalanced. If an SMTP server fails to respond, the DNS protocol will help an SMTP client to find alternative SMTP servers for a destination. If your organization has controls over the protocol they are using, they should investigate the possibility of using DNS as the loadbalancing mechanism.
  • Is it a software or a hardware loadbalancer?
    • Most people can’t think of loadbalancer as anything else but a dedicated hardware. Interestingly, chances are you are already using software loadbalancers in your network without knowing it (like the DNS loadbalancer I mentioned above). I’ve seen a decline in commercial web software loadbalancers in the recent years, which has been replaced by open source components like mod_jk loadbalancer, perlbal,etc.
    • In this particular writeup, I’m going to focus on hardware loadbalancer to keep it short.
  • Do you need failover ?
    • If you need loadbalancer failover, you should be buying in pairs
    • Some loadbalancer work in Active-Active mode, where both loadbalancers can be functional, while others allow only Active-Passive mode.
    • The term “failover” can mean many things. Make you you ask the right questions to understand what a loadbalancer really offers.
      • It could mean TCP session failover where every single TCP connection will be failed over.
      • It could also mean HTTP session failover (where one session is defined by one unique Cookie). If a loadbalancer supports only this mode, every single user connected to the loadbalancer will notice a blip when the primary loadbalancer dies. More often than not, this is what a loadbalancer vendor usually provides. And unfortunately not everyone in pre-sales is aware of this “minor” detail.
  • How do you want to do configuration management ?
    • Not all brands are made equal. Some are easy to manage and others aren’t. I personally prefer CLI over GUI on any day.
    • But more than the CLI/GUI, I want the ability to revision control and compare different versions of configuration files. The current brand we use at work, unfortunately, doesn’t provide an easy way to do this. If you are supporting a big operation with multiple loadbalancers and have to replicate same/similar setup in multiple places, then this is something you shouldn’t compromise on.
  • How many servers are we talking about ?
    • If your loadbalancer device has ports on it to attach the servers physically to it, you should make sure you don’t have more servers than the number of ports
    • In most cases, however, you can add a 100BaseT switch and add more servers to it. But regardless, having an approximate number of servers will help you decide some of the other questions later.
  • Will all of these server be part of the same farm/cluster ?
    • Some organizations setup different websites on the same loadbalancer and set it up in such a way that the loadbalancer distributes load to different farm/cluster of servers depending on which domain name the client requested for.
    • Some loadbalancers can also inspect the URL and send requests with different paths (in the same domain) to different server farms. For example “/images” could go to one server and “/signup” could go to another one.
    • There are still others who might keep a set of servers on standby mode (not active) waiting to be brought up automatically in an event of problems with the primary clusters.
  • Will all of the servers have the same weights when loadbalancing ?
    • For example are there some servers which should get more traffic than others because they are faster ?
  • Is session stickyness important ?
    • If a user ends up on a particular webserver, is there any reason why that user should continue to stay on that webserver ? There are different kinds of session stickiness which I can think off.
      • The most popular kind is probably IP based stickiness where traffic from same IP always goes to the same webserver. This was a great way of loadbalancing until companies like AOL decided that they will loadbalance outgoing traffic using different proxy servers, effectively changing source IP address of traffic coming from the same client.
      • My favorite session stickiness mechanism is Cookies. Cookies can be used as session tracking IDs to associate a user session with a particular webserver. There are many different ways of implementing this of which these are the few interesting ways I’ve used
        • Allow the loadbalancer to set a cookie for you in each session without the tracking cookie.
        • Most web application servers like PHP and java use cookie names like “PHPSESSIONID” or “JSESSIONID” which is an excellent session identifier which a loadbalancer can track.
        • There are a few other interesting cookie options… but I’d rather not discuss it here at this moment.
    • If you really need session stickyness, you should investigate further if your application is really horizontally scalable. Most often, sticky session feature is used as a bandaid to temporarily give an impression of scalable web app design, which could, in future, prove disastrous.
    • Session stickiness comes with some other configuration baggage as well.
      • You need to decide how long an idle session should be considered active before its shutdown. If you have a lot of traffic and if you set this session timeout to be too high, it can quickly fill up your memory.
      • If possible set the timeout to be as close to the application timeout on your app server.
  • Does the traffic have to be encrypted using SSL ?
    • Some loadbalancers have built in SSL engines.
    • Others have capabilities to offload them to SSL accelerators.
    • One could also set it up in such a way that traffic is decrypted on the apache server after the loadbalancing is done. Please be cautious here. If you decide to do SSL decryption on the webserver, you are effectively disallow the loadbalancer to inspect HTTP packets which can be otherwise used to make intelligent routing decisions
  • Do you need compression ?
    • Some Load balancers which come with SSL engines support on-the-fly compression which can significantly speed up user experience if you have a lot of compressible objects.
  • Debugging
    • Whether you like it or not, one day you will have some problem with your loadbalancer and you will be asked by their support team to get some sniffs for them. This usually is a painful process, especially if the equipment is in production network. One feature which can simplify this is allowing the device to sniff on itself. This is not a must-have, but probably a like-to-have feature.
  • Other services
    • Checklist for other services… if you want
      • NTP - Have its time always be in sync with rest of your network
      • Syslog - Ability to send syslog messages.
      • Mail - Ability to send mails when problems happen
      • SNMP - Monitoring and traps
  • External factors
    • A standard sized company investing in loadbalancers usually don’t invest on a single loadbalancer. They usually buy a pair for production network, and another pair for QA or staging networks. By the time you add the support costs, it gets very expensive. If you make an investment like that, make sure the company selling that to you is still alive a couple of years down the line to support you.
    • Loadbalancer’s are not as reliable as phones. Finding bugs in a loadbalancer is much easier than you think. If you don’t have a reliable support team who is willing to help you patch the code fast enough, you might see some downtime or performance issues
    • At the same time, if the company does release a lot of patches on a weekly or monthly basis, you should find out how stable its code is. I usually ask how long it has been since the last stable release.




Popularity: 27%

TypePad architecture: Problems and solutions

Saturday, August 25th, 2007

TypePad was and probably is one of the first and largest paid blogging service in the world. In a presentation at OSCON 2007 , Lisa Phillips and Garth Webb spoke about TypePad’s problems in 2005. Since this is a common problem with any successful company I found it interesting enough to research a little more.

TypePad was, like any other service, initially designed in the traditional way with Linux, Postgres, Apache, mod_perl, perl as the front end and NFS storage for images on a filer. At that time they were pushing close to 250mbps (4TB per day) through multiple pipes and with growing user base, activity and data they were growing at the rate of 10 to 20% per month.

Just before the planned move to newer better data center, sometime in Oct 2005, TypePad started experiencing all kinds of problems due to its unexpected growth. The unprecedented stress on the system caused multiple failures over the next two months which ranged from hardware, software, storage to networking issues. While at times it made reading or publishing services to be completely unavailable, it also caused sporadic performance issues with statistic calculations.

One of the most visible failures was in December of 2005 when during a routine maintenance, in the middle of the process of adding redundant storage, something caused the complete storage cluster to go offline which caused the entire bank of webservers serving the webpages went down . Because they had separate storage cluster for backend database, it wasn’t affected by the outage directly.

Its at times like these that most companies fail to communicate with their users. Sixpart, fortunately, understood this early and did its job well.

Today Typepad’s architecture is similar to the one of Livejournal with users distributed over multiple master-master mysql replication. They have partitioned the database by UserIDs and have a global database to map UserIDs to partitions. They use Mysql 5.0 with InnoDB and Linux Heartbeat for HA.

The images though they decided to switch from a NFS storage to Perlbal ( Perl-based reverse proxy load balancer and web server) +MogileFS (open source distributed file system) which can scale much better with lower overhead over commodity hardware. Look at the image on the right which how Typepad served images in the transition phase from NFS to MogileFS. Follow the arrows with numbers to see how the requests go through within the network. For an image stored on MogileFS (Mogstored), the app server talks to MogileDB through mod_perl2 first (Step 3,4). MogileDB/mod_perl2 sends a Perlbal internal redirect(Step 5,6,7) to the actual image resource which is located on Mogstored(step 8,9).

Since most of the activity on the blogs are read only operations, it made sense to add memcached early into the process to ease load on a lot of components.

memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

In another interesting approach to scalable architecture they recognized the fact that one of the most write intensive operations was commenting system which made them experiment with “The Schwartz“. This technology helped them use a queuing mechanism which could reliably delay write intensive operations to the database effectively allowing it to scale more.

The Schwartz is taglined “a reliable job queue system” and was originally developed as a generic job processing system for Six Apart’s hosted services. It is used in production today on TypePad, Livejournal and Vox for managing tasks that can be performed by the system without user interaction.

References

http://www.sixapart.com/typepad/news/2005/10/to_our_customers.html

http://www.niallkennedy.com/blog/archives/2005/12/typepad-outage-details.html

http://www.movabletype.org/documentation/administrator/publishing/publish-queue.html

Popularity: 27%

How Skype network handles scalability..

Monday, August 20th, 2007

There was a major skype outage last week and though there is an “official explaination” and other discussions about it floating around, I found this comment from one of the GigaOm readers more interesting to think about. Now this particular description may not accurately describe the problem (which might be speculation as well) but it does describe , in a few words, how skype’s p2p network scales out. You should also take a look at the detailed discussion of the skype protocol here.

Number of Skype Authentication servers:
Count == 50; // Clustered
Number of potential Skype clients:
Count = 220,000,000 // Mostly decentralized
Number of SuperNode clients to maintain network connectivity:
Count = N / 300 at any one time.

• If there are 3.0 million users online then the ratio is 3,000,000 / 300 = 10,000 == Supernodes available
• Supernodes are bootstraps into the network for normal first run clients (”and handle routing of children calls”).
• Supernodes maintain the network overlay via a DHT(”Distributed Has Table”) “type” method. // This is normally very slow and done over UDP
• If a client cannot find a Supernode, regardless of authentication via central server then is NOT allowed on the Skype network.

Lack of Supernodes mean lack of network connectivity regardless of successful login via “central server”.
You CAN be a Supernode but not have full network connectivity because you have only a portion of the “Distributed Index Data aka DHT”.
MOST people that become Supernodes will bail out if they cannot keep a clear route (”aka calls bail out, client restarts and aborts Supernode status, thus booting it’s 300 - 500 Children and putting them into a “Connecting mode”.

Children that are trying to “Connect” are unable to do anything unless they have a “Supernode” as a parent. // No calls, No IM….

The overview of this is as follows:

Skype introduced a flaw into the network that dealt with “routing” and “fucked” the “decentralized data store aka DHT” this in turn ran clients on a RANDOM search of Supernodes which at this point were well booted off of the network.

In the End:
It is a huge cycle, no matter how many bugs they “fix” in the “central servers” it will take many days for N nodes to become Supernodes so they can route X data from peer A to peer B. This is NOT minor, a fix to the centralized server code base to relay data to N Supernodes there is lack there of, resulting of a very segregate network. Right now there are approximatly 10,000 sub Skype networks instead of 1 Single “in sync” network. When this “data store(see DHT) is in sync globally then the Skype network will be again STABLE.

I know this is very broad but, unless magically all of said nodes can recreate the “single overlay (DHT)” then nothing will be in sync. You will see delayed messaged, delayed or incorrect profiles and presence.

My take, in the end is give it 48 more hours and it may be semi-stable, but hey this is what you get with using end users as your own redundancy…

Yours…

Popularity: 29%

Links on scalability, performance and problems

Sunday, August 19th, 2007
8/19/2007 Big Bad Postgres SQL
8/19/2007 Scalable internet architectures
8/19/2007 Production troubleshooting (not related to scalability)
8/19/2007 Clustered Logging with mod_log_spread
8/19/2007 Understanding and Building HA/LB clusters
8/12/2007 Multi-Master Mysql Replication
8/12/2007 Large-Scale Methodologies for the World Wide Web
8/12/2007 Scaling gracefully
8/12/2007 Implementing Tag cloud - The nasty way
8/12/2007 Normalized Data is for sissies
8/12/2007 APC at facebook
8/6/2007 Plenty Of fish interview with its CEO
8/6/2007 PHP scalability myth
8/6/2007 High performance PHP
8/6/2007 Digg: PHP’s scalability and Performance td>

Popularity: 16%

Builtwith.com : Find out what a website’s frontend is built with

Sunday, August 19th, 2007

This is a very interesting website which allows you to understand the technology behind the websites you visit. Here is more from its about page

BuiltWith is a web site profiler tool. Upon looking up a page, BuiltWith returns all the technologies it can find on the page. BuiltWith’s goal is to help developers, researchers and designers find out what technologies pages are using which may help them to decide what technologies to implement themselves.

BuiltWith technology tracking includes widgets (snap preview), analytics (Google, Nielsen), frameworks (.NET, Java), publishing (WordPress, Blogger), advertising (DoubleClick, AdSense), CDNs (Amazon S3, Limelight), standards (XHTML,RSS), hosting software (Apache, IIS, CentOS, Debian) .

Popularity: 28%

ETags and loadbalancers

Monday, August 13th, 2007

A few weeks ago the company I work with noticed a weird problem with its CDN (Content Delivery Network) provider. They noticed that HEAD requests were being responded to by the CDN edge nodes using objects in the cache which had already expired. Whats worse is that even after an explicit content expiry notification was sent, the HEAD responses were still wrong. Long story short, the CDN provider had to setup bypass rules for the HEAD requests so that it always bypasses the cache. There was a slight performance overhead with this, but the workaround solved the problem.

Now while this was going on, one of the guys at the CDN support helping us mentioned something about Etags and why we should be using it. I didn’t understand how Etags would solve the problem if the CDN itself had a bug which was ignoring expiry information, but I said I’ll investigate.

Anyway, the traditional way of communicating object expiry is using the Last-Modified timestamp. ETags is another way of doing that, except that its more accurate.
A little more digging explained that ETags is not a hash of the contents of the file, but a combination of file’s inode, file size and last-modified timestamp. This is definitely more accurate and I could see why this might be better than just having last-modified timestamp. But what the CDN support guy didn’t mention is that if you are serving content from multiple webservers, even if you rsync the content between the servers, the Etags will always be different because rsync or any other standard copy commands don’t have control over the inode number used.

A little more search on the net confirmed that this is a problem and that ETags should probably be shut off (or modified such that it doesn’t use inodes) on servers behind loadbalancers.

Popularity: 23%

Scaling PlentyOfFish.com

Monday, August 13th, 2007

There is a very interesting interview with Markus Frind, the one man army behind the website PlentyOfFish.com. The site boasts of traffic higher than match.com, about 30 million page views a day, and runs on a single webserver with a couple of database servers. Markus has found interesting ways of surviving different kinds of problems he had. Here is the direct link to interview in wmv format.

Popularity: 11%

Facebook internals

Sunday, August 12th, 2007

The code leaked during a facebook bug was posted online by an anonymous user. Though the source itself didn’t look very damaging, it did damage the brand “facebook”. But I won’t go into that in this post, and instead I would like to discus the facebook internals here which alex.moskalyuk touched upon.

Alex pointed out that this is not the only code from facebook we have seen. Infact we already know a lot more about how facebook works internally than what most of us would find from the source code to the index.php published yesterday.

  1. PHP - This is no surprise. Though PHP is not developed at faceboook, Alex points out that facebook developers are involved atleast at some level in the development of the php.
  2. Apache - Neither should this be
  3. Mysql - Same here..
  4. Valgrind - This is a suite of tools for debugging and profiling Linux programs. With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting, making your programs more stable. You can also perform detailed profiling, to speed up and reduce memory use of your programs. Other tools related to this which they user are callgrind/Calltree , KCachegrind and OProfile.
  5. APC - Facebook developers have talked about using Alternative PHP Cache in some presentations they have given in the past.
  6. Facebook Thrift - Thrift is a software framework for scalable cross-language services development. It combines a powerful software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, and Ruby. Thrift was developed at Facebook, and its been released as open source. More information can be found in this whitepaper.
  7. Memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. The use of this shouldn’t come as a surprise since most of the new web2.0 companies, especially the ones using php and python have experimented or implemented it at some level.
  8. phpsh is another interesting tool facebook developers use internally. It is an interactive shell for php that features readline history, tab completion, quick access to documentation. It is ironically written mostly in python.
  9. Facebook has released a lot of code to support the facebook platform and to get users to develop for it.
  10. Facebook firefox plugin is the last one I’d like to mention here. This again is open source (since you can see the code once you open up the plugin yourself).

References

Popularity: 29%