Showing posts from August, 2007

Scalability Stories for Aug 30

I found a very interesting story on how memcached was created. It's an old story titled "Distributed caching with memcached". I also found an interesting FAQ on memcached which some of you might like. Inside MySpace is another old story which follows MySpace's growth over time. It's a very long and interesting read which shouldn't be ignored. Measuring Scalability tries to put numbers to the problem of scalability. If you have to justify the cost of scalability to anyone in your organization, you should at least skim through this page. I found a wonderful story on the humble architecture of Mailinator and how it grew over time on just one webserver. It receives approximately 5 million emails a day and runs the whole operation pretty much in memory, with no logs or database to leave traces behind. And here is another page from the creator of Mailinator about its stats from Feb. Finally, another very interesting presentation/slide on the topic of "Scalable
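Mailinator's actual design isn't published in code form, but the core idea described above can be sketched in a few lines: keep every inbox in RAM with a hard cap per inbox, so old mail simply falls off the end and nothing ever touches disk. The class and parameter names here are my own invention for illustration.

```python
from collections import deque, defaultdict

class InMemoryMailStore:
    """Toy sketch of an all-in-RAM mail store: no database, no logs.
    Each inbox keeps only the newest `per_inbox` messages, so total
    memory use stays bounded no matter how much mail arrives."""

    def __init__(self, per_inbox=10):
        self.per_inbox = per_inbox
        # A bounded deque per address; full deques drop the oldest entry.
        self.inboxes = defaultdict(lambda: deque(maxlen=per_inbox))

    def deliver(self, to_addr, message):
        # Appending to a full deque silently evicts the oldest message.
        self.inboxes[to_addr].append(message)

    def read(self, to_addr):
        # Reading leaves no trace; nothing is ever written to disk.
        return list(self.inboxes[to_addr])

store = InMemoryMailStore(per_inbox=2)
store.deliver("bob@example.com", "msg1")
store.deliver("bob@example.com", "msg2")
store.deliver("bob@example.com", "msg3")   # evicts msg1
print(store.read("bob@example.com"))       # ['msg2', 'msg3']
```

The trade-off is obvious and deliberate: mail is disposable, so losing the oldest messages under load is a feature, not a bug.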

Thoughts on scalability

Here is an interesting contribution on the topic from Preston Elder. I've worked in multiple extremely super-scaled applications (including ones sustaining 70,000 connections at any one time, 10,000 new connections each minute, and 15,000 concurrent throttled file transfers at any one time - all in one application instance on one machine). The biggest problem I have seen is that people don't know how to properly define their threads' purpose and requirements, and don't know how to decouple tasks that have built-in latency, or how to avoid thread blocking (and locking). For example, often in a high-performance network app you will have some kind of multiplexor (or more than one) for your connections, so you don't have a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally exist only to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of
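The decoupling Preston describes can be sketched minimally: an I/O loop that only reassembles packets and enqueues them, with all expensive work pushed to worker threads. This is an illustrative toy (the "stream" is a list of byte chunks standing in for socket reads), not code from any of the applications he mentions.

```python
import queue
import threading

# The multiplexor's only job: pull raw bytes off the wire, split them
# into complete packets, and hand them off. All real work happens in
# worker threads, so the I/O loop never blocks on processing.
packet_queue = queue.Queue()
results = []

def multiplexor(raw_stream):
    buffer = b""
    for chunk in raw_stream:          # stand-in for non-blocking socket reads
        buffer += chunk
        while b"\n" in buffer:        # chop into packets that make sense
            packet, buffer = buffer.split(b"\n", 1)
            packet_queue.put(packet)  # hand off; do NOT process here

def worker():
    while True:
        packet = packet_queue.get()
        if packet is None:            # shutdown sentinel
            break
        results.append(packet.decode().upper())  # the "expensive" work

t = threading.Thread(target=worker)
t.start()
multiplexor([b"hel", b"lo\nwor", b"ld\n"])   # packets split across reads
packet_queue.put(None)
t.join()
print(results)   # ['HELLO', 'WORLD']
```

Note that the multiplexor happily handles packets fragmented across reads; the worker never sees a partial packet, and the I/O loop never waits on the worker.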

Loadbalancer for horizontal web scaling: What questions to ask before implementing one.

A single server, today, can handle an amazing amount of traffic. But sooner or later most organizations figure out that they need more, and start talking about choosing between horizontal and vertical scaling. If you work for such an organization and also happen to manage networking devices, you might find a couple of loadbalancers on your desk one day, along with a yellow sticky note with a deadline on it. Loadbalancers, by definition, are supposed to solve performance bottlenecks by distributing or balancing load between the different components they manage. Though you would normally find loadbalancers in front of a webserver, a lot of different individuals have found other interesting ways of using them. For example, I know some organizations with so much network traffic that they can't use just one sniffer or firewall to do the job. They ended up using some high-end loadbalancers to intelligently distribute network traffic through multiple sniffers/firewalls attached to them. These are rar
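At its core, the distribution a loadbalancer performs is simple to state. Here is a minimal round-robin sketch (my own illustration, not any vendor's algorithm); real devices layer health checks, weighting, and session stickiness on top of something like this.

```python
import itertools

class RoundRobinBalancer:
    """Naive sketch of a loadbalancer's core job: spread incoming
    requests evenly across a pool of backends in rotation."""

    def __init__(self, backends):
        # itertools.cycle yields backends in order, forever.
        self._cycle = itertools.cycle(backends)

    def route(self, request):
        backend = next(self._cycle)
        return backend, request       # in real life: forward, don't return

lb = RoundRobinBalancer(["web1", "web2", "web3"])
assignments = [lb.route("req%d" % i)[0] for i in range(4)]
print(assignments)   # ['web1', 'web2', 'web3', 'web1']
```

The same rotation works whether the pool members are webservers, sniffers, or firewalls, which is why the technique generalizes the way the post describes.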

Feature or a bug?

Dratz asks: Feature or a bug?

TypePad architecture: Problems and solutions

TypePad was, and probably still is, one of the first and largest paid blogging services in the world. In a presentation at OSCON 2007, Lisa Phillips and Garth Webb spoke about TypePad's problems in 2005. Since this is a common problem for any successful company, I found it interesting enough to research a little more. TypePad, like any other service, was initially designed in the traditional way, with Linux, Postgres, Apache, mod_perl and Perl on the front end, and NFS storage for images on a filer. At that time they were pushing close to 250 Mbps (4 TB per day) through multiple pipes, and with a growing user base, activity and data they were growing at the rate of 10 to 20% per month. Just before the planned move to a newer, better data center, sometime in Oct 2005, TypePad started experiencing all kinds of problems due to its unexpected growth. The unprecedented stress on the system caused multiple failures over the next two months, ranging from hardware, software and storage to networking issues.

Web storage for backups

I'm contemplating using S3 for backups. Paul Stamatiou has a script ready to go. The thing that convinced me was a chart Paul showed: for 10GB of space he paid under 3 dollars per month. That's really cheap... GMail, Microsoft and Yahoo all provide extra storage as well, but none of them have stable, company-supported APIs to allow users to upload data in this form.
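The arithmetic behind that chart is easy to reproduce. The rates below are the S3 prices I recall from that era (roughly $0.15 per GB-month stored and $0.10 per GB uploaded); treat them as illustrative assumptions and check the current price list before relying on them.

```python
# Back-of-the-envelope S3 backup cost, using approximate 2007-era rates.
STORAGE_PER_GB_MONTH = 0.15   # assumed storage price, $/GB-month
UPLOAD_PER_GB = 0.10          # assumed inbound transfer price, $/GB

def monthly_cost(stored_gb, uploaded_gb):
    # Storage is billed per GB-month; transfer per GB moved.
    return stored_gb * STORAGE_PER_GB_MONTH + uploaded_gb * UPLOAD_PER_GB

cost = monthly_cost(stored_gb=10, uploaded_gb=10)
print("$%.2f" % cost)   # $2.50 -- under three dollars, consistent with the chart
```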

How the Skype network handles scalability

There was a major Skype outage last week, and though there is an "official explanation" and other discussions about it floating around, I found this comment from one of the GigaOm readers more interesting to think about. Now, this particular description may not accurately describe the problem (it might be speculation as well), but it does describe, in a few words, how Skype's p2p network scales out. You should also take a look at the detailed discussion of the Skype protocol here.

Number of Skype authentication servers: Count == 50; // Clustered
Number of potential Skype clients: Count = 220,000,000 // Mostly decentralized
Number of SuperNode clients to maintain network connectivity: Count = N / 300 at any one time.
• If there are 3.0 million users online then the ratio is 3,000,000 / 300 = 10,000 == Supernodes available
• Supernodes are bootstraps into the network for normal first-run clients ("and handle routing of children calls").
•
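The supernode arithmetic quoted above is worth making explicit, since it is the whole scaling story in one ratio: the overlay provisions itself at roughly one supernode per 300 online clients (per this reader's numbers, not an official Skype figure).

```python
# One supernode per ~300 online clients keeps the overlay connected,
# according to the GigaOm comment quoted above (unofficial numbers).
SUPERNODE_RATIO = 300

def supernodes_needed(online_users):
    # Integer division: supernodes are whole clients promoted from the pool.
    return online_users // SUPERNODE_RATIO

print(supernodes_needed(3_000_000))   # 10000, matching the comment's example
```

The elegant part is that supernodes are themselves ordinary clients, so capacity grows automatically with the user base; the outage discussion is about what happens when too many of them restart at once.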

Links on scalability, performance and problems

8/19/2007 Big Bad Postgres SQL
8/19/2007 Scalable internet architectures
8/19/2007 Production troubleshooting (not related to scalability)
8/19/2007 Clustered Logging with mod_log_spread
8/19/2007 Understanding and Building HA/LB clusters
8/12/2007 Multi-Master Mysql Replication
8/12/2007 Large-Scale Methodologies for the World Wide Web
8/12/2007 Scaling gracefully
8/12/2007 Implementing Tag cloud - The nasty way
8/12/2007 Normalized Data is for sissies
8/12/2007 APC at facebook
8/6/2007 Plenty Of fish interview with its CEO
8/6/2007 PHP scalability myth
8/6/2007 High performance PHP
8/6/2007 Digg: PHP's scalability and Performance

BuiltWith: Find out what a website's frontend is built with

This is a very interesting website which allows you to understand the technology behind the websites you visit. Here is more from its about page: BuiltWith is a web site profiler tool. Upon looking up a page, BuiltWith returns all the technologies it can find on the page. BuiltWith's goal is to help developers, researchers and designers find out what technologies pages are using, which may help them decide what technologies to implement themselves. BuiltWith technology tracking includes widgets (snap preview), analytics (Google, Nielsen), frameworks (.NET, Java), publishing (WordPress, Blogger), advertising (DoubleClick, AdSense), CDNs (Amazon S3, Limelight), standards (XHTML, RSS), and hosting software (Apache, IIS, CentOS, Debian).

Getting ready for Social Network portability

Brad Fitzpatrick and David Recordon kicked off another round of discussions on aggregating, decentralizing and social network portability in a post called "Thoughts on the Social Graph". The post is long, but they summarized the problem statement in a few lines: Users and developers alike are going crazy. There's too many social networks out there to keep track of. Developers want to make more, and users want to join more, but it's all too much work to re-enter your friends and data. We need to lower the amount of pain for both users and developers and let a thousand new social applications bloom. I've mentioned this problem in the past as well and feel this is long overdue. Sites like Plaxo and Facebook have taken a step in the right direction, but that's not the solution. As I see it, the real solution should be something similar to the XMPP standard, which opened up the chat protocol to allow decentralized chat networks to work with each other. Also read

Microsoft Live ID is out: Google to support OpenID soon... I predict

The other day I briefly mentioned the pain point of the web 2.0 world and how consolidation, aggregation and summarization will help reduce some of it. Microsoft today formally announced the availability of Microsoft Live ID as a contender for providing SSO (single sign-on) services in the web 2.0 world. Live ID, in case you didn't know, is the repackaged version of Microsoft Passport Network, which had failed so badly that it forced Microsoft to pull it out of the market. Here are some examples of how to use languages like PHP, Perl, Python, Ruby etc. to do authentication using Live ID. Microsoft is not the first one to openly come out with an SSO technology. Liberty Alliance and OpenID are other open-source competitors which already have some foothold in this market. The move to SSO in the web 2.0 world is bound to happen, regardless of how scary some people might find it to be. If you can trust your online bank with 100000 dollars and trust 3 compa

DNS Rebinding... what?

Everyone who knows what a "DNS rebinding attack" is, please raise your hands. I'm so glad I can't see yours, because I'm ashamed of myself for not knowing this one. For those who are "pretending" not to know, please read on. Browsers use domain names to enforce the same-domain policy for a lot of security features. Interestingly, depending on which client you are using, it's possible to set a low DNS TTL and change the IP address such that, without any change in the domain name, a script could interact with another website - as long as the browser can be made to believe that it's still the same domain. To do this, all the attacker needs to do is initially serve content from their own server and, while the javascript is running, update the DNS so that the javascript can interact with a new domain from which it can steal information for the attacker. There are some safeguards to stop these kinds of attacks, but for the most part these kinds of attack can be done easily on the
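One of the classic safeguards is "DNS pinning": the browser resolves a hostname once and keeps using that first IP for the lifetime of the page, ignoring the attacker's artificially low TTL. The sketch below is my own simplified illustration of the idea, with a fake resolver standing in for the attacker's DNS server; it is not how any particular browser implements pinning.

```python
import socket

# DNS pinning sketch: cache the first answer per hostname and ignore
# later changes, so a low-TTL rebind can't swap the IP mid-session.
_pinned = {}

def pinned_resolve(hostname, resolver=socket.gethostbyname):
    if hostname not in _pinned:
        _pinned[hostname] = resolver(hostname)   # first answer wins
    return _pinned[hostname]                     # later rebinds are ignored

# Simulated attack: the "DNS server" changes its answer between lookups,
# pointing the second lookup at an internal address.
answers = iter(["203.0.113.7", "10.0.0.1"])
fake_resolver = lambda host: next(answers)

first = pinned_resolve("evil.example", fake_resolver)
second = pinned_resolve("evil.example", fake_resolver)
print(first, second)   # 203.0.113.7 203.0.113.7 -- the rebind never lands
```

Pinning has its own problems (it breaks legitimate fast failover, and plugins with their own resolvers can bypass it), which is part of why the attack was still viable when this was written.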

ETags and loadbalancers

A few weeks ago the company I work for noticed a weird problem with its CDN (Content Delivery Network) provider. HEAD requests were being answered by the CDN edge nodes using cached objects which had already expired. What's worse, even after an explicit content expiry notification was sent, the HEAD responses were still wrong. Long story short, the CDN provider had to set up bypass rules so that HEAD requests always bypass the cache. There was a slight performance overhead with this, but the workaround solved the problem. While this was going on, one of the guys at CDN support helping us mentioned something about ETags and why we should be using them. I didn't understand how ETags would solve the problem if the CDN itself had a bug that was ignoring expiry information, but I said I'd investigate. Anyway, the traditional way of communicating object expiry is the Last-Modified timestamp. ETags are another way of doing t
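For readers who haven't used them, the ETag mechanism is easy to sketch: the server fingerprints the response body, the client echoes the fingerprint back in an If-None-Match header, and a match means "304 Not Modified, keep your copy". The hashing scheme below (MD5 of the body) is just one common convention; servers are free to generate ETags however they like.

```python
import hashlib

def make_etag(body):
    # Fingerprint the body; the surrounding quotes are part of ETag syntax.
    return '"%s"' % hashlib.md5(body).hexdigest()

def handle_get(body, if_none_match):
    """Toy server-side view of conditional GET with ETag validation."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, None, etag      # client's cached copy is still good
    return 200, body, etag          # send the full body plus the validator

status1, _, etag = handle_get(b"<html>v1</html>", None)      # first fetch
status2, body2, _ = handle_get(b"<html>v1</html>", etag)     # revalidation
status3, body3, _ = handle_get(b"<html>v2</html>", etag)     # content changed
print(status1, status2, status3)   # 200 304 200
```

Unlike Last-Modified, which has one-second granularity and drifts when clocks differ, an ETag changes if and only if the content does - which is why the CDN support folks kept bringing it up.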


There is a very interesting interview with Markus Frind, the one-man army behind the website. The site boasts traffic of about 30 million page views a day, and runs on a single webserver with a couple of database servers. Markus has found interesting ways of surviving the different kinds of problems he has had. Here is the direct link to the interview in wmv format.

Facebook internals

The code leaked during a Facebook bug was posted online by an anonymous user. Though the source itself didn't look very damaging, it did damage the brand "Facebook". But I won't go into that in this post; instead I'd like to discuss the Facebook internals here, which alex.moskalyuk touched upon. Alex pointed out that this is not the only code from Facebook we have seen. In fact, we already know a lot more about how Facebook works internally than what most of us would find from the source code to the index.php published yesterday.

PHP - This is no surprise. Though PHP is not developed at Facebook, Alex points out that Facebook developers are involved at least at some level in the development of PHP.
Apache - Neither should this be.
Mysql - Same here.
Valgrind - This is a suite of tools for debugging and profiling Linux programs. With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding h

Mysql Cluster

Link: "Introduction to MySQL Cluster. The NDB storage engine (MySQL Cluster) is a high-availability storage engine for MySQL. It provides synchronous replication between storage nodes, with many MySQL servers having a consistent view of the database. In 4.1 and 5.0 it's a main-memory database, but in 5.1 non-indexed attributes can be stored on disk. NDB also provides a lot of determinism in system resource usage. I'll talk a bit about that."
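To make the "storage nodes plus many MySQL servers" topology concrete, here is a minimal sketch of a cluster config.ini as seen from the management node: two data (ndbd) nodes holding synchronous replicas of every fragment, plus one SQL node. The hostnames are placeholders, and this is a bare-bones illustration rather than a production configuration.

```ini
# Minimal MySQL Cluster layout sketch (hostnames are placeholders).
[ndbd default]
NoOfReplicas=2                  # each data fragment lives on two data nodes

[ndb_mgmd]
HostName=mgmt.example.com       # management node

[ndbd]
HostName=data1.example.com      # data node 1

[ndbd]
HostName=data2.example.com      # data node 2

[mysqld]
HostName=sql1.example.com       # SQL node: regular mysqld speaking to NDB
```

Tables opt in per-table with ENGINE=NDBCLUSTER; every SQL node then sees the same synchronously replicated data.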

Facebook code leaked... but was it hacked too?

Everyone will be talking about this soon. Someone leaked the source of the index page of Facebook on a website called Facebook Secrets. Update: Brandee Barker from Facebook responded to Nik on TechCrunch: Hi Nic- I wanted to clarify a few things in your story. Some of Facebook’s source code was exposed to a small number of users due to a bug on a single server that was misconfigured and then fixed immediately. It was not a security breach and did not compromise user data in any way. The reprinting of this code violates several laws and we ask that people not distribute it further. Thanks to you and the TC readers for helping us out on this one. Brandee Barker, Facebook. What is not clear is whether this was a hack or whether someone inside was involved. This is what Nik Cubrilovic from TechCrunch has to say: "There are a number of clear ramifications here. The first is that the code can be used by outsiders to better understand how the Facebook application works, for the purp

How To Design A Good API and Why it Matters

A very interesting Google talk about designing a good API. This may not seem like a scalability issue, but if you really want to host a horizontally scalable system you need a good, scalable API design to go with it. Every day around the world, software developers spend much of their time working with a variety of Application Programming Interfaces (APIs). Some are integral to the core platform, some provide access to widely distributed frameworks, and some are written in-house for use by a few developers. Nearly all programmers occasionally function as API designers, whether they know it or not. A well-designed API can be a great asset to the organization that wrote it and to all who use it. Good APIs increase the pleasure and productivity of the developers who use them, the quality of the software they produce, and ultimately, the corporate bottom line. Conversely, poorly written APIs are a constant thorn in the developer's side, and have been known to harm the b

Content delivery networks: Will a price war boost web performance?

GigaOm has an interesting write-up on the commoditization of the CDN service and the price war raging in the industry. Akamai itself saw a significant stock drop in the last couple of weeks. "That burp has come with the increase in the number of competitors, each one trying to cash in on the boom in online video and other digital content. Limelight Networks (LLNW), Level 3 (LVLT), Internap (INAP), CDNetworks, along with new entrants Panther Express and EdgeCast Networks are some of the CDN players currently involved in a catfight with Akamai." A CDN is an excellent way of boosting performance and providing PoPs in different parts of the world, which benefit from faster content delivery.

Accoona going public... why?

Mashable mentioned that Accoona is going public. It says: Most of Accoona's revenue comes from its e-commerce business, which operates in North America. Its online lead generation and search engine services are used in the US, Europe and China. Its search technology was hailed as a viable competitor to other major search engines such as Google when it launched its Internet service a few years ago. Accoona's attempt at differentiation is its semantic search, incorporating the meaning of words into your queries, allowing you to further filter your search results based on your highlighted keywords, and revising information in real time, offering relevant data such as fax and phone numbers, addresses, etc. for particular information you look up. My question is... why? The site itself looks unpleasant to visit, is slow to search, and has at least a few implementation bugs. On top of that, I found the advertisements annoying to look at and the search

New Talks and Slides links from Aug 5 2007

If you haven't seen these links before, you should check this page first: "Talks and slides from web architects". But if you have already seen that page, here are the updates from last week.

PDF Case for Shared Nothing
PDF The Chubby Lock Service for Loosely-Coupled Distributed Systems
Building Highly Scalable Web Applications
1/1/2006 Slides The Ebay architecture
1/1/2007 Slides PHP & Performance
4/20/2007 Video Brad Fitzpatrick - Behind the Scenes at LiveJournal: Scaling Storytime
5/4/2006 Scalable computing with Hadoop
6/3/2007 Hadoop Map/Reduce
8/3/2007 Introduction to hadoop
6/1/2007 Slides Hadoop distributed file system
Yahoo experience with hadoop
7/25/2007 Meet Hadoop
8/3/2007 webpage The Hadoop Distributed File System: Architecture and Design
7/25/2007 Blog Yahoo's Hadoop Support
7/18/2007 Blog Running Hadoop MapReduce on Amazon EC2 and Amazon S3
8/3/2007 Interpreting the Data: Parallel Analysis with Sawzall
10/18/2005 Video BigT

Crowdsourcing the google way

Remember Google's innovative Image Labeler idea? They seem to be doing it again, getting the masses to build maps for Google in India. India, unlike the US and many other western countries, doesn't have well-documented maps for its streets. Eicher is the only organization I know of which actively maps and provides printed maps in India. Here is what Brady Forrest has to say: "Google has been sending GPS kits to India that enable locals to make more detailed maps of their area. After the data has been uploaded and then verified against other participants' data, it becomes a part of the map. The process is very reminiscent of what Open Street Map, the community map-building project, has been doing. The biggest difference is that the data (to my knowledge) is owned by Google and is not freely available back to the community like it is with OSM."

The "me too" phenomenon and Identity theft

A very interesting article from Muhammad Saleem on the "me too" phenomenon. My problem with this phenomenon is that it might make identity theft easier than before. In this new web 2.0 world, if I need your passwords or your mother's maiden name, all I have to do is build an interesting application which you would like to try out at least once. Once I have your password or other key information (which will most likely be the same across all your applications), I can shut the site down and do other interesting things. I'm an open advocate of OpenID, which addresses some of these issues, but it's no silver bullet. More from Muhammad's blog: "Everyday a new company announces a 'new' product which is nothing more than the old product with slight modifications or a few small additional features. This mentality is not only bad for users but also for marketers and even the startups. A prime example of this phenomenon can be witnessed by comparing Dodgeball , Twitte

Hadoop and HBase

This may not be a surprise to a lot of people, but it was for me. Even though I have been using Lucene and Nutch for some time, I didn't really know enough about Hadoop and HBase until recently.

Hadoop
Scalable: Hadoop can reliably store and process petabytes.
Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then pro
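The map/sort/reduce flow described above is easiest to see with the canonical word-count example. This sketch simulates the framework in-process - the sort step stands in for Hadoop's shuffle, which groups all pairs with the same key before reduction - so it shows the programming model, not the distributed machinery.

```python
from itertools import groupby

# Word count in the MapReduce style: map emits (word, 1) pairs, the
# shuffle/sort groups pairs by key, and reduce sums each group.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(sorted_pairs):
    # groupby requires its input sorted by key -- exactly what the
    # shuffle guarantees in a real Hadoop job.
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

pairs = sorted(map_phase(["the cat sat", "the cat ran"]))  # "shuffle" step
counts = dict(reduce_phase(pairs))
print(counts)   # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

In a real job, HDFS splits the input across nodes and many mappers and reducers run in parallel, but each one executes exactly this per-record logic.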

Talks and slides from various web architects

For the latest set of links go here. This is a collection of various slides, PDFs and videos about designing scalable websites that I have collected over time. If you have something interesting which might go in here, please let me know.

Date Type Title
6/23/2007 Blog Getting Started with Drupal
6/23/2007 Blog 4 Problems with Drupal
6/23/2007 Video Seattle Conference on Scalability: MapReduce Used on Large Data Sets
6/23/2007 Video Seattle Conference on Scalability: Scaling Google for Every User
6/23/2007 Video Seattle Conference on Scalability: VeriSign's Global DNS Infrastructure
6/23/2007 Video Seattle Conference on Scalability: YouTube Scalability
6/23/2007 Video Seattle Conference on Scalability: Abstractions for Handling Large Datasets
6/23/2007 Video Seattle Conference on Scalability: Building a Scalable Resource Management
6/23/2007 Video Seattle Conference on Scalability: SCTP's Reliability and Fault Tolerance
6/23/2007 Video Seattle Conference on Scalability: Lessons In Building Scalable Sys

Scalable web architectures

I've been reading a lot about scalable web architectures lately, and have made a big enough collection of links to see that this could be interesting to others. Instead of putting all those links here in this blog, I've started a separate blog here. If you have an interesting link or links to share, please send them over to me.

Youtube scalability

Scalable Internet Architectures

By Theo Schlossnagle. As a developer, you are aware of the increasing concern amongst developers and site architects that websites be able to handle the vast number of visitors that flood the Internet on a daily basis. Scalable Internet Architectures addresses these concerns by teaching you both good and bad design methodologies for building new sites, and how to scale existing websites into robust, high-availability websites. Primarily example-based, the book discusses major topics in web architectural design, presenting existing solutions and how they work. Technology budget tight? This book will work for you, too, as it introduces new and innovative concepts for solving traditionally expensive problems without a large technology budget. Using open-source and proprietary examples, you will be engaged in best-practice design methodologies for building new sites, as well as appropriately scaling both growing and shrinking sites. Website development help has arrived in the form of Scalab

Book: Building Scalable Web Sites

Building, scaling, and optimizing the next generation of web applications. By Cal Henderson. Learn the tricks of the trade so you can build and architect applications that scale quickly, without all the high-priced headaches and service-level agreements associated with enterprise app servers and proprietary programming and database products. Culled from the experience of the lead developer, Building Scalable Web Sites offers techniques for creating fast sites that your visitors will find a pleasure to use. Creating popular sites requires much more than fast hardware with lots of memory and hard drive space. It requires thinking about how to grow over time, how to make the same resources accessible to audiences with different expectations, and how to have a team of developers work on a site without creating new problems for visitors and for each other.
* Presenting information to visitors from all over the world
* Integrating email with your web applications
* Planning hardwar