Scaling PHP : HipHop and Quercus

While PHP is very popular, it unfortunately doesn’t perform as well as some of its competitors. One way to make things faster is to write PHP extensions in C++. In this post we will describe two different approaches developers can take to this problem; the mileage you get from either may vary.

Since Facebook runs mostly on PHP, it noticed this problem pretty early. But instead of asking its developers to move from PHP to C++, one of its developers hacked up a solution to transform PHP code into C++.

Yesterday, Facebook announced it is opening up HipHop, a source code transformer which converts PHP code into more optimized C++ and uses g++ to compile it. With some minor sacrifices (no eval support), they were able to get a 50% performance improvement. And since they serve 400 billion page views every month, that kind of saving can free up a lot of servers.

More info on HipHop

Quercus, on the other hand, is a 100% Java implementation of PHP 5. What makes this more interesting is that Quercus can now run on Google App Engine in pretty much the same way JSPs can.

Caucho’s Quercus presents a new mixed Java/PHP approach to web applications and services where Java and PHP tightly integrate with each other. PHP applications can choose to use Java libraries and technologies like JMS, EJB, SOA frameworks, Hibernate, and Spring. This revolutionary capability is made possible because 1) PHP code is interpreted/compiled into Java and 2) Quercus and its libraries are written entirely in Java. This architecture allows PHP applications and Java libraries to talk directly with one another at the program level. To facilitate this new Java/PHP architecture, Quercus provides an API and interface to expose Java libraries to PHP.
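
To give a flavor of that last point, a Quercus module is essentially a Java class whose public methods are exposed to PHP as functions. The sketch below is my own minimal illustration, not code from Caucho: the class and function names are made up, the AbstractQuercusModule base class is recalled from memory, and registration of the module with the Quercus runtime is omitted.

import com.caucho.quercus.module.AbstractQuercusModule;

// Hypothetical Quercus module: public methods on the class become
// callable from PHP, e.g. hello_from_java("Quercus") in a .php page.
public class HelloModule extends AbstractQuercusModule {

    // Exposed to PHP as hello_from_java($name)
    public String hello_from_java(String name) {
        return "Hello from Java, " + name;
    }
}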

The demo of Quercus running on GAE was very impressive. Any pure PHP code which doesn’t need to interact with external services would work beautifully on GAE without any issues. But the absence of MySQL on GAE means SQL queries have to be mapped to the datastore (BigTable), which might require a major rewrite of parts of the application. It’s not impossible though, as they have shown by making WordPress run on GAE (crawl might be a better word).

While Quercus is open source and is as fast as regular PHP in interpreted mode, the compiler, which is much faster, is not free. Regardless, Quercus is a step in the right direction, and I sincerely hope PHP support on GAE is here to stay.

Windows Azure

Windows Azure is an application platform provided by Microsoft to allow others to run applications on Microsoft’s “cloud” infrastructure. It’s finally open for business (as of Feb 1, 2010). Below are some links about Azure for those who are still catching up.

Wikipedia: Windows Azure has three core components: Compute, Storage and Fabric. As the names suggest, Compute provides a computation environment with the Web Role and Worker Role, while Storage focuses on providing scalable storage (Blobs, Tables, Queues) for large scale needs.

The hosting environment of Windows Azure is called the Fabric Controller – which pools individual systems into a network that automatically manages resources, load balancing, geo-replication and application lifecycle without requiring the hosted apps to explicitly deal with those requirements.[3] In addition, it also provides other services that most applications require — such as the Windows Azure Storage Service that provides applications with the capability to store unstructured data such as binary large objects, queues and non-relational tables.[3] Applications can also use other services that are a part of the Azure Services Platform.

The real concerns about Cloud infrastructure (as it is today)

While “private clouds may not be the future” they are definitely needed today. Here are some of the top issues bothering organizations which have been thinking about moving into the cloud. Some of the issues are based on Craig Bolding’s talk on “Guide to cloud security”.

  • Unlike your own data center, you will never know what the cloud vendors are running, how they back up, or what their DR plans are. They will say you shouldn’t care, but do you remember what happened to T-Mobile customers on Danger?
  • Uptime, availability and responsiveness are less predictable than in a self-hosted environment. In most cases the cloud vendors may not even choose to let customers know about major maintenance if they don’t anticipate any issues. Organizations which manage their own infrastructure, by contrast, generally try to avoid making two major interdependent changes at the same time.
  • Multi-Tenancy means you may have to worry about a noisy neighbor.
  • Multi-Tenancy could also lead to interesting issues which were never thought about before. What if there were a way to do an “injection attack”? Depending on how Multi-Tenancy is implemented, you could potentially touch other customers’ data.
  • Infrastructure and platform lock-in issues are worrying for many organizations who are thinking long term. Most cloud vendors don’t really have a long history to show their track record.
  • Change control and detailed change logs are missing.
  • Individual customers don’t have much decision-making power over what a vendor should do next. In a privately hosted environment the stakeholders are consulted before something is done, but on a large shared infrastructure you are a small fish in a huge pond.
  • Most cloud vendors have multiple layers of cloud infrastructure dependent on each other. It’s hard to understand how issues in one type of cloud could impact the others. This is especially true from a security viewpoint: a bad flaw in a lower layer of the architecture could impact all the platforms built over it.
  • Moving applications to cloud means dealing with a different style of programming designed for horizontal scalability, data consistency issues, health monitoring, load balancing, managing state, etc.
  • Identity management is still in its early stages. Integration with corporate identity management infrastructure will be important to make external clouds easy to use for individuals from large organizations.
  • Who takes care of scrubbing disks when data is moved around? What about data on backup tapes? This is very important for applications handling highly sensitive data.
  • Just like credit card fraud, one has to worry about CPU time fraud. Is the current billing and reporting good enough to help large organizations figure out what is real and what could be fraud? They need a real-time fraud detection mechanism. And what about loss of service due to DoS attacks? Who pays for that?
  • Need a better mechanism to bill large corporations.
  • On the non-technical side, there are a lot of questions related to SLAs, compliance, terms of service, legal issues around cross-border services, and even questions about whether law enforcement follows a different set of rules when search and seizure is required.
  • Not too far from being another form of “outsourcing”.

Photo credit: akakumo

Hive @Facebook

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

At a user group meeting, Ashish Thusoo from the Facebook data team spoke about how Facebook uses Hive for its data processing needs.

Problem

Facebook is a free service and has been experiencing rapid growth in the last few years. The amount of data it collects, which used to be around 200GB per day in March 2008, has now grown to 15TB per day. Facebook realized early on that insights derived from simple algorithms on more data are better than insights from complex algorithms on smaller sets of data.

But the traditional approach towards ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in the size it could scale to. This is when they started experimenting with Hadoop.

How Hadoop gave birth to Hive

Hadoop turned out to be superior in availability, scalability and manageability. Its efficiency wasn’t that great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that although partial availability, resilience and scale were more important than ACID at that point, they had a hard time finding Hadoop programmers within Facebook to make use of the cluster.

It was this that eventually forced Facebook to build a new way of querying data from Hadoop which doesn’t require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let’s look at a couple of examples of Hive queries.

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Hive’s long term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that it uses map-reduce for execution and HDFS for storage. They modeled the language on SQL and designed it to be extensible, interoperable and able to outperform traditional processing mechanisms.

How it is used

Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for “Ad hoc analysis” which is free for all/most Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented stuff.

Facebook Hive/Hadoop statistics

The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster where most of the data processing happens has about 8400 cores and roughly 12.5 PB of raw storage, which translates to 4 PB of usable storage after replication. Each node in that cluster is an 8-core server with 12TB of storage.
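
Doing the arithmetic on those numbers: 8400 cores at 8 cores per node works out to roughly 1050 nodes, and 12.5 PB of raw storage shrinking to about 4 PB of usable space is consistent with HDFS’s default replication factor of 3.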

All in all, Facebook takes in 12 TB of compressed new data and scans about 135 TB of compressed data per day. There are more than 7500 Hive jobs which use up about 80,000 compute hours each day.

References

HAProxy : Load balancing

Designing any scalable web architecture would be incomplete without investigating “load balancers”.  There used to be a time when selecting and installing load balancers was an art by itself. Not anymore.

A lot of organizations today use Apache web servers as a proxy server (and also as a load balancer) for their backend application clusters. Though Apache is the most popular web server in the world, it is also considered overweight if all you want to do is proxy a web application. The huge codebase that Apache comes with, and the separate modules which need to be compiled and configured with it, can quickly become a liability.

HAProxy is a tiny proxying engine which doesn’t have all the bells and whistles of Apache, but is highly qualified to act as an HTTP/TCP proxy server. Here are some of the other wonderful things I liked about it:

  • Extremely tiny codebase. Just two runtime files to worry about, the binary and the configuration file.
  • Compiles in seconds. 10 seconds the last time I did it.
  • Logs to syslog by default
  • Can load balance HTTP as well as regular TCP connections. Can easily load balance most non-HTTP applications.
  • Can do extremely detailed performance (and cookie capture) logging. It can differentiate backend processing time from the end-user request completion time. This is extremely helpful in monitoring performance of backend services.
  • It can do sticky load balancing out of the box
  • It can use application generated cookies instead of self-assigned cookies.
  • It can do health monitoring of the nodes and automatically remove them when health checks fail
  • And it has a beautiful web interface for application admins who care about numbers.

A few other notes

  • HAProxy really doesn’t serve any files locally, so it’s definitely not a replacement for your Apache instance if you are using it to serve local files.
  • It doesn’t do SSL, so you still need an SSL termination engine in front of it if you need secure HTTP.
  • HAProxy is not the only Apache replacement. Varnish is a strong candidate which can also do caching (with ESI). And while you are at it, do take a look at Perlbal, which looks interesting.

Some Interesting external links

Finally, here is a sample configuration file with most of the features I mentioned above configured for use. This is the entire thing and should be good enough for a production deployment with minor changes.

# Process-wide settings: logging target, global connection limit and the user/group to run as
global
        log loghost logfac info
        maxconn 4096
        user webuser
        group webuser
        daemon

# Defaults inherited by every proxy section below
defaults
        log     global
        stats   enable
        mode    http
        option  httplog
        option  dontlognull
        option  httpclose
        retries 3
        option  redispatch
        maxconn 2000
        contimeout      5000
        clitimeout      300000
        srvtimeout      300000

# Listener on port 8000: HTTP health checks, cookie-based stickiness,
# detailed request/cookie capture for logging, and round-robin balancing
listen  http_proxy 0.0.0.0:8000
        option httpchk HEAD /app/health.jsp HTTP/1.0
        mode http
        cookie SERVERID insert
        capture cookie JSESSIONID len 50
        capture request header Cookie len 200
        capture request header Host len 50
        capture request header Referer len 200
        capture request header User-Agent len 150
        capture request header Custom-Cookie len 15
        appsession JSESSIONID len 32 timeout 3600000

        balance roundrobin
        server server1_name server1:8080 weight 1 cookie server1_name_cookie check inter 60000
        server server2_name server2:8080 weight 1 cookie server2_name_cookie check inter 60000

Cassandra for service registry/discovery service

My last post was about my struggle to find a good distributed ESB/service-discovery solution built on open source tools that is simple to use and maintain. Thanks to reader comments (Dan especially) and some other email exchanges, it seems like building a custom solution is unavoidable if I really want to keep things simple.

Dan suggested that I could use DNS to find seed locations for the config store, which would work very well in a distributed network. If security weren’t a concern this seed location could have been on S3 or SimpleDB, but the requirement that it be secured on internal infrastructure forced me to investigate simple replicated/eventually-consistent databases which could be hosted internally in different data centers with little or no long-term administration cost.

My search led me to investigate a few different NoSQL options.

But the one I finally settled on as a possible candidate was Cassandra. Unlike with some of the others, and since our application platform is Java-based, Cassandra was simple to install and set up. The fact that Facebook uses it to store 50TB of data across 150 servers helped convince us it was stable as well.

The documentation on this project isn’t as extensive as I would have liked, but I did get it running pretty fast. Building a service registry/discovery service on top of it is what’s next on my mind.
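
To make that concrete, here is the kind of interface and column layout I have in mind. This is purely my own sketch: the ServiceRegistry name and the “ServiceRegistry” column family are hypothetical, and it deliberately avoids committing to any particular Cassandra client API.

import java.util.Map;

/**
 * Hypothetical registry interface backed by Cassandra. One possible data
 * model: a "ServiceRegistry" column family where the row key is the service
 * name and each column maps an endpoint ("10.0.1.5:8080") to its last
 * heartbeat timestamp. Reads and writes would go through whichever Cassandra
 * client (Thrift or a wrapper library) the deployment settles on.
 */
public interface ServiceRegistry {

    /** Register or refresh an endpoint for a service, e.g. "billing" -> "10.0.1.5:8080". */
    void register(String serviceName, String endpoint, long heartbeatMillis);

    /** Return the known endpoints for a service, mapped to their last heartbeat. */
    Map<String, Long> discover(String serviceName);

    /** Remove an endpoint, e.g. on graceful shutdown. */
    void unregister(String serviceName, String endpoint);
}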

More on Cassandra

If you are interested in learning more about Cassandra, I recommend you listen to this talk by Avinash Lakshman (Facebook) and read a few of the other posts listed here.

Cassandra: Articles

  • Cassandra — Getting Started: Cassandra data model from a Java perspective

  • Using Cassandra’s Thrift interface with Ruby

  • Cassandra and Thrift on OS X: one, two, three

  • Looking to the Future with Cassandra: how Digg migrated their friends+diggs data set to Cassandra from mysql

  • Building Scalable Databases: Denormalization, the NoSQL Movement and Digg

  • WTF is a SuperColumn? An Introduction to the Cassandra Data Model

  • Meet Scalandra: Scala wrapper for Cassandra

  • Cassandra and Ruby: A Love Affair? – Engine Yard’s walk-through of the Cassandra gem

  • Up and Running with Cassandra: featuring data model examples of a Twitter clone and a multi-user blog, and ruby client code

  • Facebook Engineering notes and Cassandra introduction and LADIS 2009 paper

  • ArchitectureInternals

  • ArchitectureGossip

Cassandra: Presentations

  • Cassandra in Production at Digg from NoSQL East 09

  • Introduction to Cassandra at OSCON 09

  • What Every Developer Should Know About Database Scalability: presentation on RDBMS vs. Dynamo, BigTable, and Cassandra

  • IBM Research’s scalable mail storage on Cassandra

  • NOSQL Video and NOSQL Slides: more on Cassandra internals from Avinash Lakshman.

  • Video of a presentation about Cassandra at Facebook: covers the data model of Facebook’s inbox search and a lot of implementation details. Prashant Malik and Avinash Lakshman presenting.

  • Cassandra presentation at sigmod: mostly the same slides as above

If any of you have worked on Cassandra, please let me know how that has been working out for you.

    Google App Engine review (Java edition)

    For the last couple of weekends I’ve been playing with Google App Engine (Java edition) and was pleasantly surprised at the direction it has taken. I was also fortunate enough to see some Google engineers talk on this subject at Google I/O, which helped me a lot in compiling all this information.

    But before I get into the details, I’d like to warn you that I’m not a developer, let alone a Java developer. My experience with Java has been limited to prototyping ideas and wasting time (and now probably yours too).

    Developing on GAE isn’t very different from other Java-based development environments. I used the Eclipse plugin to build and test GAE apps in the sandbox on my laptop. For the most part everything you did before will still work, but there are limitations introduced by GAE which try to force you to write code that scales.

    1. Threads can’t be created – but one can modify the existing thread’s state
    2. Direct network connections are not allowed – URLConnection can be used instead (see the sketch after this list)
    3. Direct file system writes are not allowed – use memory, memcache or the datastore instead (apps can read files which are uploaded as part of the app)
    4. Java2D is not allowed – but parts of the Images API and software rendering are allowed
    5. Native code is not allowed – only pure Java libraries are allowed
    6. There is a JRE class whitelist which you can refer to to find out which classes are supported by GAE.
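
    To give a feel for restriction 2, outbound HTTP goes through the standard java.net classes (which GAE backs with its URL Fetch service). A minimal sketch; the address below is just a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class UrlFetchExample {
        // Fetches a page over HTTP using URLConnection, which GAE allows
        // in place of raw sockets. The URL passed in is only a placeholder.
        public static String fetch(String address) throws Exception {
            URL url = new URL(address);
            URLConnection conn = url.openConnection();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            try {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return body.toString();
            } finally {
                reader.close();
            }
        }
    }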

    GAE runs inside a heavily modified version of the Jetty/Jasper servlet container, currently using Sun’s 1.6 JVM (client mode). Most of what you did to build a webapp still applies, but because of the limitations on what can work on GAE, the libraries and frameworks which are known to work should be explicitly checked for. If you are curious whether the library/framework you use for your webapp will work on GAE, check out the official list of known working options (“Will it play in App Engine”).

    Now the interesting part. Each request gets a maximum of 30 seconds in which it has to complete, or GAE will throw an exception. If you are building a web application which requires a large number of datastore operations, you have to figure out how to break requests into small chunks so that each completes within 30 seconds. You also have to figure out how to detect failures so that clients can reissue requests when they fail.
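
    One pattern for staying under the limit (a rough sketch of my own, not an official recipe; the “Bookmark” kind and the batch size are made up) is to process a bounded batch per request, ordered by key, and hand the last key back to the client so it can ask for the next chunk:

    import java.util.List;

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.FetchOptions;
    import com.google.appengine.api.datastore.KeyFactory;
    import com.google.appengine.api.datastore.Query;

    public class ChunkedWorker {
        private static final int BATCH_SIZE = 200; // made-up batch size

        // Processes one batch of "Bookmark" entities per request, resuming after
        // the key encoded in resumeToken. Returns a token the client sends back
        // to continue, or null once everything has been processed.
        public static String processChunk(String resumeToken) {
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

            Query query = new Query("Bookmark");
            query.addSort(Entity.KEY_RESERVED_PROPERTY);
            if (resumeToken != null) {
                query.addFilter(Entity.KEY_RESERVED_PROPERTY,
                        Query.FilterOperator.GREATER_THAN,
                        KeyFactory.stringToKey(resumeToken));
            }

            List<Entity> batch = datastore.prepare(query)
                    .asList(FetchOptions.Builder.withLimit(BATCH_SIZE));
            for (Entity entity : batch) {
                // ... do the actual per-entity work here ...
            }

            return batch.size() < BATCH_SIZE
                    ? null
                    : KeyFactory.keyToString(batch.get(batch.size() - 1).getKey());
        }
    }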

    But this limitation has a silver lining. Though you are limited in how long a request can take to execute, you are currently not limited in the number of simultaneous requests (you can get up to 32 simultaneous threads on a free account, and can go higher if you want to pay). Theoretically you should be able to scale horizontally to as many requests per second as you want. There are a few other factors, like how you architect your data in the datastore, which can still limit how many operations per second you can do. Some of the other GAE limits are listed here.

    You have to use Google’s datastore APIs to persist data to get the most out of GAE. You could still use S3, SimpleDB or your favorite cloud DB storage, but the high latency would probably kill your app first.

    The Datastore is where GAE gets very interesting and departs significantly from the traditional Java webapp development experience. Here are a few quick things which took me a while to figure out.

    1. Datastore is schemaless (I’m sure you knew this already)
    2. It’s built over Google’s BigTable infrastructure. (you knew this as well…)
    3. It looks like SQL, but don’t be fooled. It’s so crippled that you won’t recognize it from two feet away. After a week of playing with GAE I know there are at least 2 to 3 ways to query this data, and the various syntaxes are confusing. (I’ll give an update once I figure this whole thing out)
    4. You can have the Datastore generate keys for your entities, or you can assign them yourself. If you decide to create your own keys (which has its benefits, BTW) you need to figure out how to build the keys in such a way that they don’t collide with unintended consequences (a small example follows this list).
    5. Creation of a “uniqueness” index is not supported.
    6. Nor can you do joins across tables. If you really need a join, you have to do it in the app. I heard there are some folks coming out with libraries which can fake a relational data model over the datastore… I don’t have more information on that right now.
    7. The amount of datastore CPU (in addition to regular app CPU) you use is monitored. So if you create a lot of indexes, you better be ready to pay for it.
    8. Figuring out how to index your data isn’t rocket science. Single-column indexes are built for you automatically; multi-column indexes need to be configured in the app. The GAE sandbox running on your desktop/laptop figures out which indexes you need by monitoring your queries, so for the most part you may not have to do much. When you upload the app, the config file declaring which indexes are required is uploaded with it. In GAE Python, there are ways to tell Google not to index some fields.
    9. Index creation on GAE takes a long time for some reason, even for small tables. This is a known issue, but not a show stopper in my personal opinion.
    10. Figuring out how to breakup/store/normalize/denormalize your data to best use GAE’s datastore would probably be one of the most interesting challenges you would have to deal with.
    11. The problem gets trickier if you have a huge amount of data to process in each request. There are strict CPU resource timeouts which currently look slightly buggy to me (or work in a way I don’t understand yet). If a single query takes more than a few seconds (5 to 10) it generally fails for me. And if the same HTTP request generates a lot of datastore queries, there is a 30 second limit on the HTTP request, after which the request is killed.
    12. From what I understand, the datastore is optimized for reads, and writes are expensive. Not only do indexes have to be updated, each write needs to be written to disk before the operation is considered complete. That imposes physical limits on how fast you can process data if you are planning to write a lot of it. Breaking data up into multiple tables is probably a better way to go.
    13. There is no way to drop a table or a datastore. You currently have to delete the data 1000 rows at a time from your app. This is one of the biggest issues brought up by developers, and it’s possible it will be fixed soon.
    14. There is no way to delete an application either…
    15. There is a Python script to upload large amounts of data to the GAE datastore. Unfortunately, one needs to understand what the data model designed for the Java app looks like in the Python world. This has been a blocker for me, but I’m sure I could have figured it out using Google Groups if I really wanted to.
    16. If I understand correctly, the datastore (which uses the BigTable architecture) is built on top of 4 large bigtables.
    17. If I understand correctly, though GAE’s datastore architecture supports transactions, its master-master replication across multiple datacenters has some caveats which need to be understood. GAE engineers explained that two-phase commit and Paxos handle data consistency across datacenters better, but suffer from heavy latency, which is why they are not used for GAE’s datastore currently. They hope/plan to provide some kind of support for a more reliable data consistency mechanism.
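
    To make item 4 above concrete, here is roughly what self-assigned keys look like with the low-level datastore API; the “Bookmark” kind, the property names and the key-naming scheme are all made up for illustration:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;

    public class BookmarkStore {
        private final DatastoreService datastore =
                DatastoreServiceFactory.getDatastoreService();

        // Self-assigned key: kind "Bookmark", key name derived from user + URL.
        // Prefixing with the user id is one way to keep names from colliding.
        public void save(String userId, String url) {
            Entity bookmark = new Entity("Bookmark", userId + "|" + url);
            bookmark.setProperty("user", userId);
            bookmark.setProperty("url", url);
            bookmark.setProperty("created", System.currentTimeMillis());
            datastore.put(bookmark);
        }

        public Entity load(String userId, String url) throws EntityNotFoundException {
            Key key = KeyFactory.createKey("Bookmark", userId + "|" + url);
            return datastore.get(key);
        }
    }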

    Other than the Datastore, I’d like to mention a few other elements which are central to the GAE architecture.

    1. Memcache support is built in. I was able to use it within a minute of figuring out that it’s possible. Hitting the datastore is expensive, and if you can get by with just using memcache, that’s what is recommended (a small sketch follows this list).
    2. Session persistence exists and sessions are persisted to both memcache and the datastore. However it’s disabled by default, and GAE engineers recommend staying away from it. Managing sessions is expensive, especially if you are hitting the datastore very frequently.
    3. Apps can send emails (there are paid/free limits)
    4. Apps can make HTTP requests to outside world using URLConnection
    5. Apps get Google authentication support out of the box. Apps don’t have to manage user information or build a login application/module to create user-specific content.
    6. Currently GAE doesn’t provide a way to choose which datacenter (or country) to host your app from (Amazon allows users to choose US or EU). They are actively working to solve this problem.
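
    As a small illustration of point 1, a read-through cache in front of the datastore looks something like this (the key naming and the placeholder lookup are made up):

    import com.google.appengine.api.memcache.MemcacheService;
    import com.google.appengine.api.memcache.MemcacheServiceFactory;

    public class CachedLookup {
        private final MemcacheService cache =
                MemcacheServiceFactory.getMemcacheService();

        // Check memcache first and only fall back to the (expensive) datastore
        // on a miss, populating the cache on the way out.
        public Object get(String key) {
            Object value = cache.get(key);
            if (value == null) {
                value = loadFromDatastore(key);
                if (value != null) {
                    cache.put(key, value);
                }
            }
            return value;
        }

        private Object loadFromDatastore(String key) {
            return null; // placeholder for a real datastore query
        }
    }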

    That’s all for now; I’ll keep you updated as things move along. If you are curious about something very specific, please do leave a comment here or at the GAE Java Google group.

    Experimenting with SimpleDB (Flagthis.com)

    A few years ago I wrote a simple online bookmarking tool called Flagthis. The tool allows one to bookmark sites using a JavaScript bookmarklet from the bookmark tab. The problem it tries to solve is that most links people bookmark are never used again if they are not checked out within the next few days. The tool helps the user ignore bookmarks which were not used in the last 30 days.

    The initial version of this tool used a MySQL database. The original application architecture was very simple, and other than the database it could have scaled horizontally. Over the weekend I played a little with SimpleDB and was able to convert my code to use SimpleDB in a matter of hours.


    Here are some things I observed during my experimentation

    1. It’s not a relational database.
    2. You can’t do joins in the database. If joins have to be done, they have to be done in the application, which can be very expensive.
    3. De-normalizing data is recommended.
    4. Schemaless: you can add new columns (which are actually just new row attributes) anytime you want.
    5. You have to create your own unique row identifiers; SimpleDB doesn’t have a concept of auto-increment (see the sketch after this list).
    6. All attributes are auto-indexed. I think in Google App Engine you had to specify which columns need indexing. I’m wondering if this increases the cost of using SimpleDB.
    7. Data is automatically replicated across Amazon’s huge SimpleDB cloud. But they only guarantee something called “eventual consistency”, which means data which is “put” into the system is not guaranteed to be available on the next “get”.
    8. I couldn’t find a GUI-based tool to browse my SimpleDB data the way some S3 browsers do. I’m sure someone will come up with something soon. [Updated: Jeff blogged about some SimpleDB tools here]
    9. There are limits imposed by SimpleDB on the amount of data you can put in. Look at the tables below.
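
    My experiment predates the AWS SDK for Java, but with that SDK the put and the 30-day select would look roughly like the sketch below. The domain name, attribute names and the LinkStore class are all made up, and dates are stored as ISO-8601 strings so that SimpleDB’s lexicographic comparison works:

    import java.util.Arrays;
    import java.util.List;
    import java.util.UUID;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
    import com.amazonaws.services.simpledb.model.Item;
    import com.amazonaws.services.simpledb.model.PutAttributesRequest;
    import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
    import com.amazonaws.services.simpledb.model.SelectRequest;

    public class LinkStore {
        private static final String DOMAIN = "flagthis_links"; // made-up domain name
        private final AmazonSimpleDB sdb;

        public LinkStore(String accessKey, String secretKey) {
            this.sdb = new AmazonSimpleDBClient(new BasicAWSCredentials(accessKey, secretKey));
        }

        // The item name doubles as the unique row identifier, so generate one ourselves.
        public void saveLink(String userId, String url, String isoCreatedDate) {
            String itemName = UUID.randomUUID().toString();
            sdb.putAttributes(new PutAttributesRequest(DOMAIN, itemName, Arrays.asList(
                    new ReplaceableAttribute("user", userId, true),
                    new ReplaceableAttribute("url", url, true),
                    new ReplaceableAttribute("created", isoCreatedDate, true))));
        }

        // "Flagged in the last 30 days" becomes a string comparison on the created
        // attribute; cutoff is an ISO-8601 date such as "2009-01-01T00:00:00Z".
        public List<Item> recentLinks(String userId, String cutoff) {
            String query = "select * from `" + DOMAIN + "` where user = '" + userId
                    + "' and created > '" + cutoff + "'";
            return sdb.select(new SelectRequest(query)).getItems();
        }
    }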

     

    Attribute                             Maximum
    domains                               100 active domains
    size of domains                       10 GB
    attributes per domain                 250,000,000
    attributes per item                   256 attributes
    size per attribute                    1024 characters

     

    Attribute                             Maximum
    items returned in a query response    250 items
    seconds a query may run               5 seconds
    attribute names per query predicate   1 attribute name
    comparisons per predicate             10 operators
    predicates per query expression       10 predicates

    Other related discussions (do check out CouchDB)

    Amazon launches CloudFront

    Update as of Feb 28th 2009: Contrary to my initial speculation, Amazon CloudFront is nothing like Akamai WAA. This is very depressing to me as an Akamai/WAA customer… I’m sure folks at Akamai don’t share this opinion. CloudFront seems to be a glorified S3 solution which is mostly useful for static (non-dynamic) content.

    ————-

    Amazon has finally opened the doors of its new CDN (Content Delivery Network), called CloudFront. But instead of building a completely new product, it has interestingly expanded its S3 network to include content replication for lower-latency content delivery. By not reinventing a whole new way of uploading data to the CDN network, Amazon has seriously cut down the cost for end users to try out this technology.

    Most of the CDNs I’ve investigated do very well with static content which needs to be periodically refreshed somehow.

    There is at least one service from Akamai called WAA (Web Application Accelerator) which seems to understand the importance of accelerating extremely dynamic content using intelligent routing and points of presence closer to the end user. WAA doesn’t put the content closer to the end user, but provides an extremely efficient conduit for this traffic where Akamai controls both ends of the network by placing a POP in front of the client and the server. By doing this Akamai can take control of TCP/IP window sizes within its network and provide a lower-latency, higher-bandwidth response to the customer. In addition to all this, Akamai also provides an option for some data (as defined in the HTTP headers, or the WAA configuration) to be cached for a longer duration.

    Though Amazon might be doing replication as well, CloudFront may be closer to Akamai’s WAA model than you might think. It’s kind of obvious that if the data is going to change all the time, there has to be some kind of master-slave concept, and it’s also clear that if many people are accessing that data around the world, it has to be transported through a very efficient, high-bandwidth network to the various Amazon points of presence around the world. And finally, just like Akamai’s WAA model, it probably caches content so it can serve it directly from the local cache in case the object hasn’t changed on the master since the last time it was retrieved.

    A month ago I went shopping for alternatives to Akamai’s WAA and didn’t find any. I suspect CloudFront changes that a little bit. One significant difference between Akamai WAA and most CDNs out there, including CloudFront, is that relatively little work needs to be done by the developer to integrate with WAA. This is not true with most CDNs, and certainly not true for CloudFront if you are not already on S3. But it does change the dynamics of this industry.

    Scaling Technorati – 100 million blogs indexed every day

    Indexing 100 million blogs with over 10 billion objects, and with a user base which is doubling every six months, Technorati has an edge over most blog search engines. But they are much more than search, and any Technorati user can tell you that. I recommend you read John Newton’s interview with David Sifry, which I found fascinating. Here are the highlights from the interview if you don’t have time to read the whole thing.

    • Current status of technorati
      • 1 terabyte a day added to its content storage
      • 100 million blogs
      • 10 billion objects
      • 0.5 billion photos and videos
      • Data doubling every six months
      • Users doubling every six months
    • The first version was supposed to be for tracking temporal information on a low budget.
      • That version put everything in a relational database, which was fine since the index sizes were smaller than physical memory
      • It worked fine till about 20 million blogs
    • The next generation took advantage of parallelism.
      • Data was broken up into shards
      • Synced up frequently between servers
      • The database size reached the largest known OLTP size.
        • Writing as much data as reading
        • Maintaining data integrity was important
          • This put a lot of pressure on the system
    • The third generation
      • Shards evolved
        • The shards were based on time instead of urls
        • They moved content to special-purpose databases instead of a relational database
      • Don’t delete anything
      • Just move shards around and use a new shard for latest stuff
    • Tools used
      • Greenplum – enables enterprises to quickly access massive volumes of critical data for in-depth analysis. Purpose-built for high-performance, large-scale BI, Greenplum’s family of database products comprises solutions suited to installations ranging from departmental data marts to multi-terabyte data warehouses.
    • Should have done sooner
      • Should have invested in click stream analysis software to analyze what clicks with the users
        • Can tell how much time users spend on a feature