Scaling Technorati - 100 million blogs indexed every day
Indexing 100 million blogs and more than 10 billion objects, with a user base that doubles every six months, Technorati has an edge over most blog search engines. But it is much more than search, as any Technorati user can tell you. I recommend reading John Newton's interview with David Sifry, which I found fascinating. Here are the highlights from the interview if you don't have time to read the whole thing:
- Current status of Technorati
- 1 terabyte a day added to its content storage
- 100 million blogs
- 10 billion objects
- 0.5 billion photos and videos
- Data doubling every six months
- Users doubling every six months
- The first version was meant to track temporal information on a low budget.
- That version put everything in a relational database, which was fine while the index sizes were smaller than physical memory
- It worked fine until about 20 million blogs
- The next generation took advantage of parallelism.
- Data was broken up into shards
- Shards were synced frequently between servers
- The database reached the largest known OLTP size.
- It was writing as much data as it was reading
- Maintaining data integrity was important
- This put a lot of pressure on the system
- The third generation
- Shards evolved
- The shards were based on time instead of URLs
- They moved content to special-purpose databases instead of a relational database
- Don't delete anything
- Just move shards around and use a new shard for latest stuff
- Tools used
- Greenplum - enables enterprises to quickly access massive volumes of critical data for in-depth analysis. Purpose-built for high-performance, large-scale BI, Greenplum's family of database products comprises solutions suited to installations ranging from departmental data marts to multi-terabyte data warehouses.
- Things they should have done sooner
- Should have invested in click stream analysis software to analyze what clicks with the users
- Can tell how much time users spend on a feature
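
To make the jump from URL-based shards to time-based shards more concrete, here is a minimal Python sketch. The interview does not describe how Technorati actually implemented this, so everything here is an assumption for illustration: the names `url_shard`, `time_shard` and `TimeShardedStore`, the one-shard-per-month granularity, and the in-memory dictionaries standing in for real databases.

```python
import zlib
from datetime import datetime, timezone


def url_shard(url: str, num_shards: int) -> int:
    """URL-based routing (second generation): a stable hash of the URL
    decides the shard, so new posts land on every shard."""
    return zlib.crc32(url.encode("utf-8")) % num_shards


def time_shard(posted_at: datetime) -> str:
    """Time-based routing (third generation): the post's timestamp decides
    the shard. One shard per UTC month is an assumed granularity."""
    return posted_at.astimezone(timezone.utc).strftime("%Y-%m")


class TimeShardedStore:
    """Hypothetical in-memory stand-in for a set of time-based shards."""

    def __init__(self) -> None:
        self.shards: dict[str, list[dict]] = {}

    def add(self, url: str, posted_at: datetime, body: str) -> str:
        # New content always goes to the shard for its own time window;
        # older shards are never rewritten or deleted.
        key = time_shard(posted_at)
        self.shards.setdefault(key, []).append(
            {"url": url, "posted_at": posted_at, "body": body}
        )
        return key

    def recent(self, limit: int) -> list[dict]:
        # Walk shards from newest to oldest and stop as soon as we have
        # enough posts, so a "latest posts" query touches few shards.
        results: list[dict] = []
        for key in sorted(self.shards, reverse=True):
            posts = sorted(
                self.shards[key], key=lambda p: p["posted_at"], reverse=True
            )
            results.extend(posts)
            if len(results) >= limit:
                break
        return results[:limit]


if __name__ == "__main__":
    store = TimeShardedStore()
    store.add("http://a.example/1", datetime(2007, 3, 5, tzinfo=timezone.utc), "old")
    store.add("http://a.example/2", datetime(2007, 4, 1, tzinfo=timezone.utc), "new")
    print(sorted(store.shards))
    print([p["body"] for p in store.recent(1)])
```

With hash-based routing, fresh writes are spread across all shards and no shard ever becomes read-only. With time-based routing, only the newest shard takes writes, which fits the "don't delete anything, just move shards around" approach: older shards can be moved to other machines without touching the write path.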