My interesting project for this quarter was writing a search engine to index blog entries. Attempting something like this without knowing anything about the resources required would normally be risky and stupid. But since this was just an educational project to understand search technology and to learn Java, capacity planning was the last thing on my mind.
Based on the resources I had, it was pretty clear to me that I couldn't build another Yahoo or Google. Besides, who needs another one of those anyway? Also, when I started working on this project, Google hadn't released their blog search engine. Needless to say, indexing blogs looked pretty interesting. How difficult could it be to build another Technorati anyway? A few servers running a crawler and a few database servers are all one needs, with a nice front end written in pretty PHP.
If a search engine were just about searching for text in a database, it would be called a database. To build a search engine one needs to solve a few other parts of the puzzle. One of them is the "crawler". If I had done my homework before I started, I would have found out that quite a few crawlers are already available for free.
A crawler's job is to crawl around the internet looking for new URLs, which in this case need to be RSS/Atom feeds. An unintelligent bot would probably go around in no particular order following new links. The crawler I wrote had a few optimizations which helped it collect a massive number of URLs very quickly.
- Supply it with very rich seed URLs
- Mine the feeds themselves for links to new feeds
- Check sites like weblogs.com for new URLs periodically
- Look for words like 'rss' or 'atom' in the URL and exclude URLs containing 'gif' or 'jpg'. This helps identify the type of object without downloading it
- One more thing which I should have done, but didn't, was to use the HTTP HEAD method to get the content type before downloading the complete URL
- When a URL is downloaded, it is parsed to look for more href links to other URLs.
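The heuristics in the list above can be sketched roughly as follows. This is not the crawler I actually wrote, just a minimal illustration using only the Java standard library. The class name `FeedUrlHeuristics`, the method names, the 5 second timeouts and the regex are all assumptions made for this example. The substring checks are deliberately crude (a URL containing 'anatomy' would match 'atom'), and a regex is no substitute for a real HTML parser.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and thresholds are illustrative only.
final class FeedUrlHeuristics {
    // Rough match for href="..." or href='...'; not a full HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Guess from the URL text alone whether it is worth fetching as a feed. */
    static boolean looksLikeFeed(String url) {
        String u = url.toLowerCase(Locale.ROOT);
        if (u.contains("gif") || u.contains("jpg")) {
            return false; // almost certainly an image, skip it
        }
        return u.contains("rss") || u.contains("atom");
    }

    /** True if a Content-Type header value suggests an RSS/Atom/XML document. */
    static boolean isFeedContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        String c = contentType.toLowerCase(Locale.ROOT);
        return c.contains("rss") || c.contains("atom") || c.contains("xml");
    }

    /** Send a HEAD request and return the Content-Type header (may be null). */
    static String headContentType(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no body is downloaded
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }

    /** Collect the href targets found in a downloaded page. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

A crawler loop could call `looksLikeFeed` first because it costs nothing, and fall back to `headContentType` plus `isFeedContentType` only for URLs the cheap check can't decide.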
Writing a crawler is simple because there are only so many HTTP versions. RSS is another story. Everyone and their friends are coming out with their own RSS extensions. Rome is how I got away with it. These guys are going a long way to build a generic RSS reader, and they make it a lot easier to add your own modules to read new add-ons.
Unfortunately, one thing I haven't investigated enough is the differences between the various feed formats. Rome makes it so transparent that I didn't have to do much digging to accept the different formats. However, since it doesn't make sense to crawl multiple feeds (in different formats) of the same site, I'll have to do this exercise at some point to figure out which feed format is better for indexing purposes.
As I said before, Rome allows developers to add their own modules to support reading non-standard (or newer) XML tags which Rome doesn't support by default. For example, I noticed some feeds had two versions of 'description', one longer than the other. When I noticed that Rome was picking the wrong one, I wrote a module to extract the longer description. Similarly, I also wrote an iTunes content extractor which helped Rome understand iTunes content in feeds.
Various organizations have done studies on this, and they have all concluded that we live in a very polluted blogging world, where people use blogs to skew search engine results.
Weblogs.com is a good place to see these spam blogs in action. Some of the interesting patterns I used to reduce the effect of spam were:
- Ignore blogs which publish a large number of items at a time. Except for some news websites, most blogs publish one item at a time over a long period
- I noticed that spam bloggers are very uncreative when it comes to naming their blogs. Most of them have a '-' in the blog name or blog domain name. Skipping such domains helped a lot in reducing spam
- Free blog services are the largest source of spam blogs. Blogspot turned out to be the most annoying one when I analysed this problem
- If I had some kind of language or phrase analyser, it could probably catch more spam, since most of the items in the same spam feed look very similar to each other. I didn't have time to do this, but I'm sure one of you guys will do it eventually
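Here is a rough sketch of how the patterns above might look in code. The class `SpamHeuristics`, its method names and the threshold of 20 items are all made up for illustration; I haven't tuned any of these numbers. The `similarity` method is only one possible take on the phrase analyser I never wrote: it uses plain word-set overlap (Jaccard similarity), which is an assumption on my part and far simpler than real language analysis.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; names and thresholds are illustrative only.
final class SpamHeuristics {
    // Assumed cut-off: more new items than this in one fetch counts as a burst.
    static final int MAX_NEW_ITEMS_PER_FETCH = 20;

    /** True if the blog title or the feed's host name contains a hyphen. */
    static boolean hasHyphen(String feedUrl, String blogTitle) {
        String host = URI.create(feedUrl).getHost(); // null if the host can't be parsed
        boolean inHost = host != null && host.contains("-");
        boolean inTitle = blogTitle != null && blogTitle.contains("-");
        return inHost || inTitle;
    }

    /** True if a single fetch turned up suspiciously many new items. */
    static boolean publishesInBursts(int newItemsInOneFetch) {
        return newItemsInOneFetch > MAX_NEW_ITEMS_PER_FETCH;
    }

    /** Lower-cased set of words; the split treats only ASCII letters/digits as word characters. */
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>(
            Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        out.remove(""); // split can yield an empty leading token
        return out;
    }

    /** Jaccard similarity of two items: shared words divided by all distinct words. */
    static double similarity(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        if (union.isEmpty()) {
            return 0.0;
        }
        wa.retainAll(wb); // wa now holds the intersection
        return (double) wa.size() / union.size();
    }
}
```

A feed whose items mostly score close to 1.0 against each other would be a candidate for the spam pile, though the cut-off would need experimenting with real data.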
This writeup is a work in progress. Here are some more notes which I still have to write about:
- You need to know databases very very very well to understand how to optimize searches
- MySQL is good, but it lacks a few critical features which could improve your search engine a lot
- Java is expensive... C/C++ would be better. But Java is very easy to develop in. Google started by using Python... so in short, it's possible
- There is a lot of crap out there...
- It's possible that 60 to 70% of the blogs pinging weblogs.com are spam
- You better be ready to handle UTF-8. I learnt it the hard way
- And even if things are in Latin script, not everything is in English
- If you are planning to index blogs, make sure you understand the languages...
- But even if you do understand the languages... you should look at all the new words these teens come up with. I swear I didn't recognize half of them as English until I really tried hard to understand them. I thought 3l33t was cool... but these are way ahead
- Distributed computing is the way to go. One computer by itself can't do much
- Make sure you have plans for scaling ready at hand. You have no idea how fast these things grow.
- And when you start doing so many HTTP requests, weird things happen on the network. On mine, the router required a reset once every day. I'd never heard of this before... and when I called in to complain about this one-year-old product, even customer support hadn't heard about it
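On the UTF-8 note above, here is a tiny illustration of the kind of thing that can go wrong. It isn't taken from my code. The class name `Utf8Decoding` is made up, and I'm assuming the feed bytes really are UTF-8. The point is only that the same bytes decode differently depending on the charset you pick, so always name the charset explicitly instead of relying on the platform default.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing why the charset has to be stated explicitly.
final class Utf8Decoding {
    /** Decode raw feed bytes as UTF-8. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Decode the same bytes as ISO-8859-1, the classic way to get mojibake. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café" with the accented letter written as an escape, encoded as UTF-8 (5 bytes).
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(raw).length());   // 4: decoded correctly
        System.out.println(decodeLatin1(raw).length()); // 5: the accented letter became two characters
    }
}
```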