- Seeding problem - DMOZ might have a good list of seed URLs for a traditional crawler, but there wasn't a DMOZ-like public data source for feeds that I could use. I ended up crawling sites that provide OPML files, and sites like Weblogs.com for new feeds.
- Spamming - It dawned on me pretty fast that sites like Weblogs.com were probably not the best place to look for quality feeds. Every Tom, Dick and Harry was pinging Weblogs.com, and so were all the spammers. The spam statistics blew me away: 60% of the blogs were spam, according to a few analysts in 2005. I'm sure this number has gone up since.
- The Size - I thought just crawling feeds would be easy to manage, since that doesn't require images, CSS, etc. to be archived. But I was so wrong. I crossed 40GB of storage within a week or two of crawling. I could always add a new hard disk, but without the ability to detect spam reliably and without a scalable search platform, this blog search engine was DOA.
- I also underestimated the number of posts I was collecting. I had to widen the data types of some of the IDs in Java and MySQL.
- Feed processing/Searching - MySQL is fast, but in the hands of a totally untrained professional like me, it can be a ticking time bomb. Though I made a good start, I struggled to get a grip on how indexing and inner/outer joins work. I had underestimated the complexities of databases.
- Threading - I over-designed the threaded application without investing enough time to understand how threading works in Java. That, combined with a few caching features I created, became the memory leak I so badly wanted to avoid. It was a mess :)
- I liked PHP for its simplicity, and Java for its speed. Unfortunately, my attempt to design the UI in PHP and leave the backend in Java didn't work very well because of my lack of experience with PHP-Java interoperability.
- Feed update frequency - Some feeds update faster than others. Working out when to crawl a feed next is an interesting problem in itself, especially because some feeds update more often at certain times of day than at others. Apparently Google Reader's backend checks for feed updates about once an hour. So if you have 10 million feeds to crawl, that's about 2,778 feed requests per second (10,000,000 feeds / 3,600 seconds). There is no way I can do that from a single machine in my basement.
- The most annoying problem I had was not really my fault :) . I own a Belkin wireless router which became extremely unstable when my crawlers ran. I had to resort to daily reboots of this device to solve the problem. And on busy days it needed two.
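To make the feed update frequency problem a little more concrete, here is a minimal sketch in Python (used purely for illustration; my crawler was not written in it). It reproduces the request-rate arithmetic above and shows one simple way to adapt a feed's polling interval: halve it when a poll finds new posts, double it when it doesn't. The names `required_request_rate` and `FeedSchedule`, and the 15-minute and 1-day limits, are assumptions made up for this example. It also ignores the time-of-day effect entirely.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600


def required_request_rate(feed_count: int, interval_seconds: float) -> float:
    """Average requests per second needed to poll every feed once per interval."""
    return feed_count / interval_seconds


@dataclass
class FeedSchedule:
    """Per-feed polling interval that adapts to how often the feed changes."""

    interval: float = 3600.0       # current gap between polls, in seconds
    min_interval: float = 900.0    # hypothetical floor: 15 minutes
    max_interval: float = 86400.0  # hypothetical ceiling: 1 day

    def record_poll(self, had_new_posts: bool) -> float:
        """Update the interval after a poll and return the new value."""
        if had_new_posts:
            # Active feed: check back sooner, but never below the floor.
            self.interval = max(self.min_interval, self.interval / 2)
        else:
            # Quiet feed: back off, but never beyond the ceiling.
            self.interval = min(self.max_interval, self.interval * 2)
        return self.interval


if __name__ == "__main__":
    rate = required_request_rate(10_000_000, SECONDS_PER_HOUR)
    print(f"{rate:.0f} requests per second")

    busy = FeedSchedule()
    print(busy.record_poll(had_new_posts=True))   # interval shrinks
    quiet = FeedSchedule()
    print(quiet.record_poll(had_new_posts=False)) # interval grows
```

The point of backing off on quiet feeds is that most of those 10 million hourly requests would return nothing new, so a scheme like this should cut the request rate considerably, though by how much depends on the feeds.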
The first reason I'm not embarrassed to blog about my mistakes is that I'm not a developer to begin with. The second is that I'm about to take another shot at it to see if I can do it better this time. The objective is not to build another search engine, but to understand and learn from my mistakes and do it better.
The first phase of this learning experience resulted in what you currently see on blogofy.com. The initial prototype displayed there does limited crawling to gather feed updates. The feed update algorithm is already a little smarter (it updates some feeds more often than others). The quality of the feeds should also be better because there is a human behind the engine who manually adds feeds to the database. I also realized soon after my first attempt that I should have investigated JSON to make Java and PHP talk to each other. In the current version of blogofy the core engine is in PHP, except for the indexing/storage engine, which will soon move to Solr (which is Java based). PHP has JSON-based hooks to talk to Solr, which seem to work very well. Solr, in case you didn't know, is a very fast Lucene-based search engine that does much better than MySQL for the kind of search operations I would be doing. And yes, I replaced the Belkin router with an Apple wireless router... so let's see how that works out.
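For anyone curious what talking to Solr over JSON looks like, here is a rough sketch, again in Python for illustration rather than the PHP that blogofy actually uses. It builds a query URL for Solr's `select` handler with `wt=json` and pulls the documents out of the JSON reply. The core name `blogs`, the `id` and `title` fields, and the sample response are all made up for this example; only the general response shape (`response` containing `numFound` and `docs`) follows Solr's standard JSON output.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a Solr select URL that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def extract_docs(response_text: str) -> list:
    """Return the list of matching documents from a Solr JSON response."""
    payload = json.loads(response_text)
    return payload["response"]["docs"]


# Hand-written sample reply; a real one would come from an HTTP GET on the URL.
SAMPLE_RESPONSE = json.dumps({
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "post-1", "title": "Hello"}],  # hypothetical fields
    },
})

if __name__ == "__main__":
    # "blogs" is a hypothetical core name on a local Solr instance.
    print(build_select_url("http://localhost:8983/solr/blogs", "title:solr", rows=5))
    for doc in extract_docs(SAMPLE_RESPONSE):
        print(doc["id"], doc["title"])
```

Because both sides only exchange plain JSON over HTTP, the frontend language doesn't need to know anything about Java, which is exactly the interoperability problem I ran into the first time.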
I'll post more updates if I make progress. If any of you have ideas on what else I should be doing, please let me know.