Archive for the ‘interesting’ Category

Google address translation

Friday, January 13th, 2006

John Resig has come up with an example of how to translate a physical address into a lat/long pair using a geocoder library. The examples he provided work only in the US and Canada.

H.R.3402: Shutting down anonymous posts on the internet?

Monday, January 9th, 2006


Declan McCullagh mentions that H.R.3402 has provisions which make it a criminal offence to post a message without disclosing your true identity. The prohibition is part of the Violence Against Women and Department of Justice Reauthorization Act.

The prohibition apparently only restricts annoying anonymous messages. But who decides what’s annoying and what’s not? What’s not annoying to you could be extremely repugnant to me.

A lot of web discussion forums, including services like Slashdot which allow anonymous comments, would soon be full of criminals. And if this is really acted upon and users are forced to create accounts on each of these web services just to post a comment, how many of these services will you be posting your comments on? Personally I have a Slashdot account, and I rarely leave anonymous messages, but if someone comes up with a new service tomorrow which forces me to create an account with a new login name and password, I’m pretty sure I’ll be thinking twice about it.

If you are one of those who don’t mind creating accounts, but use the same or similar passwords everywhere, this would be a good opportunity for password harvesters to take your passwords.

Remember the PGP export embargo? Did that stop the world from using it? Is it really possible to enforce this on the internet if the law applies only to the US?

If you have read the bill or have some more information on this law, I’d be interested to find out more.

bash on OS X

Thursday, December 22nd, 2005

Funny thing happened… I had a directory with 2 dmp files. One was compressed using gzip, and one was uncompressed. When I typed gzip -d and pressed tab, it listed only the compressed file, which had a name ending in gz. No matter how many times I pressed tab, it didn’t pick the other file, which I knew existed in that directory. So obviously I, being me, typed gzip without -d on the shell and pressed tab… and voila, it listed the uncompressed file and didn’t list the other file, which was already compressed.

I’ve been working with bash for a long time, and hadn’t noticed such a feature before… I was pleasantly surprised to find it.

Fun writing a search engine

Tuesday, December 13th, 2005

Introduction

My interesting project for this quarter was writing a search engine to index blog entries. Attempting something like this without knowing anything about the resources required would probably be risky and stupid. But since this was just an educational project to understand search technology and to learn Java, capacity planning was the last thing on my mind.

Based on the resources I had, it was pretty clear to me that I couldn’t build another Yahoo or Google. Besides, who needs another one of those anyway? Also, when I started working on this project, Google hadn’t released their blog search engine. Needless to say, indexing blogs looked pretty interesting. How difficult could it be to build another Technorati anyway? A few servers running a crawler and a few database servers is all one needs, with a nice front end written in pretty PHP.

Crawling

If a search engine were just about searching for text in a database, then it would be called a database. To build a search engine one needs to solve a few other parts of the puzzle. One of them was the “crawler”. If I had done my homework before I started off, I would have found out that there are quite a few crawlers already available for free.

A crawler’s job is to crawl around on the internet looking for new URLs, which in this case need to be RSS/Atom feeds. An unintelligent bot would probably go around in no particular order following new links. The crawler I wrote had a few optimizations which helped it collect a massive number of URLs very quickly.

  • Supply it with very rich seed URLs
  • Use the feeds themselves as a source of new feeds
  • Check sites like weblogs.com for new URLs periodically
  • Look for words like ‘rss’ or ‘atom’ in the URL and exclude URLs containing ‘gif’ or ‘jpg’, which helps identify the type of object without downloading it
  • One more thing which I should have done, but didn’t, was to use an HTTP HEAD request to get the content type before downloading the complete URL.
  • When a URL is downloaded, parse it to look for more href links to other URLs.
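
The heuristics in the list above can be sketched roughly as follows. This is an illustration in Python, not the Java crawler described in this post; the function and class names (looks_like_feed, head_content_type, extract_links, LinkCollector) and the hint lists are assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

# Assumed hint lists, taken from the words mentioned in the post.
FEED_HINTS = ("rss", "atom")
SKIP_HINTS = ("gif", "jpg")


def looks_like_feed(url):
    """Guess from the URL text alone whether it is worth fetching as a feed."""
    lowered = url.lower()
    if any(hint in lowered for hint in SKIP_HINTS):
        return False
    return any(hint in lowered for hint in FEED_HINTS)


def head_content_type(url, timeout=10):
    """Fetch only the headers with a HEAD request and return the content type."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs found in href attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(base_url, html_text):
    """Return every http(s) link in a downloaded page, resolved against its URL."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links
```

A substring check like this is crude (a URL that merely contains ‘gif’ somewhere is skipped too), which is one reason the HEAD request is the more reliable test when the extra round trip is affordable.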

Identifying feeds

Writing a crawler is simple because there are only so many HTTP versions. RSS is another story. Everyone and their friends are coming out with their own RSS extensions. Rome is how I got away with it. These guys are going a long way to build a generic RSS reader, and they make it a lot easier to add your own modules to it to read new add-ons.

Unfortunately, one thing which I haven’t investigated enough is the differences between the various feed formats. Rome makes it so transparent that I didn’t have to do much digging to accept the different formats. However, since it doesn’t make sense to crawl multiple feeds (in different formats) of the same site, I’ll have to do this exercise at some point to figure out which feed format is better for indexing purposes.

As I said before, Rome allows developers to add their own modules to support reading non-standard (or newer) XML tags which Rome doesn’t support by default. For example, I noticed some feeds had two versions of ‘description’, one longer than the other. When I noticed that Rome was picking the wrong one, I wrote a module to extract the longer description. Similarly, I also wrote an iTunes content extractor which helped Rome understand iTunes content in feeds.
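
The idea of picking the longer description can be sketched as below. This is a standalone Python illustration using the standard library XML parser, not a Rome module. Which tags count as a “description” (CANDIDATE_TAGS) is an assumption for the example, as are the function names.

```python
import xml.etree.ElementTree as ET

# Assumed set of element names that may carry an item's body text.
CANDIDATE_TAGS = {"description", "encoded", "summary", "content"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def longest_description(item):
    """Return the longest text among the item's description-like children."""
    texts = [
        (child.text or "").strip()
        for child in item
        if local_name(child.tag) in CANDIDATE_TAGS
    ]
    return max(texts, key=len, default="")
```

For an item that carries both a short description element and a longer content:encoded element, this returns the text of the longer one.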

Spam blogs

Various organizations have studied this, and they have all concluded that we live in a very polluted blogging world, where people have been using blogs to skew search engine results.

Weblogs.com is a good place to see these spam blogs in action. Some of the interesting patterns which I used to reduce the effect of spam were:

  • Ignore blogs which publish a large number of items at a time. Except for some news websites, most blogs publish one item at a time over a long period of time
  • I noticed that spam bloggers are very uncreative when it comes to finding a name for the blog. Most of them have a - in the blog name or blog domain name. Skipping such domains helped a lot in reducing spam.
  • Free blog services are the largest source of spam blogs. Blogspot turned out to be the most annoying one when I analysed this problem
  • If I had some kind of language or phrase analyser, it’s possible that most of the items in the same spam feed would turn out to look very similar to each other. I didn’t have time to do this, but I’m sure one of you guys will do it eventually
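
A rough sketch of these spam signals, written in Python for illustration rather than taken from the Java project. The threshold MAX_ITEMS_PER_PING, the FREE_HOSTS tuple and the function names are assumptions. The similarity check stands in for the phrase analyser that was never built, using the standard library difflib module.

```python
from difflib import SequenceMatcher
from itertools import combinations
from urllib.parse import urlparse

MAX_ITEMS_PER_PING = 20           # assumed cut-off for "a large number of items"
FREE_HOSTS = (".blogspot.com",)   # assumed list of free blog hosts to flag


def spam_signals(feed_url, blog_title, new_item_count):
    """Return the names of the heuristics that fire for a feed."""
    host = urlparse(feed_url).hostname or ""
    signals = []
    if new_item_count > MAX_ITEMS_PER_PING:
        signals.append("bulk-publish")
    if "-" in host or "-" in blog_title:
        signals.append("hyphenated-name")
    if host.endswith(FREE_HOSTS):
        signals.append("free-host")
    return signals


def average_similarity(texts):
    """Mean pairwise similarity (0.0 to 1.0) between the items of one feed."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

Each signal on its own produces false positives (plenty of legitimate blogs have a hyphen in their name), so a real filter would probably weigh several of them together rather than skip a feed on one.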

Other notes

This writeup is a work in progress… Here are some more notes which I still have to write about:

  • You need to know databases very, very well to understand how to optimize searches
  • MySQL is good, but a few critical features which it doesn’t have could improve your search engine a lot
  • Java is expensive… C/C++ would be better. But Java is very easy to develop in. Google started by using Python… so in short it’s possible
  • There is a lot of crap out there…
  • It’s possible that 60 to 70% of the blogs pinging weblogs.com are spam
  • You better be ready to handle UTF-8. I learnt it the hard way
  • And even if things are in Latin script, not everything is in English
  • If you are planning to index blogs, make sure you understand the languages…
  • But even if you do understand the languages… you should look at all the new words these teens come up with. I swear I didn’t recognize half of them as English until I really tried hard to understand them. I thought 3l33t was cool… but these are way ahead
  • Distributed computing is the way to go. One computer itself can’t do much
  • Make sure you have plans for scaling ready at hand. You have no idea how fast these things grow.
  • And when you start doing so many HTTP queries, weird things happen on the network. On mine, the router required a reset once every day. I’d never heard of this before… and when I called in to complain about this one-year-old product, even customer support hadn’t heard about it

Interesting movie…

Sunday, December 11th, 2005

This is an interesting movie of the space shuttle launch.