Google address translation
Friday, January 13th, 2006
John Resig has come up with an example of how to translate a physical address into a lat/long using a geocoder library. The examples he provided work only in the US and Canada.
Declan McCullagh mentions that H.R.3402 has provisions which make it a criminal offence to post a message without disclosing your true identity. The prohibition is part of the Violence Against Women and Department of Justice Reauthorization Act.
The prohibition apparently only restricts annoying anonymous messages. But who decides what’s annoying and what’s not? What’s not annoying to you could be extremely repugnant to me.
A lot of web discussion forums, including services like Slashdot which allow anonymous comments, would soon be full of criminals. And if this is really acted upon and users are forced to create accounts on each of these web services just to post a comment, how many of these services will you be posting comments on? Personally, I have a Slashdot account and I rarely leave anonymous messages, but if someone comes up with a new service tomorrow which forces me to create an account with a new login name and password, I’m pretty sure I’ll think twice about it.
If you are one of those who don’t mind creating accounts but use the same or similar passwords everywhere, this would be a good opportunity for password harvesters to take your passwords.
Remember the PGP export embargo? Did that stop the world from using it? Is it really possible to enforce this on the internet if the law applies only to the US?
If you have read the bill or have more information on this law, I’d be interested to find out more.
Funny thing happened… I had a directory with 2 dmp files. One was compressed using gzip, and one was uncompressed. When I typed gzip -d and pressed tab, it listed only the compressed file, the one with a name ending in gz. No matter how many times I pressed tab, it didn’t pick the other file, which I knew existed in that directory. So obviously, me being me, I typed gzip without -d on the shell and pressed tab… and voila, it listed the uncompressed file and didn’t list the other file, which was already compressed.
I’ve been working with bash for a long time, and hadn’t noticed such a feature before… I was pleasantly surprised to find it.
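The filtering I saw can be sketched in a few lines of Python. This is only an illustration of the behaviour, not how bash itself implements it; the function name gzip_candidates and the rule "a name ending in .gz means compressed" are my own assumptions.

```python
def gzip_candidates(filenames, decompress):
    """Return the file names a gzip-aware completer might offer.

    Assumption: a file counts as compressed when its name ends in ".gz".
    With -d (decompress=True) only compressed files make sense as arguments;
    without it, only files that are not yet compressed do.
    """
    if decompress:
        return [name for name in filenames if name.endswith(".gz")]
    return [name for name in filenames if not name.endswith(".gz")]


if __name__ == "__main__":
    files = ["first.dmp", "second.dmp.gz"]  # hypothetical directory listing
    print(gzip_candidates(files, decompress=True))
    print(gzip_candidates(files, decompress=False))
```

The point is that the completer looks at the options already typed on the command line before deciding which files to offer.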
My interesting project for this quarter was writing a search engine to index blog entries. Attempting something like this without knowing anything about the resources required would probably be risky and stupid. But since this was just an educational project to understand search technology and to learn Java, capacity planning was the last thing on my mind.
Based on the resources I had, it was pretty clear to me that I couldn’t build another Yahoo or Google. Besides, who needs another one of those anyway? Also, when I started working on this project, Google hadn’t released their blog search engine. Needless to say, indexing blogs looked pretty interesting. How difficult could it be to build another Technorati anyway? A few servers running a crawler and a few database servers are all one needs, with a nice front end written in pretty PHP.
If a search engine were just about searching text in a database, it would be called a database. To build a search engine one needs to solve a few other parts of the puzzle. One of them was the “crawler”. If I had done my homework before I started, I would have found out that there are quite a few crawlers already available for free.
A crawler’s job is to crawl around the internet looking for new URLs, which in this case need to be RSS/Atom feeds. An unintelligent bot would probably go around in no particular order following new links. The crawler I wrote had a few optimizations which helped it collect a massive number of URLs very quickly.
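To make the idea concrete, here is a minimal sketch of the "unintelligent" version in Python: a plain breadth-first crawl that remembers which URLs it has seen and keeps the ones that look like feeds. It does not include the optimizations of my own crawler, and the names crawl, fetch_links and is_feed are hypothetical, chosen just for this example. The link fetcher is passed in as a function so the sketch runs without touching the network.

```python
from collections import deque


def crawl(seeds, fetch_links, is_feed, max_pages=100):
    """Breadth-first crawl that collects URLs recognised as feeds.

    seeds       -- starting URLs
    fetch_links -- hypothetical callable: url -> iterable of linked URLs
    is_feed     -- hypothetical callable: url -> True if it looks like a feed
    max_pages   -- stop after visiting this many URLs
    """
    seen = set(seeds)      # every URL ever queued, to avoid revisiting
    queue = deque(seeds)   # frontier, processed in first-in first-out order
    feeds = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        if is_feed(url):
            feeds.append(url)  # feeds are collected, not followed further
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return feeds


if __name__ == "__main__":
    # A tiny made-up link graph standing in for the web.
    graph = {"a": ["b", "c.xml"], "b": ["a", "d.rss", "c.xml"]}
    found = crawl(["a"],
                  lambda url: graph.get(url, []),
                  lambda url: url.endswith((".xml", ".rss")))
    print(found)
```

A real crawler would also need politeness delays, robots.txt handling and a better feed test than a file extension.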
Writing a crawler is simple because there are only so many HTTP versions. RSS is another story. Everyone and their friends are coming out with their own RSS extensions. Rome is how I got away with it. These guys are going a long way to build a generic RSS reader, and they make it a lot easier to add your own modules to it to read new add-ons.
Unfortunately, one thing which I haven’t investigated enough is the differences between the various feed formats. Rome makes it so transparent that I didn’t have to do much digging to accept the different formats. However, since it doesn’t make sense to crawl multiple feeds (in different formats) of the same site, I’ll have to do this exercise at some point to figure out which feed format is better for indexing purposes.
As I said before, Rome allows developers to add their own modules to support reading non-standard (or newer) XML tags which Rome doesn’t support by default. For example, I noticed some feeds had two versions of ‘description’, one longer than the other. When I noticed that Rome was picking the wrong one, I wrote a module to extract the longer description. Similarly, I also wrote an iTunes content extractor which helped Rome understand iTunes content in feeds.
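The "pick the longer description" idea is easy to show outside Rome. The sketch below is plain Python using the standard library XML parser, not Rome’s module API, and it assumes the two versions live in the description and content:encoded elements of an RSS item; the feeds I saw may have used other tags. The function name longest_description is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, in ElementTree's {uri}tag notation.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def longest_description(item_xml):
    """Return the longest description-like text found in one RSS <item>.

    Assumption: the competing versions are <description> and
    <content:encoded>. Returns "" when neither element is present.
    """
    item = ET.fromstring(item_xml)
    candidates = []
    for child in item:
        if child.tag in ("description", CONTENT_NS + "encoded"):
            candidates.append((child.text or "").strip())
    return max(candidates, key=len, default="")


if __name__ == "__main__":
    sample = (
        '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
        "<description>short</description>"
        "<content:encoded>a much longer body</content:encoded>"
        "</item>"
    )
    print(longest_description(sample))
```

Comparing by length is crude, but it matches what I needed: the longer text gives the indexer more to work with.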
Various organizations have studied this, and they have all concluded that we live in a very polluted blogging world, where people use blogs to skew search engine results.
Weblogs is a good place to see these spam logs in action. Some of the interesting patterns which I used to reduce the effect of spam were
This writeup is a work in progress… Here are some more notes which I still have to write about
This is an interesting movie of the space shuttle launch.