 

Archive for the ‘interesting’ Category

Google address translation

13 Jan

John Resig has come up with an example of how to translate a physical address into a latitude/longitude pair using a geocoder library. The examples he provided work only in the US and Canada.

 

Posted in google, interesting

 

H.R.3402: Shutting down anonymous posts on the internet?

09 Jan


Declan McCullagh mentions that H.R.3402 has provisions which make it a criminal offence to post a message without disclosing your true identity. The prohibition is part of the Violence Against Women and Department of Justice Reauthorization Act.

The prohibition apparently only restricts annoying anonymous messages. But who decides what’s annoying and what’s not? What’s not annoying to you could be extremely repugnant to me.

A lot of web discussion forums which allow anonymous comments, including services like Slashdot, would soon be full of criminals. And if this is really acted upon and users are forced to create accounts on each of these web services just to post a comment, how many of these services will you be posting your comments on? Personally I have a Slashdot account, and I rarely leave anonymous messages, but if someone comes up with a new service tomorrow which forces me to create an account with a new login name and password, I’m pretty sure I’ll be thinking twice about it.

If you are one of those who don’t mind creating accounts, but use the same or similar passwords everywhere, this would be a good opportunity for password harvesters to take your passwords.

Remember the PGP export embargo? Did that stop the world from using it? Is it really possible to enforce this on the internet if the law applies only in the US?

If you have read the bill or have some more information on this law I’m interested to find out more.

 

Posted in interesting, politics

 

bash on OS X

22 Dec

Funny thing happened… I had a directory with 2 dmp files. One was compressed using gzip, and one was uncompressed. When I typed gzip -d and pressed tab, it listed only the compressed file, which had a name ending in gz. No matter how many times I pressed tab, it didn’t pick the other file which I knew existed in that directory. So obviously, me being me, I typed gzip without -d on the shell and pressed tab… and voila, it listed the uncompressed file and didn’t list the other file which was already compressed.

I’ve been working with bash for a long time, and hadn’t noticed such a feature before… I was pleasantly surprised to find it.
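
This looks like bash’s programmable completion, where a per-command rule filters the candidate file names before they are offered. As a rough illustration only (this is Python, not the actual completion rule, and the function name gzip_completions is made up for this sketch), the filtering amounts to:

```python
def gzip_completions(command_line_words, filenames):
    """Return the file names a gzip-aware completion would offer.

    With -d on the command line only names ending in .gz are kept;
    without it only names that do not end in .gz are kept.
    """
    decompressing = "-d" in command_line_words
    return [name for name in filenames if name.endswith(".gz") == decompressing]


if __name__ == "__main__":
    files = ["a.dmp", "b.dmp.gz"]  # hypothetical directory contents
    print(gzip_completions(["gzip", "-d"], files))
    print(gzip_completions(["gzip"], files))
```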

 

Posted in apple, interesting

 

Fun writing a search engine

13 Dec

Introduction

My interesting project for this quarter was writing a search engine to index blog entries. Attempting something like this without knowing anything about the resources required would probably be risky and stupid. But since this was just an educational project to understand search technology and to learn Java, capacity planning was the last thing on my mind.

Based on the resources I had, it was pretty clear to me that I couldn’t build another Yahoo or Google. Besides, who needs another one of those anyway? Also, when I started working on this project, Google hadn’t released their blog search engine. Needless to say, indexing blogs looked pretty interesting. How difficult could it be to build another Technorati anyway? A few servers running a crawler, a few database servers and a nice front end written in pretty PHP is all one needs.

Crawling

If a search engine were just about searching for text in a database, it would be called a database. To build a search engine one needs to solve a few other parts of the puzzle. One of them is the “crawler”. If I had done my homework before I started off, I would have found out that there are quite a few crawlers already available for free.

A crawler’s job is to crawl around the internet looking for new URLs, which in this case need to be RSS/Atom feeds. An unintelligent bot would probably go around in no particular order following new links. The crawler I wrote had a few optimizations which helped it collect a large number of URLs very quickly.

  • Supply it with very rich seed URLs
  • Allow feeds themselves to be harnessed as sources of new feeds
  • Check sites like weblogs.com for new URLs periodically
  • Look for words like ‘rss’ or ‘atom’ in the URL and exclude URLs containing ‘gif’ or ‘jpg’; this helps identify the type of object without downloading it
  • One more thing which I should have done, but didn’t, was to use an HTTP HEAD request to get the content type before downloading the complete URL
  • When a URL is downloaded, parse it to look for more href links to other URLs
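
A minimal sketch of three of these heuristics, in Python rather than the Java the crawler was written in. The function names, the hint lists and the choice to match ‘gif’ and ‘jpg’ as file extensions are assumptions made for illustration, not the original code.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

FEED_HINTS = ("rss", "atom")          # words that suggest a feed URL
SKIP_EXTENSIONS = (".gif", ".jpg")    # objects not worth downloading


def looks_like_feed(url):
    """Guess from the URL alone whether it is worth fetching as a feed."""
    path = urlparse(url).path.lower()
    if path.endswith(SKIP_EXTENSIONS):
        return False
    return any(hint in url.lower() for hint in FEED_HINTS)


def head_content_type(url, timeout=10):
    """Ask the server for the content type with HEAD, without the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from every href attribute in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the href targets found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A crawler loop would call extract_links on each downloaded page, keep the results that pass looks_like_feed, and optionally confirm them with head_content_type before fetching the whole object.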

Identifying feeds

Writing a crawler is simple because there are only so many HTTP versions. RSS is another story. Everyone and their friends are coming out with their own RSS extensions. Rome is how I got away with it. These guys are going a long way to build a generic RSS reader, and they make it a lot easier to add your own modules to it to read new add-ons.

Unfortunately, one thing which I haven’t investigated enough is the differences between the various feed formats. Rome makes them so transparent that I didn’t have to do much digging to accept the different formats. However, since it doesn’t make sense to crawl multiple feeds (in different formats) of the same site, I’ll have to do this exercise at some point to figure out which feed format is better for indexing purposes.

As I said before, Rome allows developers to add their own modules to support reading non-standard (or newer) XML tags which Rome doesn’t support by default. For example, I noticed some feeds had two versions of ‘description’, one longer than the other. When I noticed that Rome was picking the wrong one, I wrote a module to extract the longer description. Similarly, I also wrote an iTunes content extractor which helped Rome understand iTunes content in feeds.
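
The “pick the longer description” idea can be sketched without Rome. The snippet below uses Python’s standard XML parser instead of a Rome module, and the set of tag names treated as description candidates is an assumption, since the post doesn’t say which tags the feeds actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of child tags that may carry an item's body text.
CANDIDATE_TAGS = {"description", "encoded", "summary", "content"}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def longest_description(item):
    """Return the longest text among the candidate children of a feed item."""
    texts = [
        (child.text or "").strip()
        for child in item
        if local_name(child.tag) in CANDIDATE_TAGS
    ]
    return max(texts, key=len, default="")


if __name__ == "__main__":
    item = ET.fromstring(
        '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
        "<description>Short teaser</description>"
        "<content:encoded>The full, much longer body of the post</content:encoded>"
        "</item>"
    )
    print(longest_description(item))
```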

Spam blogs

Various organizations have studied this, and they have all concluded that we live in a very polluted blogging world, where people use blogs to skew search engine results.

Weblogs.com is a good place to see these spam blogs in action. Some of the interesting patterns which I used to reduce the effect of spam were:

  • Ignore blogs which publish a large number of items at a time. Except for some news websites, most blogs publish one item at a time over a long period of time
  • I noticed that spam bloggers are very uncreative when it comes to finding a name for the blog. Most of them have a hyphen (-) in the blog name or blog domain name. Skipping such domains helped a lot in reducing spam
  • Free blog services are the largest source of spam blogs. Blogspot turned out to be the most annoying one when I analysed this problem
  • If I had some kind of language or phrase analyser, it would probably show that most of the items in the same spam feed look very similar to each other. I didn’t have time to do this, but I’m sure one of you guys will do it eventually
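
These patterns can be sketched as a few small checks. Everything here is illustrative: the threshold values and function names are made up, and the similarity check is the analysis I said I never got around to, approximated with Python’s difflib rather than any real language analyser.

```python
from difflib import SequenceMatcher
from itertools import combinations
from urllib.parse import urlparse

MAX_ITEMS_PER_FETCH = 20   # hypothetical cut-off for "too many items at once"


def has_hyphenated_host(feed_url):
    """True if the domain name of the feed contains a hyphen."""
    host = urlparse(feed_url).hostname or ""
    return "-" in host


def looks_like_spam(feed_url, blog_name, new_items_in_fetch):
    """Apply the burst-publishing and hyphen heuristics to one feed."""
    if new_items_in_fetch > MAX_ITEMS_PER_FETCH:
        return True
    return "-" in blog_name or has_hyphenated_host(feed_url)


def items_look_alike(texts, threshold=0.9):
    """True if the items of a feed are, on average, near-duplicates."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return False
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios) >= threshold
```

The hyphen rule will obviously throw away some legitimate blogs too, so a real filter would probably combine these signals into a score instead of rejecting on any single one.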

Other notes

This writeup is a work in progress… here are some more notes which I still have to write about:

  • You need to know databases very, very well to understand how to optimize searches
  • MySQL is good, but it lacks a few critical features which could improve your search engine a lot
  • Java is expensive… C/C++ would be better. But Java is very easy to develop in. Google started by using Python… so in short it’s possible
  • There is a lot of crap out there…
  • It’s possible that 60 to 70% of the blogs pinging weblogs.com are spam
  • You’d better be ready to handle UTF-8. I learnt it the hard way
  • And even if things are in Latin script, not everything is in English
  • If you are planning to index blogs, make sure you understand the languages…
  • But even if you do understand the languages… you should look at all the new words these teens come up with. I swear I didn’t recognize half of them as English until I really tried hard to understand them. I thought 3l33t was cool… but these are way ahead
  • Distributed computing is the way to go. One computer by itself can’t do much
  • Make sure you have plans for scaling ready at hand. You have no idea how fast these things grow
  • And when you start doing so many HTTP queries, weird things happen on the network. On mine, the router required a reset once every day. I’d never heard of this before… and when I called in to complain about this one-year-old product, even customer support hadn’t heard about it
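
On the UTF-8 point, one defensive pattern is to never trust the charset a feed claims to use. This is only a sketch of that idea in Python, with a made-up function name; it is not how my crawler handled it.

```python
def decode_feed_bytes(raw, declared_charset=None):
    """Decode downloaded bytes, falling back when the declared charset lies.

    Tries the declared charset first, then UTF-8, and finally Latin-1,
    which accepts any byte sequence (possibly producing wrong characters).
    """
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    return raw.decode("latin-1")
```

The Latin-1 fallback never fails, which keeps the indexer running, but it can silently mangle text, so it is worth logging whenever it is used.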
 

Posted in blogging, interesting

 

Interesting movie…

11 Dec

This is an interesting movie of the space shuttle launch.

 

Posted in interesting, science

 

Bluetooth on the way back

15 Aug
When the Danish King Harald Blåtand united Norway and Denmark, little did he know that a technology named after him (Blåtand translates to blue-tooth) would have a chance of becoming a cornerstone of the telecommunication industry.

This industry is one of the fastest growing sectors in today’s world, and whether you like it or not, it is constantly changing the world around you.

If it were not for the cell phone industry, we would still be hooked to our wired phones, and had it not been for the internet, e-mail would have been just a fantasy.

And in this fast changing world, one protocol which is growing very rapidly is Bluetooth. And just like everything before it, Bluetooth wasn’t created in a day. In fact it went through some rough times before it started catching on again.

The telecom industry today is not very different from what it was thousands of years ago. There still are many different ways to communicate, and some are more popular than others. But human ingenuity has, over time, led to the unification of communication protocols. Though it may look like they are doing the same thing, a telephone is very different from a cellphone, and a cellphone is different from a satellite phone. But they all manage to get along very well, and if I call your home phone line from a cellphone in the US over a satellite connection, it will still reach you and we’d still be able to talk. The internet is another perfect example of this unification, which brought together computers worldwide.

While people were still fascinated by the internet and wired networks, in the early 90s Ericsson predicted that the day was not far away when computers inside your home would talk wirelessly to other computers and even to other electronic devices like cell phones, digital cameras, keyboards and mice. In 1994 they started an effort to come up with a standard for devices to communicate with each other the way computers can over wired networks. This search for a new, inexpensive communication standard (protocol), which would allow one device to detect the presence of another and communicate with it using low powered radio signals, was soon joined by 5 companies. Unfortunately, in spite of some early success, the process of defining a standard had slowed down significantly by 1999, when the consortium had over 1200 participating companies. This is when Bluetooth’s problems started.

While Bluetooth was still in its infancy, a new protocol, IEEE 802.11, started gaining momentum. This new communication protocol was specifically designed for high speed communication between computers and networking devices using radio frequency. This was probably the toughest moment in the history of Bluetooth. Eclipsed by 802.11’s success, the Bluetooth standard was on the verge of extinction.

Interestingly, though 802.11 is faster, allows greater distances and supports many more communication features, its complexity requires the device to do more work and send stronger radio signals to be able to communicate with others. This inadvertently forces it to draw much more power. This was not a problem for devices which are plugged into a power outlet, or for laptops which are charged very frequently, but it definitely was a problem for devices like cellphones and digital cameras, which have very small battery capacity and can’t be connected to a power outlet for extended periods of time. This, together with the realization that Bluetooth devices are cheap to manufacture, marked the comeback of this unique protocol from the dead. IEEE 802.11 still has a very strong market presence, but Bluetooth has carved out a niche for itself with a very big fan base.

The most popular Bluetooth device today, and the one which best demonstrates the power and simplicity of this protocol, is the cellphone. Most new cellphones allow users to exchange phone numbers with the click of a button, some allow you to transfer files and photographs to and from your computer, and some even allow you to talk using a hands-free Bluetooth headset. Among the other devices which are very quickly catching on are Bluetooth enabled keyboards and mice, which replace PS/2 and USB wires and give users the freedom of moving around without being tied to their computers.

In fact the day is not far off when you will see Bluetooth in the remote controls of your televisions and VCRs, and maybe one day you will even be able to control your VCR from your cellphone. Bluetooth brings with it a freedom of communication with other devices which is unmatched by anything else in the communication industry today.

 

Posted in interesting

 

Google’s secret 301/302 bug

04 Dec

Introduction: I heard about this only today, but it seems like this is one of the most secret bugs which Google is being hit with right now. What’s interesting is that this has been going on for a while. I saw references to similar problems in posts made in 2003.

Problem: If site A points to site B using meta-refresh/redirects in a certain way, Google interprets it as if site A has the same content as site B. Based on what I saw in different posts across the internet, site A doesn’t need to host any replicated content. It just needs a meta-refresh pointing to site B. This by itself is not the problem, however, since the most popular site will still show up first on the Google search pages. It becomes a problem if the redirect is initiated by a page which has a higher PR (PageRank) within Google. So if site A somehow has a higher PR, it could effectively hijack site B by abusing its PR through this kind of redirect to site B.

Analysis: There are many ways of doing a redirect using HTTP return status codes.

Also, it’s possible to use “meta-redirects” within pages, which can do a “refresh” to another page. A “meta-redirect” is the equivalent of a 302 at the HTML layer. If this bug is for real, it must be within the page retrieval engine in the Google robot which “gets” the page for the robot. There are some applications, and probably some Perl modules, which automatically retrieve redirected pages even if the original request didn’t specifically ask the module to recursively request the redirected object.
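
To make the two redirect styles concrete, here is a small Python sketch of how a page fetcher might tell them apart. It says nothing about how Google’s robot actually works; the function and class names are made up, and only the 301/302 status codes and the meta refresh tag come from the discussion above.

```python
import re
from html.parser import HTMLParser

REDIRECT_STATUSES = {301: "permanent", 302: "temporary"}


class MetaRefreshFinder(HTMLParser):
    """Finds the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; url=http://b.example/"
        match = re.search(
            r"url\s*=\s*['\"]?([^'\";]+)",
            attributes.get("content", ""),
            re.IGNORECASE,
        )
        if match:
            self.target = match.group(1).strip()


def describe_redirect(status, location=None, body=""):
    """Return (kind, target) for a response, or None if it is no redirect."""
    if status in REDIRECT_STATUSES and location:
        return REDIRECT_STATUSES[status], location
    finder = MetaRefreshFinder()
    finder.feed(body)
    if finder.target:
        return "meta-refresh", finder.target
    return None
```

A fetcher that silently follows every one of these and then files the content of site B under the URL of site A would produce exactly the kind of confusion described in the problem statement.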


 

Posted in Uncategorized, google, interesting

 

ReplayTV hacking

10 Feb

ReplayTV coding is not exactly related to security, but I’m adding it here because it’s all about hacking. I’ll keep posting the ReplayTV scripts I work on here: http://www.royans.net/security/projects/replaytv/

 

Posted in hacking, interesting