Archive for the ‘blogging’ Category

Sysadmin Day

Friday, July 28th, 2006

Pat yourselves on the back for fixing all those servers,
- doing backup, recovery and user creation.
Pat yourselves for saying no to root and yes to sudo,
- for writing ACLs and scripting voodoo…

Pat again for waking at 2am
- just to put your cellphone on charge.
..for dealing with people
- who wanted everything a day past

Pat again for reading 650 mails a day.
- for blocking SYNFIN floods on your network
..for carrying those secure-ids
- even while you are not at work.

When you are done patting… please stop by a bar
- pick your pagers and throw away..
’cause you all need a break once in a while
- at least on the freaking System Admin Day !!

Notes: WikiMapia, Digg, IPv6, Flock and Google Sync.

Sunday, June 25th, 2006

WikiMapia

  • This is the first time I have stumbled upon WikiMapia, which looks like a wiki of maps. It is a very interesting and creative idea. WikiMapia uses the Google Maps API and lets users mark places and add text to locations around the world.
  • It’s like a large world map with people scribbling all over it. Google recently updated its global map database to include some very high-resolution satellite images from around the world, which makes WikiMapia an even more interesting new service to look out for.

Digg

  • Digg has been around for just over a year and has already surpassed Slashdot in traffic volume. The Digg 3.0 release party demoed some really interesting new tools, which are set to come out soon after the 3.0 release on Monday. The one tool that already exists is Digg Spy.

IPv6

  • The US government plans to enable IPv6 on its backbone routers by 2008.
  • Comcast is probably the first large organization to have started deploying IPv6. Here are some interesting presentation slides from one of their talks.
  • I looked up ARIN and noticed that Google, Microsoft and Cisco each have a /32 assigned to them, which is a significant allocation. Even though ARIN policy more or less states that a /32 allocation requires the recipient to act as an ISP and hand out at least 200 blocks to smaller ISPs or organizations within 5 years, I don’t think this is enforced. Cisco, for example, has had its IPv6 block since 2000 and is well past its 5-year limit.
  • Apparently, as I found out while reading up on IPv6, multihoming is not yet standardized even though IPv6 is already being deployed.

Flock

  • If you like Firefox you’ll like Flock too. Just as the web is slowly moving towards Web 2.0, Flock is a kind of extension to the Firefox experience that gives you a “Web 2.0 rich” experience.
  • Features like social tagging, blogging and photo sharing are built into the browser. But what I liked best in Flock is its implementation of the RSS news reader.
  • Flock beta 1 was released on June 13th.

Google Sync

  • Google Sync is a Firefox plugin which claims to synchronize your browser settings with your Gmail account so that you can carry them with you when you switch desktops.
  • Unfortunately, though Flock is based on Firefox, it’s not supported, which is a shame because I primarily use Flock. However, there is a hacked version of Google Sync which will work for Flock here.
  • BTW, I think Google Sync is far from mature, because over the weekend it managed to lock up my Firefox browser on Windows XP, and even a reboot doesn’t bring it back up.

18 lessons on blogging

Tuesday, December 20th, 2005

Here are 18 Lessons I’ve Learnt about Blogging: Blog Tips at ProBlogger.

Fun writing a search engine

Tuesday, December 13th, 2005

Introduction

My interesting project for this quarter was writing a search engine to index blog entries. Attempting something like this without knowing anything about the resources required would normally be risky and stupid. But since this was just an educational project to understand search technology and to learn Java, capacity planning was the last thing on my mind.

Based on the resources I had, it was pretty clear to me that I couldn’t build another Yahoo or Google. Besides, who needs another one of those anyway? Also, when I started working on this project, Google hadn’t released their blog search engine yet. Needless to say, indexing blogs looked pretty interesting. How difficult could it be to build another Technorati anyway? A few servers running a crawler, a few database servers and a nice front end written in pretty PHP are all one needs.

Crawling

If a search engine were just about looking up text in a database, it would be called a database. To build a search engine one needs to solve a few other parts of the puzzle. One of them is the “crawler”. If I had done my homework before I started off, I would have found out that there are quite a few crawlers already available for free.

A crawler’s job is to crawl around the internet looking for new URLs, which in this case need to be RSS/Atom feeds. An unintelligent bot would probably go around in no particular order following new links. The crawler I wrote had a few optimizations which helped it collect a massive number of URLs very quickly.

  • Supply it with a rich set of seed URLs
  • Mine the feeds themselves for links to new feeds
  • Check sites like weblogs.com for new URLs periodically
  • Look for words like ‘rss’ or ‘atom’ in the URL and exclude URLs containing ‘gif’ or ‘jpg’, which helps identify the type of object without downloading it
  • One more thing which I should have done, but didn’t, was to use an HTTP HEAD request to get the content type before downloading the complete document.
  • When a URL is downloaded, its content is parsed to look for more href links to other URLs.
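
The URL heuristics, the HEAD check and the href parsing from the list above can be sketched roughly as follows. This is not the original crawler (which was written in Java); it is a minimal Python sketch using only the standard library, and the names `looks_like_feed`, `extract_links` and `content_type`, as well as the exact hint and extension lists, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

# Assumed hint lists; tune them for your own crawl.
FEED_HINTS = ("rss", "atom")
SKIP_EXTENSIONS = (".gif", ".jpg")


def looks_like_feed(url):
    """Guess from the URL text alone whether it may be a feed."""
    lowered = url.lower()
    if urlparse(lowered).path.endswith(SKIP_EXTENSIONS):
        return False  # obviously an image, not worth downloading
    return any(hint in lowered for hint in FEED_HINTS)


class LinkExtractor(HTMLParser):
    """Collects every href attribute in a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all href targets found in the downloaded page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def content_type(url, timeout=10):
    """Ask for headers only (HTTP HEAD) and return the Content-Type."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()
```

A crawler loop would call `extract_links` on each downloaded page, keep the URLs for which `looks_like_feed` returns true, and optionally confirm them with `content_type` before fetching the body. The substring match is deliberately crude (a URL containing ‘anatomy’ would also match ‘atom’), so treat it as a cheap first filter only.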

Identifying feeds

Writing a crawler is simple because there are only so many HTTP versions. RSS is another story. Everyone and their friends are coming out with their own RSS extensions. Rome is how I got away with it. These guys are going a long way to build a generic RSS reader, and they make it a lot easier to add your own modules to it to read new add-ons.

Unfortunately, one thing which I haven’t investigated enough is the differences between the various feed formats. Rome makes it so transparent that I didn’t have to do much digging to accept the different formats. However, since it doesn’t make sense to crawl multiple feeds (in different formats) of the same site, I’ll have to do this exercise at some point to figure out which feed format is better for indexing purposes.

As I said before, Rome allows developers to add their own modules to support reading non-standard (or newer) XML tags which Rome doesn’t support by default. For example, I noticed some feeds had two versions of ‘description’, one longer than the other. When I noticed that Rome was picking the wrong one, I wrote a module to extract the longer description. Similarly, I also wrote an iTunes content extractor which helped Rome understand iTunes content in feeds.
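
The “pick the longer description” idea can be illustrated outside of Rome. The sketch below does not use Rome’s module API at all; it is a plain Python standard-library illustration, and the function name `longest_description` and the set of candidate tag names are assumptions, not something taken from the original module.

```python
import xml.etree.ElementTree as ET

# Assumed tag names that may carry an item's body text.
CANDIDATE_TAGS = {"description", "encoded", "summary", "content"}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def longest_description(item):
    """Return the longest text among the candidate child elements."""
    texts = [
        (child.text or "").strip()
        for child in item
        if local_name(child.tag) in CANDIDATE_TAGS
    ]
    return max(texts, key=len, default="")
```

For an item carrying both a short `description` and a longer `content:encoded` element, this returns the longer text; for an item with neither, it returns an empty string.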

Spam blogs

Various organizations have studied this, and they have all concluded that we live in a very polluted blogging world, where people use blogs to skew search engine results.

Weblogs.com is a good place to see these spam blogs in action. Some of the interesting patterns which I used to reduce the effect of spam were:

  • Ignore blogs which publish a large number of items at a time. Except for some new websites, most blogs publish items one at a time over a long period
  • I noticed that spam bloggers are very uncreative when it comes to naming a blog. Most of them have a - in the blog name or blog domain name. Skipping such domains helped a lot in reducing spam.
  • Free blog services are the largest source of spam blogs. Blogspot turned out to be the most annoying one when I analysed this problem
  • If I had some kind of language or phrase analyser, it would probably show that most of the items in a spam feed look very similar to each other. I didn’t have time to do this, but I’m sure one of you guys will do it eventually
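
Three of the patterns above (burst publishing, a - in the domain name, and near-identical items) are easy to sketch. The code below is a hedged illustration in Python, not the filter actually used in the project: the thresholds `MAX_ITEMS_PER_UPDATE` and `SIMILARITY_THRESHOLD` are made-up values, all function names are hypothetical, and `difflib` is only a crude stand-in for the phrase analyser that was never written.

```python
from difflib import SequenceMatcher
from itertools import combinations
from urllib.parse import urlparse

MAX_ITEMS_PER_UPDATE = 10    # assumed cut-off for "a large number of items"
SIMILARITY_THRESHOLD = 0.8   # assumed cut-off for "very similar" items


def has_hyphenated_host(feed_url):
    """True if the blog's domain name contains a '-'."""
    host = urlparse(feed_url).hostname or ""
    return "-" in host


def publishes_in_bursts(items_per_update):
    """True if any single update carried suspiciously many items."""
    return max(items_per_update, default=0) > MAX_ITEMS_PER_UPDATE


def items_look_alike(item_texts):
    """True if the items of one feed are, on average, near-duplicates."""
    pairs = list(combinations(item_texts, 2))
    if not pairs:
        return False
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs) >= SIMILARITY_THRESHOLD


def spam_signals(feed_url, items_per_update, item_texts):
    """List which of the three heuristics fired for a feed."""
    signals = []
    if has_hyphenated_host(feed_url):
        signals.append("hyphenated host")
    if publishes_in_bursts(items_per_update):
        signals.append("burst publishing")
    if items_look_alike(item_texts):
        signals.append("near-duplicate items")
    return signals
```

Each check on its own produces false positives (plenty of honest blogs have a hyphen in their domain), so it is probably safer to combine the signals than to drop a feed on the first one that fires. Comparing every pair of items is quadratic, which is fine for one feed but not for a whole index.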

Other notes

This writeup is a work in progress… here are some more notes which I still have to write about:

  • You need to know databases very, very well to understand how to optimize searches
  • MySQL is good, but it lacks a few critical features that could improve your search engine a lot
  • Java is expensive… C/C++ would be better. But Java is very easy to develop in. Google started by using Python… so in short it’s possible
  • There is a lot of crap out there…
  • It’s possible that 60 to 70% of the blogs pinging weblogs.com are spam
  • You’d better be ready to handle UTF-8. I learnt it the hard way
  • And even if things are in Latin script, not everything is in English
  • If you are planning to index blogs, make sure you understand the languages…
  • But even if you do understand the languages… you should look at all the new words these teens come up with. I swear I didn’t recognize half of them as English until I really tried hard to understand them. I thought 3l33t was cool… but these are way ahead
  • Distributed computing is the way to go. One computer by itself can’t do much
  • Make sure you have plans for scaling ready at hand. You have no idea how fast these things grow.
  • And when you start making so many HTTP requests, weird things happen on the network. On mine, the router needed a reset once every day. I’d never heard of this before… and when I called in to complain about this one-year-old product, even customer support hadn’t heard about it

RSS Hijacking on the rise…

Sunday, December 11th, 2005

RSS hijacking is probably unavoidable in the short term. Here is an interesting discussion on this topic.