December 22, 2004

Using Google to fix your 404 errors (Part II)

A few weeks ago I wrote a small hack that uses Google to handle 404 errors. You can find that article here: Using Google to handle website 404 errors

Unfortunately, even though it works, it's not optimal. Here are a few drawbacks I noticed:

  • I was using meta redirects, and some bots didn't understand them very well.

  • A meta redirect generates a 302 (temporary move) instead of a 301 (permanent move).

  • Some bots and browsers kept refreshing the same page in an endless loop for some reason.
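For contrast, here is roughly the difference between the two approaches as a sketch (not the exact code from Part I, and the URL is just an example): the old version buried the redirect in a meta tag inside the HTML body, while the proper fix is to say it in the HTTP headers, where every client sees it.

#!/usr/bin/perl
# Old approach (sketch): a meta refresh inside the HTML.
# Bots have to parse the page to find it, and it carries no 301/302 semantics.
print "Content-type: text/html\n\n";
print '<meta http-equiv="refresh" content="0;url=http://www.royans.net/">';

# Better approach: a real 301 in the HTTP headers instead.
# print "Status: 301 Moved Permanently\n";
# print "Location: http://www.royans.net/\n\n";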


So, in frustration, I wrote another piece of code. This time I'm using the Google Web API to get the results internally, instead of forcing the user to go to the Google website for the first hit. Here is the code I'm using. Please remember to put your Google key in the right place before you try it out yourself.
#!/usr/bin/perl
use strict;
use SOAP::Lite;
my $request=$ENV{REQUEST_URI};
my $httphost=$ENV{HTTP_HOST};
my @found=();
my $foundtext="";
my $lookfor=&fix($request);
my $site="www.royans.net";

if ($httphost =~/security/i) {$site="security.royans.net";}
if ($httphost =~/desijokes/i) {$site="desijokes.royans.net";}

&getnewurl($lookfor,"$site");
print "Status: 301 Moved Permanently\n";
print "Location: $found[0]\n";
print "Content-type: text/html\n\n";

print "$foundtext";

## This removes some characters to help google do a better search based on content rather than the file name
sub fix
{
my ($lookfor)=@_;
$lookfor=~s/\// /g;  # slashes
$lookfor=~s/\./ /g;  # dots
$lookfor=~s/\?/ /g;  # question marks
$lookfor=~s/-/ /g;
$lookfor=~s/_/ /g;
return $lookfor;
}

sub getnewurl
{
my ($lookingfor,$site)=@_;

my $google_key='Your Google Key here';
my $google_wdsl = "/home2/rkt/www/cgi-bin/GoogleSearch.wsdl";
my $query = "$lookingfor site:$site";
my $google_search = SOAP::Lite->service("file:$google_wdsl");

my $results = $google_search -> doGoogleSearch( $google_key, $query, 0, 10, "false", "", "false", "", "latin1", "latin1");

@{$results->{resultElements}} or exit;
foreach my $result (@{$results->{resultElements}}) {
push @found, $result->{URL};
$foundtext .= "<p><b>$result->{title}</b><br>\n";
$foundtext .= "<a href=\"$result->{URL}\">$result->{URL}</a><br>\n";
$foundtext .= "$result->{snippet}</p>\n";
}
}
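One thing the script assumes but doesn't show: the web server has to hand 404s to it in the first place. With Apache this is a one-line directive (the script path here is just an example; use wherever you installed it). REQUEST_URI, which the script reads, still holds the URL the visitor originally asked for when Apache invokes the error handler.

# send all 404s to the CGI script (path is an example)
ErrorDocument 404 /cgi-bin/404handler.pl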

December 04, 2004

Google's secret 301/302 bug

Introduction: I heard about this only today, but it seems like this is one of the most secret bugs Google is being hit with right now. What's interesting is that this has been going on for a while: I saw references to similar problems in posts from 2003.

Problem: If site A points to site B using a meta-refresh/redirect in a certain way, Google interprets it as though site A has the same content as site B. Based on what I saw in different posts across the internet, site A doesn't need to host any replicated content; it just needs a meta refresh pointing to site B. This by itself is not the problem, however, since the most popular site will still show up first on the Google search pages. It becomes a problem if the redirect is initiated by a page which has a higher PR (PageRank) within Google. So if site A somehow has a higher PR, it could effectively hijack site B by abusing that PR with this kind of redirect.
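As a sketch of what the abuse looks like, site A's page needs nothing more than this (the hostname is made up):

<html>
<head>
<!-- zero-second meta refresh pointing at the victim site -->
<meta http-equiv="refresh" content="0;url=http://site-b.example/">
</head>
<body></body>
</html>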

Analysis: There are many ways of doing a redirect using an HTTP return status.

Also, it's possible to use "meta redirects" within pages, which can do a "refresh" to another page; a meta redirect is the equivalent of a 302 at the HTML layer. If this bug is for real, it must be in the page-retrieval engine of the Google robot, the part which "gets" the page. There are some applications, and probably some Perl modules, which automatically retrieve redirected pages even if the caller didn't specifically ask the module to follow the redirect.
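LWP is a good example of this behaviour: its user agent follows 301/302 redirects transparently by default, so code built on it ends up with the destination page without ever asking for it. A small sketch (the URL is hypothetical):

#!/usr/bin/perl
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# GET is redirectable by default, so this silently follows the 301/302
# and returns the content of whatever page the redirect points at.
my $response = $ua->get('http://site-a.example/');
print $response->content;

# To see the redirect itself instead of following it,
# turn off automatic redirection first:
$ua->requests_redirectable([]);
my $raw = $ua->get('http://site-a.example/');
print "Got status: ", $raw->code, "\n";   # 301 or 302 here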

References: