Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).


Papers from Graphics Interface magazine

Something I came across at work today that I know I'm going to want to refer to later on in life: Some collected pre-1996 GI papers.

(In case it's not obvious: is not my job; it's what I hack on when I'm at home. I work writing image processing software in C++ at a smallish software/hardware company in New Zealand.)


How the ecosystem crawl works

Dave: "Interesting. I thought it would show up because I (and others) link to it from my blogroll. I guess the crawl you do is only one level deep? How often do you read the XML file?"

You're right: the crawl is only one level deep. Here are the details:

I've got one big text file which lists all the pages to download and specifies titles for them. The crawler (a Python script) reads in that text file and drops everything in a big hash table, munging the URLs a bit to detect duplicates (that, and are the same site, for example).
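A minimal sketch of the load-and-dedupe step, not the actual crawler code. The one-entry-per-line "URL title" file format, the function names, and the specific munging rules (dropping a leading "www.", a trailing slash, and a trailing index.html) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a key so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    # Assumed rule: www.example.com and example.com are the same site.
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    # Assumed rule: /, /index.html and an empty path are the same page.
    for suffix in ("index.html", "index.htm"):
        if path.endswith("/" + suffix):
            path = path[: -len(suffix)]
    return host + path.rstrip("/")


def load_blog_list(lines):
    """Build {key: (url, title)}, keeping the first entry seen for each key."""
    blogs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Hypothetical format: the URL, a space, then the title.
        url, _, title = line.partition(" ")
        blogs.setdefault(normalize_url(url), (url, title.strip()))
    return blogs
```

Passing an open file to load_blog_list works because it only needs an iterable of lines; a later duplicate is silently ignored in favour of the first entry.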

Now it runs through all the URLs, downloading them if it doesn't already have a cached copy, processing all the HTML, downloading data if necessary, and tallying up all the links.
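Roughly, that loop could look like the sketch below. It is an illustration under assumptions, not the real script: the cache layout (one file per URL, named by a SHA-1 of the URL), the function names, and the choice to count each linking page at most once per target host are all mine.

```python
import hashlib
import os
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page over the network and return it as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def cached_fetch(url, cache_dir, fetcher=fetch):
    """Return the cached copy of url, downloading it only if it is missing."""
    os.makedirs(cache_dir, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = fetcher(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html


def tally_links(urls, cache_dir, fetcher=fetch):
    """Count, for each host, how many of the crawled pages link to it."""
    inbound = Counter()
    for url in urls:
        parser = LinkCollector()
        parser.feed(cached_fetch(url, cache_dir, fetcher))
        source_host = urlsplit(url).hostname
        targets = set()
        for href in parser.links:
            host = urlsplit(urljoin(url, href)).hostname
            # Ignore links a site makes to itself.
            if host and host != source_host:
                targets.add(host)
        inbound.update(targets)
    return inbound
```

Because only the listed pages are fetched and their outbound links are merely counted, never followed, this stays a one-level crawl.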

Once it's downloaded everything, it spits out ecosystem/index.html and all the stats pages.

Every night (or morning, or whenever I feel like it) I run a bash script that pulls down and, runs a Perl script that converts the blogs found within into the same format as my blog list file, then passes the output through another Perl script that removes all entries that are already in the master blog list. I then cat the output from all that onto the end of the blog list.
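The "remove entries already in the master list" step is the interesting part of that pipeline. The real thing is a Perl script; the sketch below is a Python stand-in, assuming (hypothetically) that each line of the blog list starts with the URL followed by whitespace and the title.

```python
def new_entries(candidate_lines, master_lines):
    """Return the candidate lines whose URL is not already in the master list."""

    def key(line):
        # Assumed format: the URL is the first whitespace-separated field.
        parts = line.split(None, 1)
        return parts[0].rstrip("/") if parts else ""

    known = {key(line) for line in master_lines if line.strip()}
    fresh = []
    for line in candidate_lines:
        k = key(line)
        if k and k not in known:
            known.add(k)  # also drops duplicates within the candidates
            fresh.append(line.rstrip("\n"))
    return fresh
```

Appending the result to the master file then does the job of the final cat: open the blog list in "a" mode and write each returned line followed by a newline.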

I have a Python script that gives me some info on the cache:

phil@icicle:~/crawler$ ./
August 08 - 1159
August 09 - 1304
August 10 - 1044
August 11 - 752
August 12 - 892
August 13 - 635

Total: 5786
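A script producing a report like this could be as small as the following sketch. It assumes the cache is a flat directory with one file per page and that a file's modification time is when it was downloaded; neither detail is confirmed above, and the function names are made up.

```python
import os
import time
from collections import Counter


def cache_ages(cache_dir):
    """Count cached files per calendar day, keyed by 'YYYY-MM-DD'."""
    per_day = Counter()
    with os.scandir(cache_dir) as entries:
        for entry in entries:
            if entry.is_file():
                mtime = time.localtime(entry.stat().st_mtime)
                per_day[time.strftime("%Y-%m-%d", mtime)] += 1
    return per_day


def format_report(per_day):
    """Render the counts oldest day first, followed by a total."""
    lines = []
    for day in sorted(per_day):
        label = time.strftime("%B %d", time.strptime(day, "%Y-%m-%d"))
        lines.append(f"{label} - {per_day[day]}")
    lines.append("")
    lines.append(f"Total: {sum(per_day.values())}")
    return "\n".join(lines)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the cache directory as the first argument.
    if len(sys.argv) > 1:
        print(format_report(cache_ages(sys.argv[1])))
```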

There is also a bash script that removes the oldest 100 entries from the cache. I run this a few times before starting a crawl, to guarantee that it will refresh at least a few blog pages each time. I'm aiming to keep the whole cache younger than 7 days; as you can see from the above numbers, the earliest pages I have are from August 8.
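The eviction step is a bash script in real life; here is the same idea sketched in Python to match the rest of the crawler. It again assumes a flat cache directory where modification time stands in for download time, and evict_oldest is a made-up name.

```python
import os


def evict_oldest(cache_dir, count=100):
    """Delete the `count` least recently modified files; return their names."""
    with os.scandir(cache_dir) as entries:
        files = [e for e in entries if e.is_file()]
    # Oldest first, so the slice below picks the stalest entries.
    files.sort(key=lambda e: e.stat().st_mtime)
    removed = []
    for entry in files[:count]:
        os.remove(entry.path)
        removed.append(entry.name)
    return removed
```

Since the crawler only downloads pages that are missing from the cache, each call forces up to `count` of the stalest pages to be fetched again on the next run.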

Did I cover everything? :)