For some reason right now I'm motivated to rewrite or tidy up some of the bits of the old Blogging Ecosystem code and release them as mini-projects. The other day's URL stemmer code was the first bit.
The next one will be a bit of C++ code (with a Python wrapper) to do longest prefix matching on parts of URLs using hashes. This sort of thing comes in handy for blog-crawler-type applications if you don't have a reliable stemmer (and I don't expect my URL stemmer to work 100% of the time). Basically, you give it a big list of blog URLs up front, and after that you can give it any URL and it will tell you whether that URL starts with one of the blog URLs.
It works by repeatedly querying the hash table with less and less of the URL each time. So if you give it
http://foo.bar.com/users/1234/weblog/2002/04/01/#my-post, it will check whether any of the following match a known blog URL:
It would probably match the one in bold above.
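To make the idea concrete, here is a rough Python sketch of the lookup loop. It is not the C++ code I'm talking about releasing, just an illustration: the names candidate_prefixes and PrefixMatcher are made up for this post, a plain Python set stands in for the hash table, and I'm assuming the URL gets shortened one path segment at a time (cutting at each "/") and that the known blog URLs are stored with a trailing slash.

```python
def candidate_prefixes(url):
    """Yield the URL itself, then ever-shorter prefixes cut at '/' boundaries.

    Assumption: prefixes are shortened one path segment at a time and
    never cut into the scheme or host part.
    """
    scheme_end = url.find("://")
    host_start = scheme_end + 3 if scheme_end != -1 else 0
    yield url
    end = len(url)
    while True:
        # Look for the last '/' before the final character of the current
        # prefix, so a prefix that already ends in '/' still gets shorter.
        cut = url.rfind("/", host_start, end - 1)
        if cut == -1:
            break
        end = cut + 1
        yield url[:end]


class PrefixMatcher:
    """Hypothetical stand-in for the hash-based matcher described above."""

    def __init__(self, blog_urls):
        # A set gives hash-table lookups; blog URLs are stored as given.
        self._known = set(blog_urls)

    def longest_match(self, url):
        # Candidates come out longest first, so the first hit is the
        # longest known blog URL that the given URL starts with.
        for prefix in candidate_prefixes(url):
            if prefix in self._known:
                return prefix
        return None


if __name__ == "__main__":
    matcher = PrefixMatcher(["http://foo.bar.com/users/1234/weblog/"])
    post = "http://foo.bar.com/users/1234/weblog/2002/04/01/#my-post"
    for prefix in candidate_prefixes(post):
        print(prefix)
    print("match:", matcher.longest_match(post))
```

Because the candidates are tried longest first, a URL that sits under two known blog URLs (say a site root and a weblog directory beneath it) resolves to the more specific one. Each lookup costs at most one hash probe per path segment.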
You can do this with a database, but it seems that databases don't tend to squish everything into as small a space as they could, so you only get about 10% of the performance of C++ code like what I'll release (sometime).
Anyway - if you are interested, leave a comment here. I'll post about it on this blog when I'm done.