Reading Leonard Richardson’s paper about his Ultra Gleeper recommendation engine, I notice that he’s run into the problem of stemming weblog URLs.
I wrote a reasonable stemmer back when I was running the [[Blogging Ecosystem]]. If I remember, when I’ve got some free time, I’ll dig it out and improve it so that it does a better job of matching more modern[1] URLs. It’s a function that would be handy to have in an open source library.
Update: I’ve done this and put up a first cut of the code; there will be more posts on this blog under the ‘urlstemmer’ topic as things proceed.
1. Back in 2002 and 2003, when the ecosystem was operating, people tended to use either simple MT-style archive links (“/archives/12345.html”) or dated ones (“/2003/2/2.html”), whereas now it’s quite popular to put your post title in the URL.
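To make the idea concrete, here is a rough Python sketch of what stemming those three URL styles might look like. It is not the urlstemmer code itself: the function name `stem_weblog_url`, the regular expressions, and the choice to drop the scheme and a leading "www." are all assumptions made for illustration, and real weblog URLs will need many more cases.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns for the per-post URL styles described in the footnote.
# Each one matches only the trailing, post-specific part of the path.
_POST_PATTERNS = [
    # MT-style archive link: /archives/12345.html
    re.compile(r"/archives/\d+\.html?$", re.IGNORECASE),
    # Dated link: /2003/2/2.html
    re.compile(r"/\d{4}/\d{1,2}/\d{1,2}\.html?$", re.IGNORECASE),
    # Post title in the URL: /2005/06/my-post-title/ (day component optional)
    re.compile(
        r"/\d{4}/\d{1,2}(?:/\d{1,2})?/[a-z0-9][a-z0-9_-]*(?:\.html?|/)?$",
        re.IGNORECASE,
    ),
]


def stem_weblog_url(url):
    """Reduce a post URL to a guess at the weblog's base URL (host + path)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: www.example.com and example.com are the same weblog.
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    for pattern in _POST_PATTERNS:
        stripped, count = pattern.subn("", path)
        if count:
            path = stripped
            break
    # Normalise to exactly one trailing slash so equal stems compare equal.
    return host + path.rstrip("/") + "/"
```

With this sketch, "http://www.example.com/archives/12345.html" and "http://example.com/2005/06/my-post-title/" both stem to the same string, so links to different posts on one weblog can be counted together.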