OK, I think I’ve cracked the problem of how to do a proper search engine for [[Radio]] weblogs. The problem with using an existing search engine (like [[ht://Dig]], which I’ve wasted[1] far too much time getting to understand) is that it indexes whole HTML pages. This is fine for most sites, but Radio blogs[2] put many posts on a single page, and all blogs put heaps of junk (blogrolls, etc) around the outside of the posts, making it easy for a search engine to get sidetracked in its search for real content.
So now I’m hacking away on a standalone search engine that will integrate with Radio at a much deeper level. The concept is that Radio will inform the engine directly of the content of posts, and it’ll index that rather than trying to extract the info from the HTML directly.
This will result in search capabilities pretty much equivalent to Blosxom’s (more), which are the best I’ve seen so far.
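To make the idea concrete, here is a rough sketch of indexing at the post level rather than the page level. It is not the engine itself: Lucene would do the real indexing, and a plain in-memory map stands in for it here. The names `PostIndex`, `addPost`, `removePost` and `search` are made up for illustration, as is the assumption that Radio hands over a post id plus the post's plain text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-post index: the weblog tool pushes each post in, so no HTML page is ever parsed. */
class PostIndex {
    // term -> ids of the posts containing that term (insertion order is kept)
    private final Map<String, Set<String>> postings = new HashMap<>();
    // post id -> terms of that post, so an edited post can be cleanly replaced
    private final Map<String, Set<String>> termsByPost = new HashMap<>();

    /** Index one post; re-submitting the same id replaces the earlier version. */
    void addPost(String postId, String text) {
        removePost(postId);
        Set<String> terms = tokenize(text);
        termsByPost.put(postId, terms);
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(postId);
        }
    }

    /** Drop a post from the index; unknown ids are ignored. */
    void removePost(String postId) {
        Set<String> oldTerms = termsByPost.remove(postId);
        if (oldTerms == null) {
            return;
        }
        for (String term : oldTerms) {
            Set<String> ids = postings.get(term);
            if (ids != null) {
                ids.remove(postId);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
    }

    /** Return the ids of posts that contain every term of the query. */
    List<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> ids = postings.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new LinkedHashSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    /** Lower-case the text and split it on anything that is not a letter or digit. */
    private static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

Because each entry is a single post, a hit points at the post itself, and blogroll text never enters the index since it is never submitted.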
I’m doing it in Java, because that means I can use [[Lucene]] ([[Lupy]] is good, but not quite complete yet). Other useful bits are [[Tomcat]] for the web serving, [[HSQLDB]] for data storage, and [[Apache XML-RPC]] for communications. This is my first proper Java project (I started one a few months back, but that never went anywhere in the end), so I’m going through the learning curve at the same time, which makes things interesting. Let’s see how this goes …
[1] At least I now know a fair bit about ht://Dig’s htsearch module and its build process …
[2] (or blogs made with anything other than Movable Type, which defaults to archiving each post on a new page)