OK, I think I've cracked the problem of how to do a proper search engine for Radio weblogs. The problem with using an existing search engine (like ht://Dig, which I've wasted[1] far too much time getting to understand) is that it indexes whole HTML pages. This is fine for most sites, but Radio blogs[2] put many posts on a single page, and all blogs put heaps of junk (blogrolls, etc.) around the outside of the posts, making it easy for a search engine to get sidetracked in its search for real content.
So now I'm hacking away on a standalone search engine that will integrate with Radio at a much deeper level. The concept is that Radio will inform the engine directly of the content of posts, and it'll index that rather than trying to extract the info from the HTML directly.
This will result in search capabilities pretty much equivalent to Blosxom's (more), which are the best I've seen so far.
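To make the idea concrete, here's a rough sketch of per-post indexing in plain Java. It is not Lucene and not the actual engine: `PostIndex`, `indexPost`, `removePost` and `search` are names made up for illustration, and the assumption is that something on the Radio side (say, an XML-RPC call) hands over each post's ID and text, so no HTML ever gets parsed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the real index: a tiny in-memory inverted index
// keyed by post ID rather than by HTML page.
class PostIndex {
    // term -> IDs of the posts containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();
    // post ID -> raw post text, kept so a post can be re-indexed after an edit
    private final Map<String, String> posts = new HashMap<>();

    // Assumed entry point: called once per post with the post's own text only.
    void indexPost(String postId, String text) {
        removePost(postId); // drop stale terms if the post was edited
        posts.put(postId, text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(postId);
        }
    }

    // Remove a post and every posting that points at it.
    void removePost(String postId) {
        if (posts.remove(postId) == null) {
            return; // never indexed, nothing to do
        }
        for (Set<String> ids : postings.values()) {
            ids.remove(postId);
        }
    }

    // Return the IDs of posts that contain every term in the query (AND search).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> ids = postings.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        PostIndex index = new PostIndex();
        index.indexPost("post-1", "Hacking on a search engine for Radio weblogs");
        index.indexPost("post-2", "Blogrolls and other junk around the posts");
        System.out.println(index.search("radio search"));
    }
}
```

Because the unit of indexing is the post, a hit points at one post instead of a page full of them, and the blogroll never gets near the index. A real engine would hand the tokenizing and ranking to something like Lucene; this only shows the shape of the thing.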
I'm doing it in Java, because that means I can use Lucene (Lupy is good, but not quite complete yet). Other useful bits are Tomcat for the web serving, HSQLDB for data storage, and Apache XML-RPC for communications. This is my first proper Java project (not the same one I started a few months back; that never went anywhere in the end), so I'm going through the learning curve at the same time, which makes things interesting. Let's see how this goes ...
---
1. At least I now know a fair bit about ht://Dig's htsearch module and its build process ...
2. (or blogs made with anything other than Movable Type, which defaults to archiving each post on a new page)