OK, I think I've cracked the problem of how to do a proper search engine for Radio weblogs. The problem with using an existing search engine (like ht://Dig, which I've wasted far too much time getting to understand) is that it indexes whole HTML pages. This is fine for most sites, but Radio blogs put many posts on a single page, and all blogs put heaps of junk (blogrolls, etc.) around the outside of the posts, making it easy for a search engine to get sidetracked in its search for real content.
So now I'm hacking away on a standalone search engine that will integrate with Radio at a much deeper level. The concept is that Radio will inform the engine directly of the content of posts, and it'll index that rather than trying to extract the info from the HTML directly.
This will result in search capabilities pretty much equivalent to Blosxom's, which are the best I've seen so far.
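To make the per-post idea a bit more concrete, here's a rough sketch of what indexing individual posts (instead of whole pages) could look like. This is not Lucene and not the real engine: it's a toy in-memory inverted index, and the `PostIndex` class, its `addPost` and `search` methods, and the post ids are all names I've made up for illustration. The only point is that the index is fed the text of each post, so the page template and blogroll never get anywhere near it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical per-post index. The names are illustrative only;
// this is neither Radio's nor Lucene's API.
class PostIndex {
    // term -> ids of the posts containing that term, in insertion order
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Called once per post with the post's own text
    // (no page template, no blogroll).
    public void addPost(String postId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(postId);
        }
    }

    // Returns the ids of posts that contain every term in the query.
    public List<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> ids = postings.getOrDefault(term, new LinkedHashSet<>());
            if (result == null) {
                result = new LinkedHashSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    // Lowercase the text and split on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        PostIndex index = new PostIndex();
        // Made-up post ids and text, just to exercise the index.
        index.addPost("post-1", "Hacking on a standalone search engine for Radio");
        index.addPost("post-2", "Notes on the ht://Dig build process");
        System.out.println(index.search("search engine")); // prints [post-1]
    }
}
```

Because a hit is a post id rather than a page URL, a result can point straight at the post that matched, even when a dozen posts share one HTML page.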
I'm doing it in Java, because that means I can use Lucene (... is good, but not quite complete yet). Other useful bits are Tomcat for the web serving, HSQLDB for data storage, and Apache XML-RPC for communications. This is my first proper Java project (I started one a few months back, but that never went anywhere in the end), so I'm going through the learning curve at the same time, which makes things interesting. Let's see how this goes ...
1. At least I now know a fair bit about ht://Dig's module and its build process ...
2. (or blogs made with anything other than Movable Type, which defaults to archiving each post on a new page)