Bordering on the ridiculous, PubSub's Hyperbole number exceeds 1.6 TRILLION at times.
It counts "matches" per day, where a "match" is a single test of whether an arriving feed entry satisfies a subscription. Somewhere over 300k subscriptions times ~1.5M feed entries per day gives them an average Hyperbole number of about 500 billion, with peaks up around 1.6 trillion (because people don't post evenly over 24 hours).
This is kind of cool, and it gives an idea of the magnitude of the problem PubSub deals with. Two things I'd like to see would be an indication of the number of unique subscriptions and the number of actual matches (i.e., the number of items that actually make it out into their search result feeds) they see per day.
My favourite part of his post: "If our algorithm wasn't sub-linear, we'd need massively more machines than we currently use for matching. (i.e. we'd need more than one...)". Hah!
I got to meet PubSub's Bob Wyman and Salim Ismail in person last night; they seem like nice guys.
Now all I need to do is resist the temptation to take Bob up on his challenge to write a similarly fast matching engine :-)
I can't resist thinking about the algorithm, though. His response to the first comment states that you're only allowed to match one message at a time. That means the sub-linearity has to come out of indexing the subscriptions. I guess that explains why they don't do traditional-style search as well: their algorithm only works "backwards".
Back of the envelope calculation time. 2 million entries per day is 23 per second on average, so you'd have to aim for at least 100 per second to handle (current) peak load. That gives you 10 ms per entry, or 24 million CPU cycles on Bob's desktop box. A linear algorithm would get 80 cycles per subscription, i.e. not enough. The average entry is probably 100 words or less.
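Spelling out that arithmetic (the 24 million cycles per 10 ms figure implies a ~2.4 GHz machine, which is my assumption here):

```python
# Back-of-envelope check of the numbers above, assuming a ~2.4 GHz CPU.
entries_per_day = 2_000_000
avg_per_sec = entries_per_day / 86_400              # ~23 entries/sec
target_per_sec = 100                                # headroom for peak load
budget_ms = 1_000 / target_per_sec                  # 10 ms per entry
cycles_per_entry = int(2.4e9 * budget_ms / 1_000)   # 24,000,000 cycles
cycles_per_sub = cycles_per_entry // 300_000        # 80 cycles if linear

print(round(avg_per_sec), budget_ms, cycles_per_entry, cycles_per_sub)
```

Eighty cycles isn't enough to evaluate even a trivial expression per subscription, which is what forces the sub-linear indexing approach.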
I'd start by making a hash table of all the words used in subscriptions. Assign each word a token ID. At the same time, compile all the subscriptions down to expression trees that refer to token IDs. Merge duplicate subscriptions.
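A minimal sketch of that compilation step (the names are my own, and a flat required/forbidden pair of token-ID sets stands in for a full expression tree):

```python
# Hypothetical sketch: intern each word used in any subscription to a
# small integer token ID, and compile each subscription down to sets of
# token IDs. This is NOT PubSub's actual representation.

token_ids = {}                  # word -> token ID

def token(word):
    """Intern a word, assigning a fresh token ID on first sight."""
    if word not in token_ids:
        token_ids[word] = len(token_ids)
    return token_ids[word]

def compile_subscription(query):
    """Compile a query like '+foo +bar -baz' into (required, forbidden)
    token-ID sets."""
    required, forbidden = set(), set()
    for term in query.split():
        (required if term.startswith('+') else forbidden).add(token(term[1:]))
    return frozenset(required), frozenset(forbidden)

# Merging duplicate subscriptions falls out of using the compiled form
# as a dictionary key:
subs = {}                       # compiled form -> subscriber IDs
for sid, query in [(1, '+foo +bar'), (2, '+foo -baz'), (3, '+foo +bar')]:
    subs.setdefault(compile_subscription(query), []).append(sid)

print(len(subs))                # → 2: subscriptions 1 and 3 were merged
```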
Presumably a subscription has to have at least one "required word" - otherwise it would match every entry that comes in. So you order the compiled subscriptions by their least common required word. Some indirection would be needed to cope with subscriptions that OR together a bunch of match expressions - or perhaps you could break those up into separate "virtual subscriptions" and follow them back to their parent subscription record when you get a match.
Now walk through the words in the entry and build a list of token IDs for every word that appears in some subscription. Sort them, or drop them into whatever associative container you have available. Then walk through that list and, for each token, find all the subscriptions filed under that word and evaluate their expression trees in the usual way.
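That walk might look something like this sketch, reusing the compiled (required, forbidden) form and anchoring each subscription under its smallest token ID as a stand-in for the least common required word (all names hypothetical):

```python
# Hypothetical sketch of the matching walk. A real engine would anchor
# each subscription under its *rarest* required word, so that fewer
# arriving entries touch each subscription; min() is just a stand-in.

token_ids = {'foo': 0, 'bar': 1, 'baz': 2}   # from the compilation step

subs = [
    (frozenset({0}), frozenset()),           # +foo
    (frozenset({0, 1}), frozenset()),        # +foo +bar
    (frozenset({0}), frozenset({1})),        # +foo -bar
]

index = {}                                   # anchor token -> subscriptions
for required, forbidden in subs:
    anchor = min(required)
    index.setdefault(anchor, []).append((required, forbidden))

def match_entry(words):
    # Collect token IDs for entry words that occur in any subscription,
    # then evaluate only the subscriptions anchored under those tokens.
    present = {token_ids[w] for w in words if w in token_ids}
    return [s for tok in present
              for s in index.get(tok, [])
              if s[0] <= present and not (s[1] & present)]

print(len(match_entry('foo bar quux'.split())))   # → 2 (+foo, +foo+bar)
```

Note that a word like "quux" that appears in no subscription costs only one hash lookup, which is where the sub-linearity in the subscription count comes from.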
This should be pretty fast, but it will bog down if you start getting a lot of subscriptions with complex expressions. The solution to that would be to merge similar expressions. Instead of just making a big list of all the unique subscriptions, compile all the subscriptions down into a big tree that mirrors the aforementioned ordered list of required words and has lists of subscriptions on the leaves of the tree.
This way, the searches (1) +FOO, (2) +FOO+BAR, (3) +FOO-BAR, (4) +FOO+BAZ, (5) +FOO+BAZ-BAR would result in a tree like this:
FOO (1) → BAR (2)
        → -BAR (3)
        → BAZ (4) → -BAR (5)
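Here's a hypothetical sketch of that merged tree for the five example searches, using nested dicts keyed by '+word'/'-word' terms (a stand-in for a real compiled tree):

```python
# Merge subscriptions into one tree so that searches sharing a prefix
# (here, +FOO) share the work of evaluating it. Names are illustrative.

def insert(tree, terms, sub_id):
    node = tree
    for t in terms:
        node = node.setdefault(t, {})
    node.setdefault('subs', []).append(sub_id)

tree = {}
for sid, terms in [(1, ['+FOO']),
                   (2, ['+FOO', '+BAR']),
                   (3, ['+FOO', '-BAR']),
                   (4, ['+FOO', '+BAZ']),
                   (5, ['+FOO', '+BAZ', '-BAR'])]:
    insert(tree, terms, sid)

def matches(tree, words):
    """Walk the tree, descending into a '+w' branch when w is present
    in the entry and a '-w' branch when it is absent; collect the
    subscription IDs found along the way."""
    hits = list(tree.get('subs', []))
    for term, child in tree.items():
        if term == 'subs':
            continue
        word, want_present = term[1:], term[0] == '+'
        if (word in words) == want_present:
            hits += matches(child, words)
    return hits

print(sorted(matches(tree, {'FOO', 'BAZ'})))   # → [1, 3, 4, 5]
```

An entry containing FOO and BAZ (but not BAR) matches searches 1, 3, 4, and 5 with a single pass over the shared +FOO prefix.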
For each word with a token ID, you'd have to evaluate the whole expression tree for that word and keep track (in another associative container) of the subscriptions that have been matched so far. 24 million CPU cycles per entry, or 240k per word, should easily be enough for this.
So there you go. Pretty fast, and probably not too hard to implement, either. I wonder how close this thought experiment got to PubSub's actual algorithm ...