Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).


Scaling Technorati

Dave Sifry is in damage control mode after the last couple of weeks' backlash against Technorati.

Interesting that keyword search is working better than URL search. I would have thought URL search would be the easier of the two to get running well -- and the easier to parallelize. It's still a pretty big problem, though:

1.4 million new posts per day, eh. If each post has an average of 5 links, that's 7 million links per day, or about 2.5 billion links per year. If the average link takes 100 bytes to store, that's roughly 250 GB per year. So storage isn't a big deal - especially when you normalise a little.

1.4 million posts evenly spread over 24 hours means about 16 posts per second, or 81 new links going into the database every second. Anyone got benchmarks on how many SELECTs per second you could do on a MySQL table containing 10 billion rows (1 TB) and taking 81 INSERTs every second?
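A quick sanity check on the arithmetic above (the 5-links-per-post and 100-bytes-per-link figures are the rough assumptions from the post, not measured numbers):

```python
posts_per_day = 1_400_000
links_per_post = 5      # rough assumption
bytes_per_link = 100    # rough assumption

links_per_day = posts_per_day * links_per_post        # 7,000,000
links_per_year = links_per_day * 365                  # ~2.5 billion
storage_per_year = links_per_year * bytes_per_link    # ~255 GB

seconds_per_day = 24 * 60 * 60
posts_per_second = posts_per_day / seconds_per_day    # ~16
inserts_per_second = links_per_day / seconds_per_day  # ~81
```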

Anyway, that's not how you'd do it. Each time a new link came in, you'd hash the URL, then assign it to a server based on the hash value. Then when someone does a URL search, you'd hash the query the same way and ask the right server about it. With enough servers, things would work fairly well.
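A minimal sketch of that partitioning scheme. The hash function and server count are arbitrary choices for illustration; the key property is that inserts and lookups for the same URL always land on the same server:

```python
import hashlib

NUM_SERVERS = 333  # illustrative; see the sizing estimate below

def server_for_url(url: str, num_servers: int = NUM_SERVERS) -> int:
    """Hash a URL and map it to a server index in [0, num_servers)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers
```

Both the indexer (on INSERT) and the search frontend (on SELECT) call the same function, so a URL search only ever has to ask one box.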

But - how many is "enough"? You'd probably want to keep the data on each server small enough to fit in memory. Ordinary boxes these days can probably take 3 GB, so 333 servers could handle a terabyte of links. Ouch! But then, Dave did say they just added 400 servers, so maybe this isn't so far off.
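The server-count arithmetic, under the same 3 GB-per-box assumption:

```python
total_link_data = 1_000_000_000_000  # 1 TB of link data
ram_per_server = 3_000_000_000       # ~3 GB usable per box (assumption)

# Ceiling division: how many boxes to hold it all in memory
servers_needed = -(-total_link_data // ram_per_server)  # 334, i.e. the ~333 above
```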
