A year or so ago I wrote some code to search the user list for aSmallWorld, and I keep meaning to write a general tool to do that sort of job for other sites.
Originally ASW was using a MySQL query like this:
SELECT * FROM member_info WHERE firstname LIKE '%$fn%' AND lastname LIKE '%$ln%'
Testing this out on a snapshot of their database, this takes about 1.5 seconds to search through the 129K rows in the
member_info table. How about if I copy the IDs and names into a memory table?
CREATE TABLE mem_inf_cache (custid INT, firstname VARCHAR(255), lastname VARCHAR(255)) ENGINE=MEMORY SELECT custid,firstname,lastname FROM member_info;
Executing the SELECT on the
mem_inf_cache table takes 0.28 sec to scan the table. This is still not particularly impressive. So let's dump the data out into a text file and see what
grep can do.
SELECT custid,firstname,lastname FROM member_info INTO OUTFILE '/tmp/asw_members.txt';
This produces a 2.8 megabyte text file containing user IDs and names. Running
time grep 'Pearson' /tmp/asw_members.txt tells me 0.013 s real, 0.002 s user and 0.011s system. That's about twice as fast as MySQL's memory table, presumably because we don't have the overhead of MySQL's internal table format - it's just one big string.
(Going back into MySQL and making a mem table with one big string for everything results in the query taking 0.19 secs - still not bad.)
grep's not that fun to use, though. How about if we load it into Python and do the same thing?
>>>s = open("/tmp/asw_members_outfile.txt").read()
>>> st = time.time(); s.find("Mark"); print time.time() - st
>>> st = time.time(); s.find("Marc\tCanter"); print time.time() - st
>>> st = time.time(); s.find("Phillip\tPearson"); print time.time() - st
>> st = time.time(); s.find("someone who doesn't exist"); print time.time() - st
The last search, which runs through the entire string, took about the same amount of time as the grep, which makes sense as grep was probably running out of the disk cache and Python's string.find function is implemented in C, so each did a similar amounts of work.
Now let's try doing it with a regex instead:
>>> st = time.time(); re.findall("Pearson", s); print time.time() - st
['Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson']
>>> st = time.time(); re.findall(".*?Pearson.*?", s); print time.time() - st
[..., '92478\tPhillip\tPearson', ...]
So it doesn't take much longer for a simple regex to run over the string than it does to do a straight find, although it gets expensive if you try to get the regex to pull out the full string for you.
How does this scale? If I go back to the Python code and run everything on 100 copies of the text (i.e. 287 megabytes), s.find("someone who doesn't exist") takes 1.48 s and re.findall("Pearson", s) takes 1.72 secs, so it's pretty linear - you can search 129K users in 0.015 s or 12.9M users in 1.5 s (i.e.: 8.6M users/s on an Athlon XP 2000+).
It would be nice to optimise a little further, as member searching is quite a popular activity, and the query volume will go up along with the data volume as you get more users.
I'm afraid there's no way to make it faster to search a large number of names than a small number of names, but you can scale better than linearly. One interesting thing is that out of the 129K users, there are only 19K unique first names and 19K unique last names. These numbers would hopefully increase slower than linearly with respect to the number of users, so you could build a unique list, run the substring search on that, then query into the main index to get the user IDs after you have full strings, which should run a bit quicker in MySQL.
That's all I have time for today. I'd like to spend some more time looking into the problem of 'advanced search' - searching for more than just names. This can be a pain to do in MySQL because you need to create so many indices to cater to the many different queries the user may select - indexing requirements are quite different for "firstname like '%phillip%' and age > 20 and age < 30 and most_often_in='christchurch'" and "lastname like '%pearson%' and age < 30 and country='new zealand' and industry='software'", for example. It's easy enough to do a linear search (like the grep approach above) for this, but I'd like to think about better-scaling alternatives.