Phillip Pearson - web + electronics notes: Selecting intersecting sets from MySQL

2005-2-24

Selecting intersecting sets from MySQL

Talking to Peter Van Dijck about SQL - specifically one thing he's working on for mefeedia.

He's got a couple of tables that look like this:

TABLE tags (id, name, userid, ...)

TABLE tags2movies (tagsid, moviesid)

... and he wants to be able to pull out all the movies that match a particular set of tags.

I don't know of an easy way to do this sort of thing quickly; it seems to boil down to the problem of searching for groups of characters in a huge list of strings.

Here are some possible ways you could do it:

1. Walk through the movies table, and for each movie, see if it matches the tags (one at a time). This is really slow - it's a linear search through the movies table, with many lookups into tags2movies. In MySQL, it would end up something like SELECT * FROM movies m, tags2movies t1, tags2movies t2 WHERE m.id=t1.moviesid AND t1.tagsid=123 AND m.id=t2.moviesid AND t2.tagsid=234.

2. Grab all movies matching one of the tags, then check each one to see if it matches the other tags. This is a bit quicker, because you don't need to think about a lot of the movies.

3. Find the least common tag out of everything, then do method (2), starting with that tag. (This is good if you put a 'count' column in the 'tags' table that remembers the number of movies matching that tag). This would be quite quick if the user includes an uncommon tag in the search, but slow if he searches for a movie including a bunch of common tags.

4. Pull all the movie IDs and tags for each movie out of the DB and store them in memory, then write some C or C++ code to quickly search through that. This is quick if you have less than a few hundred thousand movies. If less than that - say less than 50K - you can write the search engine in Python or some other 'fast' language you like and save some time.

The easy way

#2 looks like the easiest one to start with - you can probably do something like this:

CREATE TEMPORARY TABLE search_results (movie INT) SELECT moviesid FROM tags2movies WHERE tagsid=123;

DELETE s FROM search_results s,tags2movies t WHERE s.movie=t.moviesid AND t.tagsid=234 AND t.moviesid IS NULL;

(and then add another DELETE for each tag after the second one, if you want to search for more than two tags at a time).

I'm not sure if I got the join syntax right in the DELETE command - but what it's meant to do is look for a row in tags2movies with moviesid=s.movie and tagsid=234, and delete the row from search_results if it can't find a matching tags2movies row. You could do it this way instead (which would be faster if not many movies are tagged with tag 234, or slower if there aren't many movies in search_results and lots of movies are tagged with tag 234):

DELETE FROM search_results WHERE moviesid NOT IN (SELECT moviesid FROM tags2movies WHERE tagsid=234);

Once that's working, you might be able to speed it up (see method #3 above) by starting with:

SELECT id FROM tags WHERE id in (123, 234) ORDER BY movie_count;

... and then processing the tags in the order they come out -- so your temp table starts out populated with movies corresponding to the least common tag, meaning that it never needs to get too large.

The fast way (maybe)

If you end up having lots of popular tags, and the tags2movies table gets much bigger than the movies one, #4 might end up being the fastest way to search. It could be a bit of work to implement, but would let you do things like scoring and fuzzy matches in future.

Any suggestions?

Anybody else got any better ideas? My background is engineering, not computer science, so I tend to work things out from first principles. Unfortunately this can mean I miss out on being able to use an algorithm somebody else has come up with - and I bet there's a better way to do this sort of thing :)

... more like this: [MySQL]

Sleeping in Airports

A guide for the true budget traveller!

... more like this: [Links]