Re: Fast union/intersection of two sequences?

Jim Roskind (jar@infoseek.com)
Tue, 24 Jan 1995 00:37:49 -0800

> Date: Tue, 24 Jan 1995 02:47:38 -0500
> From: Skip Montanaro <skip@automatrix.com>
>
> Is there any faster way to compute the union/intersection of two sequences
> than by looping over one list testing for inclusion in the other? For
> example:
>
> l1 = [1, 2, 3, 4, 5]
> l2 = [1, 3, 5, 7, 9]
>
> union = l1
> intersection = []
> for e in l2:
> if e not in l1: union.append(e)
> if e in l1: intersection.append(e)
>
> This works fine for short sequences, but fails miserably for long
> ones.

Yup, there is an easy and efficient way in python :-)

I posted something about project growth a while back, and how getting
rid of the n-squared algorithms is about all Python usually needs to
be competitive in terms of speed. This is a fine example of how a
"simple approach" causes bad performance. Now that you've noted the
hot spot (a profiler would identify this in a heart beat), it is no
big problem to speed it up.

The problem is basically that if you have two lists of length M and N,
then (if you count carefully), you find that taking the intersection
as you have proposed (using "in" operator, which does a linear search
of the list) takes about M x N operations. When M and N are about
1000, then this is roughly 1,000,000 operations (yuck).

The cool trick in python is to use a dictionary, which employs what is
called hashing. For M and N roughly 1000, this will take about M + N
or roughly 2000 operations (which means you'll get something close to
a 500 times speed up!).

The code would look something link the following:

def intersection(list1, list2):
int_dict = {}
list1_dict = {}
for e in list1:
list1_dict[e] = 1
for e in list2:
if list1_dict.has_key(e):
int_dict[e] = 1
return int_dict.keys()

def union(list1, list2):
union_dict = {}
for e in list1:
union_dict[e] = 1
for e in list2:
union_dict[e] = 1
return union_dict.keys()

Once you've seen the above stuff, it starts to make more sense to keep
you "sets" as keys of a dictionary, rather than storing them in
arrays. Since you posed the problem with arrays as the container for
holding the lists, I gave the algorithm in terms of them.

If you were going to code the above problem in C, and you didn't want
to get into real esoteric algorithms, you probably use a hash table,
which is basically what dictionaries are built upon ;-). Hence by
using Python dictionaries, you get all the advantages of using a nice
implementation of a hash table, without any of the headaches of
getting the code right (in C/C++).

As an aside.... you could keep the sets as sorted lists (arrays), and
then an intersection or union wouldn't be that bad (merge-sort tricks
would apply), but doing it all using dictionaries is clearly the easy
(and efficient) way to go.

You also mentioned that you actually had elements that were instances
of classes. To make the above work (i.e., to use an element as a key
to a dictionary), you need to have a "hashable" class. This means
that you need to implement a "__hash__" method on the class, which
typically implies that the instances are pretty constant in contents
(it is not permissible for the value returned by a hash function on a
class to change over time... or at least over the time period that the
instance remains a key to a dictionary!). Since you had a "__cmp__"
method on the class, I'm guessing that you'll be able to make
everything fly with little trouble.

Jim

p.s., I didn't really run the above code, so I apologize in advance
for any typos.

-- 
Jim Roskind
InfoSeek Corporation
voice: 408.982.4469
fax: 408.986.1889
jar@infoseek.com
----------------------------------------------------------------------------
PGP 2.6.1 Key fingerprint =  0E 2A B2 35 01 9B 5C 58  2D 52 05 9A 3D 9B 84 DB 
To get my PGP 2.6 Public Key, "finger -l jar@infoseek.com | pgp -kaf"