RELATIONAL ALGEBRA IN 102 LINES (kjbuckets release 2) [longish]

Aaron Watters (aaron@funcity.njit.edu)
Tue, 28 Feb 1995 15:53:41 GMT

++--what next? [npq]: f
++...This program posts news to thousands of machines throughout the entire
++civilized world. [My] message will cost the net hundreds if not thousands of
++dollars to send everywhere... But what the heck do I care?

=====================================================================
ANNOUNCING KJBUCKETS RELEASE 2:

GET IT FROM ftp: ftp.cwi.nl /pub/python/kjbuckets.tar.gz

WHAT IT IS:
kjbuckets is a C extension to python which defines three Python data
types kjSet, kjGraph, and kjDict, implemented using a fast and space
efficient hash table strategy. The types are tightly coupled and may
be combined and manipulated using a collection of fast "set at a time"
operations written in C.

If you need to manipulate collections and mappings quickly
take a look at this module. It comes with no warrantee of course,
but it has been pounded pretty heavily and I consider it fairly
solid.

A SHORT EXAMPLE MODULE USING KJBUCKETS: SIMPLE RELATIONAL ALGEBRA

The following is a "relationally complete" implementation of
relational algebra, restricted to equality predicates. It
is both short (102 LOC) and fast -- I mean FAST. For example,
the natural join uses a hash-join algorithm which couldn't
be much faster without the use of precomputed indices.

[One reason it's short is that it doesn't check as many errors as it
should, doesn't allow "greater than" tests, etcetera, and selections
are restricted to simple conjunctions. All these things can be fixed
easily, but, hey, get kjbuckets and do it yourself!]

To appreciate this fully, you should have been tormented and confused
by an intro db instructor like myself (and judging by my midterms, I
confused them successfully this semester).

Features of kjbuckets used below include
kjSet(), kjDict(), kjGraph()
-- initializers with a variety of arguments allowed
X+Y, X-Y, X&Y, X*Y
-- union, difference, intersect, graph composition respectively.
X.add(e), D.remap(G), kjUndump(s1,s2), D.dump(s1), X.items()...
-- Hey! download kjbuckets and look it up in the documentation!
[After all, that's the whole point of this post!]
Most of the hard work in the implementation is done by the kjbuckets
module, the Python interpreter can just lean back and relax!

==============simple.py
# A simple implementation of the relational algebra
# using kjbuckets
from kjbuckets import *

def relFromDictSet(schemeseq, dictSet):
result = relation(schemeseq, [] )
result.rowset = dictSet
return result

class relation:

def __init__(self, schemeseq, listofrows):
self.schemeseq = schemeseq
self.scheme = kjSet(schemeseq)
rowset = kjSet()
for row in listofrows:
rowset.add(kjUndump(schemeseq, row))
self.rowset = rowset

def pprint(self):
print self.schemeseq
print "============"
for row in self.rowset.items():
print row.dump(self.schemeseq)

def addDicts(self, dictseq): # not used...
for dict in dictseq:
self.rowset.add(dict)

def checkUnionCompatible(self,other):
if self.scheme != other.scheme:
raise ValueError, "operands not union compatible"

# relational union
def __add__(self, other):
self.checkUnionCompatible(other)
return relFromDictSet(self.schemeseq, self.rowset + other.rowset)

# relational difference
def __sub__(self, other):
self.checkUnionCompatible(other)
return relFromDictSet(self.schemeseq, self.rowset - other.rowset)

# natural join (hash based algorithm)
def __mul__(self,other):
commonatts = self.scheme & other.scheme
resultset = kjSet()
if commonatts: # do a hash based join
dumper = tuple(commonatts.items())
selfgraph = kjGraph() # hash index for self
othergraph = kjGraph() # hash index for other
for row in self.rowset.items():
selfgraph[row] = row.dump(dumper)
for row in other.rowset.items():
othergraph[row.dump(dumper)] = row
for (selfrow, otherrow) in (selfgraph * othergraph).items():
resultset.add(selfrow + otherrow)
else: # no common attributes: do a cross product
otherrows = other.rowset.items()
for selfrow in self.rowset.items():
for otherrow in otherrows:
resultset.add(selfrow + otherrow)
return relFromDictSet( tuple((self.scheme + other.scheme).items()),
resultset )

# selection using a att->value pairs (as conjunction)
def vSel(pairs, rel):
selected = kjSet()
selector = kjDict(pairs)
if selector.Clean()!=None:
for row in rel.rowset.items():
if (row + selector).Clean() != None:
selected.add(row)
return relFromDictSet(rel.schemeseq, selected)

# selection using att = att pairs (as conjunction)
def eqSelect(pairs, rel):
selected = kjSet()
selector = kjGraph(pairs)
selector = (selector + ~selector).tclosure() # sym, trans closure
for row in rel.rowset.items():
if row.remap(selector) != None:
selected.add(row)
return relFromDictSet(rel.schemeseq, selected)

# projection on attribute sequence (as conjunction)
def proj(atts, rel):
attset = kjSet(atts)
resultset = kjSet()
for row in rel.rowset.items():
resultset.add(attset * row)
return relFromDictSet(atts, resultset)

# renaming using (new,old) pair sequence
def rename(pairs, rel):
renames = kjDict(pairs)
untouched = rel.scheme - kjSet(renames.values())
mapper = renames + untouched
resultset = kjSet()
for row in rel.rowset.items():
resultset.add(mapper * row)
return relFromDictSet(tuple(mapper.keys()), resultset)

=========== end of simple.py

Now let me show you the "simple" module in use. First we need some relations.
I'll steal C.J.Date's canonical/soporific supplier/parts database:

=========== simpletest.py
# database of suppliers, parts and shipments
# from Date, page 79 (2nd ed) or page 92 (3rd ed) */
from simple import *

#suppliers
S = relation(
('snum', 'sname', 'status', 'city'),
[ (1, 'Smith', 20, 'London'),
(2, 'Jones', 10, 'Paris'),
(3, 'Blake', 30, 'Paris'),
(4, 'Clark', 20, 'London'),
(5, 'Adams', 30, 'Athens')
])
#parts
P = relation(
('pnum', 'pname', 'color', 'weight', 'pcity'),
[ (1, 'Nut', 'Red', 12, 'London'),
(2, 'Bolt', 'Green', 17, 'Paris' ),
(3, 'Screw', 'Blue', 17, 'Rome' ),
(4, 'Screw', 'Red', 14, 'London'),
(5, 'Cam', 'Blue', 12, 'Paris'),
(6, 'Cog', 'Red', 19, 'London')
])
# shipments
SP = relation(
('snum', 'pnum', 'qty',),
[ (1, 1, 300),
(1, 2, 200),
(1, 3, 400),
(1, 4, 200),
(1, 5, 100),
(1, 6, 100),
(2, 1, 300),
(2, 2, 400),
(3, 2, 200),
(4, 2, 200),
(4, 4, 300),
(4, 5, 400)
])
=========== end simpletest.py

OK, now lets do some relational algebra queries
(prettified with a few extra blank lines for readability).

===========
% python
Python 1.1.1 (Feb 27 1995)
Copyright 1991-1994 Stichting Mathematisch Centrum, Amsterdam
>>> from simpletest import *
>>> # names and cities of suppliers
... proj(("sname","city"),S).pprint()

('sname', 'city')
============
('Smith', 'London')
('Jones', 'Paris')
('Blake', 'Paris')
('Clark', 'London')
('Adams', 'Athens')

>>> # part names of parts supplied by Blake
>>> proj(("pname",),vSel( ( ("sname","Blake"), ), S*SP*P)).pprint()

('pname',)
============
Bolt

>>> # supplier names and numbers where the supplier doesn't supply screws
... ( proj( ("sname","snum"), S) -
... proj( ("sname","snum"),
... vSel( ( ("pname", "Screw"), ), P*SP*S )
... ) ).pprint()

('sname', 'snum')
============
('Jones', 2)
('Blake', 3)
('Adams', 5)

=============
Have I wasted enough bandwidth today? Hardly! I have 50 exams
to grade, so you'll probably hear from me again before the day is out.
Aaron Watters
Department of Computer and Information Sciences
New Jersey Institute of Technology
University Heights
Newark, NJ 07102
phone (201)596-2666
fax (201)596-5777
home phone (908)545-3367
email: aaron@vienna.njit.edu
-----
There once was a man from Nantucket...
[Whoops, maybe not... how about this:]
There once was a man from Japan
Whose limericks never would scan
when told this was so
he said "Yes, I know,
But I always try to put as many words in the last line as I possibly can."