Improvements to the Dbm module and a new module Dbhash

Michael McLay (mclay@eeel.nist.gov)
Wed, 23 Mar 94 20:15:46 EST

The file ftp.eeel.nist.gov:/pub/python/dbhash.tar.gz contains the BSD
hash database library, a new persistent associative array class
(Dbhash), and an extension module, hashmodule.c. The ndbm emulation
functions found in the BSD library does not have the block size limit
of the standard UNIX ndbm library, so simply substituting the BSD
libdb.a archive eliminates the most common cause of the dbm.error:
Cannot add item to database.

The new class, Dbhash, which is a knockoff of Dbm.py, use the BSD hash
database functions instead of the ndbm emulated interface. The change
eliminates the string size limit of Dbm for almost all situations.
(Objects from a few bytes to hundreds of kilobytes can be store in the
same file, if configuration parameters are selected correctly.
Problems can develop when extreme values are used in parameters.) The
dbhash module adds several performance tuning parameters that are
available through the BSD hash interface. (See the paper on hash in
the BSD doc directory.)

Using hash increases performance slightly over ndbm and storage
efficiency is also improved. Storage overhead does degrades as the
size of dbhash grows to around 100k entries, but for databases in the
10k element range the overhead is reasonable. The space overhead is
about 1.54 for 1000 elements when objects average about 11.6k bytes.
The overhead jumps to over a factor of four by 100k entries.

Using the technique similar to Dbm.py, the Dbhash.py module places a
thin wrapper over a builtin module named dbhash. The wrapper allows
the Dbhash objects to use any built-in type as an key or value in the
the database, not just strings. One additional change in Dbhash has
improved read performance considerably over Dbm. Instead of using ``
and eval() to encode and evaluate the key and value into dbhash, the
Dbhash class uses marshal.dump and marshal.load. A test was conducted
dumping and restoring 1000 instances of a complex object[1]. Dumping
with 'array' and loading with eval(array) took:

stringify 33.0 seconds
eval 132.7 seconds

Using marshal.dump to dump the same array and then reloading using
marshal.load takes:

dumps 24.7 seconds
loads 8.7 seconds

Startup time for python might be reduced if it Dbhash were used to store
compiled .py files instead of .pyc files. The change would eliminate
opening multiple .pyc files to execute a single script. Programs
should execute a bit faster, a benchmark test indicated that using a
single file might reduce start-up time by 10%.

My apologies for the messy tar file. The hashmodule was built in the
Modules directory because of problems I had trying to build it under
the Extensions directory. The db directory was placed under the
Modules directory and the following line was added to the Setup file.

dbhash hashmodule.o -Idb/include -Idb/hash -Idb/PORT/sunos.4.1.3/include -Idb/PORT/sunos.4.1.3 ./db/PORT/sunos.4.1.3/libdb.a

The db/PORT/sunos.4.1.3 Makefile must be executed before making
python. This configuration should probably be reworked to eliminate
the dependence on an architecture specific PORT directory and to
possibly make libdb.a automatically when python is built.

[1] The following object was used to for the complex comparison
tdata = ((['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}),(['list',2,'of',0.4],[3,4.6,3,'mixed',7],{'bob':'smith'}))

Michael J. McLay
National Institute of Standards and Technology
Bld 220 Rm A357 (office), Bld 220 Rm B344 (mail)
Gaithersburg, Maryland 20899, (301)975-4099