Providing Persistence for
World-Wide-Web Applications

Jim Fulton, Digital Creations, L.C., jim@digicool.com

Abstract

As applications evolve beyond simple document retrieval, the need arises for efficient storage of persistent information that can be managed through the World-Wide-Web (Web) by customers and end users. We have developed a collection of modules that support management of persistent objects on the Web. The modules build on the Python pickle mechanism and provide support for:

The modules currently use a very simple data format. All data changes are made by appending new revisions to the end of a file, reducing opportunities for file corruption.

Introduction

Digital Creations, L.C., creates Web-based applications with rich user interfaces and with many options that may be set by customers. In the past, options were set in configuration files that were read by applications on start-up. Similarly, customers could customize the presentation of information by editing document templates. This approach had a number of drawbacks:

Customers need to be able to make incremental changes to a system, seeing the results of individual changes without changes being visible to end-users. When customers are satisfied with a set of changes, they need a way to make the changes "live". This capability amounts to a need for long-running transactions.

Customers want a way to be able to recover from mistakes made, through some sort of "oops" facility.

Some applications use data sets that are periodically uploaded by customers, such as classified ads or automotive dealer inventories. This process typically involves several steps such as:

This process had most of the same problems as the process for managing configuration files.

To address the needs described above in a more effective manner, we decided to investigate the use of a persistent object system that could be used by applications to provide on-line configuration and data management. It was desirable to find a solution that would not incur the overhead of additional licensing fees or of an extra software package that needed to be installed and managed. The Python pickle mechanism provided much of what was needed to develop a persistent object store, so we decided to attempt to develop a pickle-based mechanism.

Requirements

To address the problems described above, we set out to create a system that would satisfy the following requirements:

Storage and retrieval of object data, using an object-oriented mechanism
We did not want to have to define a series of database tables to reflect the various types of objects in the system, because we expected to have many different kinds of objects and we expected data structures to change from time to time, and possibly from object to object within the same class.
Storage of multiple versions of the same object
This is needed to support long-running transactions and to support reverting to earlier revisions.
Separate activation, deactivation, and update of sub-objects
It is common for all of the objects in an application to be connected in a single data structure. For example, an application might include a collection of newspapers, each of which contains a collection of automobile dealers, each of which has inventory, and so on. It would be inefficient to load and update all of this data at once.
A simple yet robust file format that would survive write errors and system failures.
The file format needed to be tolerant of various types of failures without significant loss of data. For example, a write error or system failure when writing or updating an individual record should not cause other records to become inaccessible.
Object identity
Each object has a unique identifier within a persistent store that supports reliable cross-object references, separate management of sub-objects, and association between multiple revisions of the same object.

Approach

We have designed a collection of modules, called a Bobobase, that build on the pickle module to provide keyed storage of objects using a simple data format. The class architecture of the system is shown in figure 1.

Pickle Jars

At the heart of the mechanism is the PickleJar class that manages the storage of picklable objects. When an object is stored, it is converted to a pickle string and stored in a simple database. When an object is retrieved, the pickle string is retrieved from the simple database and unpickled.

Every object stored in a pickle jar is assigned an object identifier (OID). Persistent objects include this OID in their state. A pickle jar provides a mapping interface that maps OIDs to objects.
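The store-and-retrieve cycle described above can be illustrated with a minimal sketch. The class name SketchPickleJar, its store method, and the use of a plain dictionary as the simple database are assumptions made for illustration; they are not the actual PickleJar interface.

```python
import pickle


class SketchPickleJar:
    """Hypothetical stand-in for a pickle jar: maps OIDs to objects."""

    def __init__(self):
        # The "simple database" only needs to map OIDs to pickle strings;
        # a plain dict is used here purely for illustration.
        self._db = {}
        self._next_oid = 0

    def store(self, obj):
        """Assign a new OID, pickle the object, and store the pickle string."""
        oid = self._next_oid
        self._next_oid += 1
        self._db[oid] = pickle.dumps(obj)
        return oid

    def __getitem__(self, oid):
        """Fetch the pickle string for an OID and unpickle it."""
        return pickle.loads(self._db[oid])

    def __setitem__(self, oid, obj):
        """Store a new pickle string under an existing OID."""
        self._db[oid] = pickle.dumps(obj)


jar = SketchPickleJar()
oid = jar.store({"keywords": ["python", "web"]})
restored = jar[oid]
```

Note that this sketch unpickles on every access; the caching described later avoids that cost.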

Persistent Objects

If an object is a subclass of the Persistent class, then the storage and retrieval of the object occurs in two steps. When the object is stored, the pickle string contains two pickles. The first pickle includes the class and initial arguments for the object, but not the object's state. The object's state is stored in the second pickle. When the object is retrieved, the pickle string is retrieved from the database and the first pickle in the pickle string is used to restore the object using saved initial arguments. Object state that is not captured through initial arguments is not restored initially. Any state created by the object's constructor is also hidden so that initial attempts to access state will fail. When an attempt is first made to access or update the object's state, the state will then be automatically restored from the pickle jar. By deferring the loading of object state in this manner, the loading of sub-objects is deferred until needed.
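As a rough illustration of the two-pickle layout and deferred state loading, consider the sketch below. All names (LazyRecord, CLASSES, dump_two_part, load_two_part) are hypothetical. For simplicity, the first pickle records a class name that is looked up in a small registry, which is an assumption of this sketch rather than a description of how Bobobase records the class.

```python
import io
import pickle


class LazyRecord:
    """Hypothetical persistent class whose state is loaded on first access."""

    def __init__(self, name):
        self.name = name  # restored from the saved initial arguments

    def __getattr__(self, attr):
        # Python calls __getattr__ only when normal lookup fails, which is
        # the case for any state attribute that has not been loaded yet.
        pending = self.__dict__.pop("_pending_state", None)
        if pending is None:
            raise AttributeError(attr)
        self.__dict__.update(pickle.loads(pending))  # activate the object
        return getattr(self, attr)


CLASSES = {"LazyRecord": LazyRecord}  # illustrative registry (an assumption)


def dump_two_part(class_name, init_args, state):
    """Return one byte string holding two consecutive pickles."""
    return pickle.dumps((class_name, init_args)) + pickle.dumps(state)


def load_two_part(data):
    """Rebuild the object from the first pickle; defer the second."""
    stream = io.BytesIO(data)
    class_name, init_args = pickle.load(stream)  # first pickle only
    obj = CLASSES[class_name](*init_args)
    obj.__dict__["_pending_state"] = stream.read()  # second pickle, unread
    return obj


data = dump_two_part("LazyRecord", ("jim",), {"email": "jim@digicool.com"})
record = load_two_part(data)
loaded_before = "email" in record.__dict__
email = record.email  # first access triggers loading of the state
loaded_after = "email" in record.__dict__
```

The state pickle is not parsed until an attribute lookup fails, so sub-objects referenced from the state are not touched before they are needed.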

Persistent objects have a __save__ method that can be called to update the state of the persistent object in a pickle jar. Persistent objects keep track of state changes. If an object has changed since the last time it was retrieved from or saved in a pickle jar, then calling the __save__ method causes the object to save its updated state. When saving an object's state, persistent sub-objects are not saved unless their state has been changed or unless they have not yet been saved in the pickle jar.

Transactions

Although persistent objects keep track of state changes, they do not automatically save their state when changes occur. Changes are not saved until the object's __save__ method is called. For large object systems, it is not practical for an application to explicitly call the __save__ method on all interesting objects. To automate saving object state, a transaction mechanism has been introduced.

A transaction is a sequence of operations, or program, that has the following properties:

Serializability
If there are multiple concurrently running transactions, the results of the transactions are equivalent to results from the same transactions run serially.
Atomicity
A transaction either runs to completion, or any persistent effects of the transaction are undone.

These transaction properties make transactions extremely important tools for providing reliable data management, because they free the programmer from important details of concurrency control and error recovery.

The Transactional and Transaction Manager classes are used to provide transaction semantics to persistent objects. Objects that are subclasses of both Transactional and Persistent have transactional semantics. When the state of a transactional persistent object changes, the object automatically registers itself with the current transaction manager. When a transaction is committed, the __inform_commit__ method is called on any registered objects. This method, in turn, calls the __save__ method on the registered objects. If a transaction is aborted, then the __inform_abort__ method is called on registered objects. This causes any saved changes to be undone.

Transaction managers manage transactions. An application may have a single transaction manager. When the transaction manager is installed, a special function, get_transaction, is installed in the __builtins__ module. The get_transaction function retrieves a transaction object. The transaction object has methods begin, commit, and abort to begin, commit, or abort a transaction. These methods are called by the application program. In the case of Python Object Publisher (Bobo) applications, these routines are called automatically by Bobo. At the beginning of each HTTP request, Bobo gets a transaction and calls its begin method. If the request is completed successfully, then Bobo calls the transaction's commit method, causing all changes made during the request to be made permanent. If an error occurs, the transaction's abort method is called to make sure that any changes made by the request are undone.

Transactional objects call a transaction's register method to register the fact that their state has been changed by the transaction.

To date, only one transaction manager, SingleThreadedTransaction, has been implemented. As its name implies, it has been designed to support single-threaded applications. The module SingleThreadedTransaction exports a Persistent class that is a subclass of the Transactional class and the ordinary Persistent class.
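The registration, commit, and abort flow described in this section can be sketched roughly as follows. The classes SketchTransaction and SketchTransactional, and the method names inform_commit and inform_abort, are simplified stand-ins chosen for illustration. Saving is simulated by keeping an in-memory snapshot rather than by writing to a pickle jar, and get_transaction is defined as an ordinary module-level function.

```python
class SketchTransaction:
    """Hypothetical single-threaded transaction that tracks changed objects."""

    def __init__(self):
        self._registered = []

    def register(self, obj):
        # Each changed object is registered once per transaction.
        if not any(obj is other for other in self._registered):
            self._registered.append(obj)

    def begin(self):
        # Uncommitted changes made before begin are discarded.
        self.abort()

    def commit(self):
        for obj in self._registered:
            obj.inform_commit()
        self._registered = []

    def abort(self):
        for obj in self._registered:
            obj.inform_abort()
        self._registered = []


_current = SketchTransaction()


def get_transaction():
    """Return the single current transaction (one per process here)."""
    return _current


class SketchTransactional:
    """Registers itself with the current transaction on attribute changes."""

    def __init__(self):
        self.__dict__["_saved"] = {}

    def __setattr__(self, name, value):
        self.__dict__[name] = value
        get_transaction().register(self)

    def inform_commit(self):
        # Stands in for saving: snapshot the current state as the saved state.
        state = {k: v for k, v in self.__dict__.items() if k != "_saved"}
        self.__dict__["_saved"] = state

    def inform_abort(self):
        # Discard uncommitted changes by restoring the saved state.
        saved = self.__dict__["_saved"]
        self.__dict__.clear()
        self.__dict__.update(saved)
        self.__dict__["_saved"] = saved


counter = SketchTransactional()
counter.hits = 1
get_transaction().commit()  # hits == 1 becomes the saved state
counter.hits = 2
get_transaction().abort()   # the uncommitted change is undone
hits_after_abort = counter.hits
```

The application code never calls a save method directly; it only assigns attributes and lets the transaction decide when changes become permanent.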

Simple Databases

Pickle jars manage the storage and retrieval of objects using a simple database that stores pickle strings. A simple database is an object that maps OIDs to strings. Simple databases can be implemented in a number of ways, such as:

We have currently implemented a simple database, called a multiple revision simple database (MRSDB) that stores pickle strings in file records. Each revision of an object is stored in a different record. Each record in a MRSDB contains several pieces of information:

When an object is updated, a new record is added. The record containing the previous version of object's pickle string is unchanged. This approach has a number of advantages:

A disadvantage of this approach is that file space for old revisions must be recovered through an explicit pack operation.

There is no separate index file. When a MRSDB is opened, an index that maps OIDs to current revision file positions is built by reading the header, that is, everything except the pickle string, from each record. This approach has the advantage that there is no separate index file to get out of synchronization with the data file. It has the disadvantage that reading the record headers may be time-consuming. In long-running processes, however, long start-up times are not critical.
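The append-only storage and the index built at open time can be sketched as follows. The record layout used here (an 8-byte OID followed by an 8-byte length and then the data) is an assumption made for illustration and is not the actual MRSDB record format; the function names are likewise hypothetical.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">QQ")  # assumed layout: OID, then data length


def append_revision(path, oid, data):
    """Append a new record; earlier revisions are never overwritten."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(oid, len(data)))
        f.write(data)


def build_index(path):
    """Scan record headers only, mapping each OID to its latest revision."""
    index = {}
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # end of file, or a truncated trailing header
            oid, length = HEADER.unpack(header)
            position = f.tell()
            if position + length > size:
                break  # incomplete final record: keep earlier revisions
            index[oid] = (position, length)  # later records win
            f.seek(length, os.SEEK_CUR)      # skip the data itself
    return index


def read_current(path, index, oid):
    """Read the data of the current revision of an object."""
    position, length = index[oid]
    with open(path, "rb") as f:
        f.seek(position)
        return f.read(length)


path = os.path.join(tempfile.mkdtemp(), "sketch.mrsdb")
append_revision(path, 1, b"first revision")
append_revision(path, 2, b"another object")
append_revision(path, 1, b"second revision")
index = build_index(path)
current = read_current(path, index, 1)
```

Because the scan seeks past each data section, start-up cost grows with the number of records rather than with the total size of the stored pickles.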

Caching

Pickle jars maintain a cache containing references to all retrieved objects that are referenced by objects outside the cache. On every access to the pickle jar, incremental garbage collection is performed by removing objects from the cache whose reference counts have dropped to one, and therefore are referenced only by the cache.

It is necessary to hold references to all living objects in the cache so that database changes can be reflected in living objects.

Of course, use of a cache improves performance by preventing the unpickling of objects on each access.
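The effect of such a cache can be approximated in current Python with weak references. Note that this swaps in a different technique from the one described above: rather than holding strong references and inspecting reference counts on each access, the sketch lets a weakref.WeakValueDictionary drop entries once no outside references remain. The names SketchCache, Thing, and load are hypothetical.

```python
import gc
import weakref


class Thing:
    """Placeholder for an unpickled persistent object."""

    def __init__(self, oid):
        self.oid = oid


class SketchCache:
    """Keeps one live object per OID for as long as something references it."""

    def __init__(self, loader):
        self._loader = loader                       # called on a cache miss
        self._live = weakref.WeakValueDictionary()  # OID -> live object

    def get(self, oid):
        obj = self._live.get(oid)
        if obj is None:
            obj = self._loader(oid)  # stands in for unpickling
            self._live[oid] = obj
        return obj

    def __len__(self):
        return len(self._live)


loads = []


def load(oid):
    loads.append(oid)
    return Thing(oid)


cache = SketchCache(load)
first = cache.get(7)
second = cache.get(7)        # served from the cache, no second load
same_object = first is second
del first, second            # no outside references remain
gc.collect()
size_after_release = len(cache)
```

Returning the same object for repeated lookups of an OID is what allows a database change to be reflected in every place that holds the object.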

Pickle Dictionaries

Pickle jars provide a relatively low-level interface. Pickle jars map OIDs to objects, but for many applications, it is much more natural to map names to objects. Pickle dictionaries provide for persistent storage of objects by key. For performance reasons, keys are limited to marshalable objects.

In addition to providing persistent storage, pickle dictionaries simplify the assembly of pickle jars and MRSDBs by providing a simple constructor taking a file name to create or open a MRSDB, and create associated pickle jars and dictionaries.

Usage

Using this persistence mechanism is straightforward and, when used with a transaction manager, nearly transparent. Application classes must simply subclass from Transactional and Persistent. Consider the Keywords class shown below, which manages persistent collections of keywords:

from SingleThreadedTransaction import Persistent
from STPDocumentTemplate import HTMLFile
from PSA_Admin import admin_groups


dtml_dir='../private/'
MessageDialog=HTMLFile(dtml_dir+'MessageDialog.dtml')

class Keywords(Persistent):
    'PSA keywords'

    def __init__(self):
        self.items=[]

    manage__allow_groups__=admin_groups
    manage=HTMLFile(dtml_dir+'Keywords.dtml')
    manage_edit__allow_groups__=admin_groups
    def manage_edit(self, added=[], deleted=[], PARENT_URL=''):
        'change keywords'
        if deleted:
            self.items=filter(lambda k, deleted=deleted:
                              k not in deleted,
                              self.items)
        if added:
            self.items=self.items+filter(lambda k: k, added)
            self.items.sort()

        return MessageDialog(
            title='Keywords Successfully Updated',
            message=(
                '''
                <strong>%s keywords were successfully added or deleted</strong>
                ''' % (len(added)+len(deleted))),
            action=PARENT_URL+'/manage',
            )

    def __len__(self): return len(self.items)
    def __getitem__(self,index): return self.items[index]


In this example, the Keywords class implements a persistent sequence of keywords. To be persistent, it was only necessary to subclass SingleThreadedTransaction.Persistent. SingleThreadedTransaction.Persistent is a subclass of SingleThreadedTransaction.Transactional and PickleDictionary.Persistent.

In addition to making sure that application classes are transactional persistent, it is unfortunately necessary to assure that sub-objects of application objects:

In the Keywords class above, the sub-object, items, is mutable, but it is used immutably. All changes to the items attribute are made by reassigning the attribute. This, in turn, causes the Keywords object's change in state to be registered with the transaction manager.

If one were to directly access and change a Keywords object's items attribute, the change would not be persistent, because list objects are not transactional persistent and the change would not be registered with the transaction manager.
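The difference between reassigning an attribute and mutating a sub-object in place can be seen in a small sketch. The ChangeTracking class below is a hypothetical stand-in that merely sets a flag where a transactional persistent object would register itself with the transaction manager.

```python
class ChangeTracking:
    """Hypothetical class that notices attribute assignment only."""

    def __init__(self):
        self.__dict__["changed"] = False

    def __setattr__(self, name, value):
        self.__dict__[name] = value
        self.__dict__["changed"] = True  # stands in for registration


keywords = ChangeTracking()
keywords.items = ["python"]
keywords.__dict__["changed"] = False  # pretend the object was just saved

keywords.items.append("web")          # in-place mutation: __setattr__ never runs
changed_after_append = keywords.changed

keywords.items = keywords.items + ["bobo"]  # reassignment: change is noticed
changed_after_reassign = keywords.changed
```

The append goes straight to the list object, so the owning object has no opportunity to observe it. This is why the Keywords class always builds a new list and assigns it to self.items.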

In addition to making sure that application classes are transactional persistent, it is also necessary to store application objects in pickle dictionaries or in transactional persistent objects that have already been stored in dictionaries. Consider the example below:

import Keywords, Members
from SingleThreadedTransaction import PickleDictionary

DB=PickleDictionary("../private/var/PSA")

if not DB.has_key('Keywords'):
    DB['Keywords']=Keywords.Keywords()
    get_transaction().commit()

if not DB.has_key('Members'):
    DB['Members']=Members.Members()
    get_transaction().commit()


This example is taken from a module that constitutes the "main" module for a Python Software Association (PSA) Bobo application. This application defines a collection of modules that provide PSA services such as defining PSA keywords, providing a PSA membership directory, and so on. In the example, the modules that provide keyword and membership services are imported. A pickle dictionary is opened to hold persistent objects. If the pickle dictionary did not exist before, then Keywords and Members objects are created and placed in the pickle dictionary.

Note that after placing each object in the pickle dictionary, the current transaction is committed. This is necessary because at the beginning of each HTTP request, Bobo begins a new transaction. Any uncommitted changes made prior to calling the transaction's begin method are aborted, undoing the changes. Therefore we commit the object additions to ensure that they are permanent.

We saw in an earlier example that the Keywords class implements persistent sequences of keywords. The Members class also implements a collection, but as a mapping from member identifiers to Member objects rather than as a sequence. This allows individual members to be accessed through Bobo. For example, to access member jim, one might use the URL:

http://www.digicool.com/PSA/Members/jim


Here the persistent member, jim, is accessed in the persistent collection of members, Members.

Persistent mappings are so commonly used that a class, PersistentMapping, is exported by the SingleThreadedTransaction module to simplify implementation.

Experience

Bobobase has been used in a number of Digital Creations products including an automotive advertising product, a relational database access product, and a demonstration product being developed for the PSA. The use of Bobobase has greatly simplified application development, has allowed us to provide applications with rich through-the-Web configuration interfaces, and has simplified product installation.

Issues

A number of issues were encountered in this effort. These are discussed below:

Extension types

Until recently, the pickle module did not provide support for extension types. We provided a patch to the pickle module which allows extension types to be pickled if they have __class__ attributes that are bound to class-like objects that can be used to recreate object instances. We are releasing an "Extension Class" mechanism which will provide the needed behavior to extension types that make use of it.

Detecting object access

To implement the lazy activation of objects described here, it is necessary to know when an object is being accessed. This is done in Bobobase by overriding the __getattr__ and __setattr__ methods. Unfortunately for Bobobase, __getattr__ is called only when a lookup fails. If an attribute can be obtained from a class, then the class's value will be returned without calling __getattr__. This can have serious consequences if the not-yet-loaded state of an object has a value for the attribute.
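The following sketch reproduces the problem in isolation. The Ghost class is hypothetical: its unloaded state is kept in a dictionary that __getattr__ merges into the instance on the first failed lookup. A class attribute with the same name as a stored attribute masks the stored value until some other access happens to activate the object.

```python
class Ghost:
    """Hypothetical unloaded object: state stays in `_unloaded` until needed."""

    title = "class default"  # class attribute that shadows unloaded state

    def __init__(self, unloaded):
        self.__dict__["_unloaded"] = dict(unloaded)

    def __getattr__(self, name):
        # Runs only when normal lookup (instance, then class) has failed.
        state = self.__dict__.get("_unloaded") or {}
        if name not in state:
            raise AttributeError(name)
        self.__dict__.update(state)  # activate: load all deferred state
        state.clear()
        return self.__dict__[name]


ghost = Ghost({"title": "stored title", "body": "stored body"})
title_before = ghost.title  # found on the class, so __getattr__ is skipped
body = ghost.body           # lookup fails, so __getattr__ loads the state
title_after = ghost.title   # the instance value now takes precedence
```

The same attribute therefore yields two different values depending on whether the object has been activated, which is the inconsistency described above.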

Solving this problem will require re-implementing the persistence mechanism as an extension class, since extension __getattr__ methods are always called. Implementing the persistence mechanism as an extension will also overcome the performance cost of calling a Python __setattr__ function each time an attribute is set. Work is underway on an "Extension Class" that allows Python classes to subclass from extension classes.

To help avoid this problem in Bobo applications, Bobo currently takes steps to assure that any object encountered while traversing a URL is activated.

Detecting sub-object changes

As mentioned earlier, problems can arise if persistent sub-objects have non-persistent mutable sub-objects, because sub-object changes might not be detected. A solution might be to restrict sub-objects to persistent or immutable objects and to enforce the restriction with tests in __setattr__. Such a test might be quite expensive in the current implementation, but may be feasible in an extension-based implementation.
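One possible shape for such a test is sketched below. This is only an illustration of the proposed restriction, not an implemented Bobobase feature; the class SketchPersistent, the helper is_safe_subobject, and the particular list of types treated as immutable are all assumptions.

```python
IMMUTABLE_TYPES = (type(None), bool, int, float, complex, str, bytes)


class SketchPersistent:
    """Hypothetical base class that rejects unsafe sub-objects on assignment."""

    def __setattr__(self, name, value):
        if not is_safe_subobject(value):
            raise TypeError(
                "%s must be persistent or immutable, got %s"
                % (name, type(value).__name__)
            )
        self.__dict__[name] = value


def is_safe_subobject(value):
    """Accept persistent objects, immutable scalars, and tuples of safe values."""
    if isinstance(value, (SketchPersistent,) + IMMUTABLE_TYPES):
        return True
    if isinstance(value, (tuple, frozenset)):
        # Containers are safe only if everything inside them is safe.
        return all(is_safe_subobject(item) for item in value)
    return False


parent = SketchPersistent()
parent.title = "Keywords"
parent.tags = ("python", "web")
parent.child = SketchPersistent()
try:
    parent.items = ["mutable", "list"]  # a plain list is rejected
    rejected = False
except TypeError:
    rejected = True
```

The recursive check over container contents hints at why the test could be expensive when it runs in Python on every attribute assignment.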

Deactivating objects

Currently, objects are automatically deactivated when they are no longer referenced. It would be desirable to also deactivate objects when they have not been used for some period of time. Unfortunately, in the current implementation, it is impossible to detect many object accesses, because the __getattr__ function is only called when normal attribute access fails. The extension-based implementation will allow all access times to be recorded. With access time information available, it will be possible to add logic to the pickle jar cache to deactivate the state of objects that have not been accessed in some period of time. An extension-based cache implementation will also be needed for efficiency.

Long startup time

In the current MRSDB implementation, the database index is constructed when the file is opened. For long-running processes, the delay associated with building this index is not problematic. For short-running processes such as CGI scripts, however, the time taken to read large databases can be quite noticeable. Regular database compaction can reduce the delay by reducing the number of record headers that must be read. An extension-based implementation of MRSDBs, or of the function that creates the index, should reduce startup time substantially. We also plan to cache header information in a separate file that can be read on startup, so that only records added since the cache was created need to be scanned.

Multiple threads and long-running transactions

Although the MRSDB format should accommodate simultaneous reading and writing, this feature has not been implemented. A design allowing separate single writer and reader threads and supporting a simultaneous long-running transaction and reads should be straightforward. Allowing multiple simultaneous long-running transactions is a bit more involved. We are considering a design using time-stamp transaction protocols.

Support for other data stores, such as relational databases

Our largest customer has made a significant commitment to relational database technology and would like to take advantage of their investment in this technology. Relational database management systems (RDBMSs) provide some significant advantages over our home-grown Bobobase, such as:

RDBMSs also have some disadvantages, such as:

We plan to investigate use of RDBMS tables as simple databases. The biggest challenge may be detecting when objects in the database have changed, so that in-memory objects can be updated.

MRSDB format issues

The MRSDB format provides most of the data needed to track changes to a database. However, it would also be desirable to record user names. Since all updates should be authenticated, user names should be available.

To increase the survivability of MRSDBs to file corruption, we plan to update the MRSDB data format to record string lengths at the end of each record as well as in the record header.
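The benefit of a trailing length can be shown with a small sketch. The layout below (an 8-byte length, the data, then the same 8-byte length again) is an assumed simplification and omits the other header fields of a real MRSDB record; the function names are hypothetical.

```python
import io
import struct

LENGTH = struct.Struct(">Q")  # assumed 8-byte length field


def write_record(stream, data):
    """Write the length before and after the data."""
    stream.write(LENGTH.pack(len(data)))
    stream.write(data)
    stream.write(LENGTH.pack(len(data)))


def read_records(stream):
    """Yield records until the data ends or a record fails its length check."""
    while True:
        head = stream.read(LENGTH.size)
        if len(head) < LENGTH.size:
            return
        (length,) = LENGTH.unpack(head)
        data = stream.read(length)
        tail = stream.read(LENGTH.size)
        if len(data) < length or tail != head:
            return  # truncated or corrupted record: keep the earlier ones
        yield data


buffer = io.BytesIO()
write_record(buffer, b"alpha")
write_record(buffer, b"beta")
intact = buffer.getvalue()
damaged = intact[:-3]  # simulate a write that was cut short

recovered_intact = list(read_records(io.BytesIO(intact)))
recovered_damaged = list(read_records(io.BytesIO(damaged)))
```

A mismatch between the two lengths marks the point at which the file stops being trustworthy, while every record before that point remains readable.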

Summary

Bobobase provides a nearly transparent mechanism for adding persistence to Python applications. Support for transactional semantics provides increased reliability. The open architecture provides room for a number of extensions such as support for additional data storage mechanisms.

Although we have not yet met all of our original requirements, notably support for long-running transactions, we expect to add this functionality in the near future.