Persistent Objects Spec: A case study :-)

Jim Roskind (jar@infoseek.com)
Tue, 2 Aug 1994 18:16:25 -0700

Since I haven't heard any comments on my spec, I'll assume it was a)
too long; b) too unreadable; c) too cryptic; d) ...

I thought I might get folks more warmed up by talking about the
functions/methods in the class from a historical design perspective.
This might give some more motivation to the issue, and allow others to
chime in as I stray from their view of reality. I'll try to pose a
series of questions and concerns, and note how we resolved them. The
questions will be posed by me, and then answered by me (meaning I get
to talk to myself, and still assert I'm not crazy ;-) ).

I'll take as a given that folks want to get persistence, as defined to
mean objects with lifetimes greater than the duration of a single
process.

--------------------------------------------------
1) What do I have to do to make a class be persistent?

To make a class persistent, all you have to do is add PersistentObject
to the list of base classes for your class. Your initializer for the
class (__init__()) must explicitly call the PersistentObject base
class. Finally, as mentioned in the spec, your class must have a
constructor that takes no arguments (use varargs if you don't already
have such a constructor), and you should be sure that your class
definition is in the outermost scope of your module.

For example, to make MyPer be persistent, I'd use code like:

class MyPer(PersistentObject):
    def __init__(self):
        PersistentObject.__init__(self)
        ...

Once you've done the above to a class, the class will inherit the
property of persistence (across process boundaries).
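
A slightly fuller sketch of the same recipe, showing the varargs trick for
a class whose constructor would normally want arguments. It is written in
present-day Python syntax. The PersistentObject here is only an empty
stand-in so the example runs on its own, and the Account class and its
slots are made up for illustration:

```python
class PersistentObject:
    """Empty stand-in for the real base class (assumption for this sketch)."""

    def __init__(self):
        pass


# Defined in the outermost scope of the module, as the spec asks.
class Account(PersistentObject):
    def __init__(self, *args):
        # Always call the base class initializer explicitly.
        PersistentObject.__init__(self)
        # Varargs keep Account() callable with no arguments, so the
        # persistence machinery can build a blank instance to fill in.
        self.owner = args[0] if len(args) > 0 else None
        self.balance = args[1] if len(args) > 1 else 0


blank = Account()            # the kind of call a loader would make
full = Account("jar", 100)   # the kind of call application code would make
```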

--------------------------------------------------
2) How do I get to the instances in future processes?

If an object is persistent, you can get its persistent name by calling
GetID() on that object. To gain a reference to the object in a future
process, you should call the Find() method on any object of the class.

a = MyPer() # make an object
print a.GetID() # print out its name

In a future process, to reincarnate the above instance:

a_born_again = MyPer().Find("the name that printed above")

Notice that in the creation of a_born_again, we had to call a
constructor MyPer() just to make a dummy instance so that we could
call the Find() method. (Too bad Python doesn't have static member
functions :-) ).

If you have a name that you'd rather use, you can apply that name to
an object, and later use that name to reincarnate the object. For
example:

root = MyPer()
root.SetAlias("Root Of All My Data") # We *chose* this name :-)

Then in a later process, you can resurrect the above object via:

root_born_again = MyPer().FindAlias("Root Of All My Data")

Basically, if *you* want to use a name of your choice, use SetAlias()
and correspondingly FindAlias() methods. If you don't care, because
it is all being handled automatically (and you want to avoid the
consequences of a name clash), then use GetID() and Find().
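
To make the two naming schemes concrete, here is a toy model in present-day
Python syntax. It is not the real implementation: the "persistent store" is
just a pair of in-memory dictionaries (_STORE and _ALIASES are invented
names), the generated name format is made up, and an explicit Save() call
stands in for whatever happens at process exit:

```python
import itertools

_STORE = {}     # persistent name -> saved slots (stand-in for the real store)
_ALIASES = {}   # user-chosen alias -> persistent name
_counter = itertools.count(1)


class PersistentObject:
    def __init__(self):
        self._pid = None

    def GetID(self):
        # Automatically generated name; the format is an assumption.
        if self._pid is None:
            self._pid = "%s-%d" % (self.__class__.__name__, next(_counter))
        return self._pid

    def SetAlias(self, alias):
        _ALIASES[alias] = self.GetID()

    def Save(self):
        name = self.GetID()
        _STORE[name] = dict(self.__dict__)

    def Find(self, name):
        # Called on a dummy instance; returns a filled-in instance.
        obj = self.__class__()
        obj.__dict__.update(_STORE[name])
        return obj

    def FindAlias(self, alias):
        return self.Find(_ALIASES[alias])


class MyPer(PersistentObject):
    def __init__(self):
        PersistentObject.__init__(self)
        self.payload = None


root = MyPer()
root.payload = "hello"
root.SetAlias("Root Of All My Data")
root.Save()

# What a "later process" would do, given only the alias:
root_born_again = MyPer().FindAlias("Root Of All My Data")
```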

--------------------------------------------------
3) I don't understand what the "Find()" method is for. More
specifically, why don't you just use a constructor to resurrect an
object from the persistent store (given its persistent name)?

This question is most interesting, because it came up in our paper
design, before we had done any implementing. We *wanted* to use a
constructor that returns the reincarnated object, but when we tried to
implement it, we found it was impossible to do so! Note that when you
are resurrecting objects (i.e., given a persistent name, and trying to
load the object into RAM), there is a certain chance that the object
will *already* have been loaded into RAM! A constructor *always*
returns a new object, and hence *can't* return a reference to an
existing object. We could be ashamed that we missed this obvious
problem in our paper design, but it seems easier to note it, and
explain the problem ;-).

As an example of where this problem takes place *very* commonly,
consider two distinct instances of persistent objects Inst1 and Inst2.
Suppose that a member of Inst1 points to Inst2, and a member of Inst2
points to Inst1. When the find method is called via `Find("Inst1")`,
this first instance is mostly loaded into RAM. The one missing piece
is the reference to Inst2. Fortunately, loading Inst2 is automagically
induced by another Find() (in order to flesh out Inst1), and hence a call
`Find("Inst2")` is made. While filling out Inst2 in RAM, it is noted
that a reference to *some* other object is present, and so a call to
the method `Find("Inst1")` is performed to finish fleshing out Inst2.
It is critical that this latter Find() return a ref to the *existing*
preloaded (but incomplete) Inst1. In contrast, if these Find()
methods were actually constructors, then we would have created too
many objects. :-(.

--------------------------------------------------
4) Why can't you use id() as a mechanism for getting a persistent name
of an object? Why do you use the GetID() method?

I commented on this in a prior posting. The first problem is one of
uniqueness, and the second issue is the potential for optimization of
the underlying persistent store. Note that id() is currently defined
to be the address of an object within a single process. As a result,
this number may a) be reused within a process at a later time; b) be
reused across process boundaries. Hence using id() to provide a
persistent name is an unsafe approach (the names are not unique).

Looking for the potential for GC related optimization, it was an
interesting revelation for us that IF a persistent name of an object
is not provided (i.e., requested), then there is *no* reason to save
the object to a persistent store. Since the id() function keeps no
records of when it is called (for specific objects), there is no way
to tell which objects need to *really* be saved out to the persistent
store when the process terminates. As a result, it was desirable to
have either a function or a method that *remembered* that it was
called on a specific object.
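
A minimal sketch of a GetID() that "remembers", in present-day Python
syntax. The _NAMED table, the objects_to_save() helper, and the name
format are all invented for illustration. A real implementation would
also need names that stay unique across processes, which this in-memory
counter does not provide:

```python
import itertools

_counter = itertools.count(1)
_NAMED = {}   # persistent name -> object whose name has been handed out


class PersistentObject:
    def __init__(self):
        self._pid = None

    def GetID(self):
        if self._pid is None:
            # Unlike id(), a counter value is never handed out twice here.
            self._pid = "obj-%d" % next(_counter)
            _NAMED[self._pid] = self   # remember that the name escaped
        return self._pid


def objects_to_save():
    # Only objects whose names were requested can ever be asked for again.
    return list(_NAMED.values())


a = PersistentObject()
b = PersistentObject()   # nobody ever asks for b's name
name = a.GetID()
```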

--------------------------------------------------
5) Why do you use a Find() *method*, which processes the result of a
GetID() *method* to resurrect objects, instead of a global Find()
*function*? To put it another way, why is it necessary to know the
class of an object before you can resurrect an object?

There are a number of potential enhancements that come from having
methods rather than functions. By using methods, it is possible for
"smarter" classes to override basic Find/GetID methods. For example,
some very small immutable objects can provide vastly improved
performance by providing a name (via GetID()) that actually
encapsulates the entire state of the object, and then the
corresponding Find() can parse this name and resurrect an equivalent
object. We have found it to be a non-problem that we need to know the
class of an object before reincarnating it (polymorphic lists are
handled automagically when they appear within objects).
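
As an illustration of the "small immutable object" case, here is a
made-up Point class, in present-day Python syntax, whose name carries its
whole state. The PersistentObject stand-in is empty and the "Point:x,y"
name format is an assumption of this sketch:

```python
class PersistentObject:
    """Empty stand-in for the real base class (assumption for this sketch)."""

    def __init__(self):
        pass


class Point(PersistentObject):
    def __init__(self, *args):
        PersistentObject.__init__(self)
        self.x, self.y = args if args else (0, 0)

    def GetID(self):
        # The name *is* the state, so nothing needs to reach the store.
        return "Point:%d,%d" % (self.x, self.y)

    def Find(self, name):
        # Parse the name back into an equivalent object.
        prefix, coords = name.split(":")
        x, y = coords.split(",")
        return Point(int(x), int(y))


name = Point(3, -4).GetID()
p = Point().Find(name)
```

A global Find() function would have no way to know that names of this
shape should be parsed rather than looked up.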

--------------------------------------------------
6) Why do you have a Save() method, rather than a function that
takes a ref to the object needing saving?

As with the last question, the potential for optimization by "smarter"
objects appeared. When some of the other methods are overridden (such
as Find/GetID) it is commonly necessary to change whether a real
store-of-state is performed. With the specific optimization listed in
the last question, it is necessary to totally skip salvation. In
other cases (see next question) it is useful to *sometimes* skip
really writing an object to the persistent store. In yet other
examples, it is "convenient" to clean up an object before it is
shipped off to the persistent store. This can include general house
cleaning (removal of temporary slots) or even slightly reforming an
object to be more compact (example: removing internal cached info).
Most simply put, the Save() method is a last-ditch-hook that is given
control just before the actual salvation is performed.

It is also very convenient to have a Save() method for each object, so
that (potentially) users can force a premature save to persistent
store. This starts to get to an area that our design has not covered
as well as we would like, and that involves parallel access to
writable persistent objects (re: object locking, etc.)
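
The "house cleaning" use of Save() might look like the following sketch,
in present-day Python syntax. The dictionary _STORE stands in for the
persistent store, and the Report class with its cached total is invented
for the example:

```python
import itertools

_STORE = {}   # persistent name -> saved slots (stand-in for the real store)
_counter = itertools.count(1)


class PersistentObject:
    def __init__(self):
        self._pid = None

    def GetID(self):
        if self._pid is None:
            self._pid = "obj-%d" % next(_counter)
        return self._pid

    def Save(self):
        name = self.GetID()
        _STORE[name] = dict(self.__dict__)


class Report(PersistentObject):
    def __init__(self, *args):
        PersistentObject.__init__(self)
        self.rows = list(args)

    def total(self):
        # Derived value, cached in a temporary slot.
        if not hasattr(self, "_cached_total"):
            self._cached_total = sum(self.rows)
        return self._cached_total

    def Save(self):
        # Last-ditch hook: drop the temporary slot, then do the real save.
        self.__dict__.pop("_cached_total", None)
        PersistentObject.Save(self)


r = Report(1, 2, 3)
first_total = r.total()
r.Save()
saved = _STORE[r.GetID()]
```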

--------------------------------------------------
7) What is a "WriteOnceObject?" What are they for?

"WriteOnceObject" is a class that can be used as a base class in many
cases in place of PersistentObject. An instance of a class derived
from WriteOnceObject is just a persistent object, that the programmer
agrees to "update" only in a certain way (at a certain time). The
easiest explanation of what they are and what they are for, comes from
their heritage...

The lifetime of a persistent object includes its construction in one
process, its salvation at the end of this first process, its
resurrection in a future process, its salvation at the end of that
process, ... . After using persistent objects in this mode, we
noticed that *some* classes of objects are actually only modified
during the first process (i.e., in the process in which the objects
are originally constructed). It also turned out (from examining
profiles of runs) that loading and saving objects (into files) was a big
expense. The obvious fix was to skip the saves in any process after
the first process. We call such objects "WriteOnceObject"s because
they are written to the persistent store exactly once. Note that the
programmer has to generally agree to *not* change the objects in
future process incarnations, as the changes will not be saved back
into the persistent store (we currently don't go to the expense to
monitor and validate the programmer's adherence to this restriction).

The bottom line, then, is that a WriteOnceObject is part of a
performance optimization that is available to select classes of
persistent objects.

Once this optimization was isolated, additional implementations of
WriteOnceObject were constructed to further reduce object storage and
retrieval costs. For example, one of the "nicest" ones bundles
together all WriteOnceOb