Re: Persistent Objects Spec: A case study :-)

Jim Roskind (jar@infoseek.com)
Wed, 3 Aug 1994 11:53:36 -0700

> Date: Wed, 3 Aug 94 11:41:38 EDT
> From: tnb2d@brunelleschi.cs.virginia.edu
>
> I do have one question: Why is the distinction between
> write-once and write many necessary?

First, thanks for asking a question!!!! :-)

It is strictly a performance optimization and hence "not necessary,"
but the partitioning across this boundary (write-once vs write many)
turned out to have *many* important optimization implications. I'll
talk about some in a moment, but the general rule is that you should
develop using PersistentObject, and *then* optimize. Some
optimizations can be achieved by merely substituting in
WriteOnceObjects. There *is* one big anti-deadlock related issue that
comes out of WriteOnceObjects, but since I haven't dealt too much with
object locking, the impact of this feature is given very little press
(but I'll also talk about that aspect in a second).

> It seems to me that all you need
> is a dirty bit within each instance, and if an object hasn't been
> updated it needn't be saved, right?

Yes, but then you would have to figure out if the object has changed.
If you don't have internal hooks into Python, it is difficult to tell
if the dictionary associated with an instance has changed. If you
*do* have internal hooks in Python, there are still some
difficulties... Note that "changed" is actually more than a check for
"shallow" changes (such as modification of a value), it includes deep
changes. For instance, if we started with an instance:

Inst1.a = 5

and we ended the process with:

Inst1.a == 6

then clearly we need to save. Suppose we started with:

Inst1.a = [1,2,3]

and at the end of the process, this member still pointed to the same
list, but the list had changed :-(.

Inst1.a.append(4)
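To make the two cases concrete, here is a minimal sketch of a naive "dirty bit" that is set whenever an attribute is assigned. The class DirtyTracking and its mark_clean() method are made up for this illustration and are not part of the PersistentObject code. The bit catches the rebinding case but never sees the in-place append:

```python
class DirtyTracking:
    """Hypothetical class (not from the spec) that flags itself as dirty
    whenever one of its attributes is assigned."""

    def __init__(self):
        # Bypass our own hook so that construction does not mark us dirty.
        object.__setattr__(self, "dirty", False)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        object.__setattr__(self, "dirty", True)

    def mark_clean(self):
        object.__setattr__(self, "dirty", False)


inst1 = DirtyTracking()

# Shallow change: the attribute is rebound, so the hook fires.
inst1.a = 5
inst1.mark_clean()
inst1.a = 6
caught_rebind = inst1.dirty

# Deep change: the attribute still refers to the same list object,
# and append() never passes through __setattr__.
inst1.a = [1, 2, 3]
inst1.mark_clean()
inst1.a.append(4)
caught_append = inst1.dirty
```

Here caught_rebind ends up true and caught_append ends up false, even though the list now holds four elements.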

Note that lists are mutable objects, and can change. As a result of
such complexities, it is hard to imagine an automatic "dirty bit"
being set by some internal machinations in Python :-( . Given the
above complexities, the only real way to tell if an object changed is
to try to see if its Prepr() is basically the same. (Note that Prepr
"stops" when it reaches other persistent objects, and hence is the
"right" level of depth to consider. Unfortunately, the standard
"hash" would go too deep :-(, and would not traverse mutable objects.)

It turns out that there are two large costs associated with saving the
state of an object. The first is the evaluation of the Prepr, and the
second is the file system access (to write the data). If we had to do
the Prepr to decide if we needed to write, then a large portion of the
savings would vanish. I have thought about using this technique for
diminishing the file activity for write-many objects, but have not yet
found it worth implementing (I do a lot of profiling on our
application to decide what is and isn't worth doing). I can easily
believe that it will be worth doing for other applications (which have
more write-many objects hanging around).
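A rough sketch of that unimplemented technique, under several assumptions: TinyStore is a hypothetical in-memory stand-in for the file system, and serialize() is a crude substitute for Prepr (it uses repr() on the instance dictionary and, unlike Prepr, does not stop at other persistent objects). The point is only that the write can be skipped, while the serialization cost is paid on every save:

```python
class TinyStore:
    """Hypothetical in-memory stand-in for the file-system store."""

    def __init__(self):
        self.files = {}    # oid -> text that was "written to disk"
        self.loaded = {}   # oid -> text as of load time or last save
        self.writes = 0    # number of simulated file writes

    def serialize(self, obj):
        # Crude substitute for Prepr: repr() of the sorted attributes.
        return repr(sorted(vars(obj).items()))

    def register(self, oid, obj):
        # Remember what the object looked like when it was loaded.
        self.loaded[oid] = self.serialize(obj)

    def save(self, oid, obj):
        text = self.serialize(obj)        # this cost is always paid
        if self.loaded.get(oid) == text:
            return False                  # unchanged: skip the write
        self.files[oid] = text
        self.loaded[oid] = text
        self.writes += 1
        return True


class Record:
    pass


rec = Record()
rec.a = [1, 2, 3]

store = TinyStore()
store.register("rec", rec)

first = store.save("rec", rec)    # nothing changed since load
rec.a.append(4)                   # deep change, same list object
second = store.save("rec", rec)   # detected via the serialized text
third = store.save("rec", rec)    # unchanged again
```

Only the second save() performs a write, so store.writes ends at 1. All three calls still run serialize(), which is why this saves file activity but not Prepr time.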

As an aside, I should note that both of the performance problems
listed *might* go away in the future. All the code for evaluating the
Prepr of an object is written in Python. I expect that if this were
recoded in C, then it would be *many* times faster. Similarly, the
current implementation uses the file system to store objects. Using a
true database would probably (hopefully?) result in a manyfold speed
increase in the read/write operations.

> This probably kills the
> bundle-as-one-object batch saves you spoke about,

Yes, the "bundle" approach only works if you are guaranteed that an
object can't change. Moreover, the bundling action is most beneficial
with lots of small objects, and really is a performance trick to beat
the file system costs down to a reasonable level.
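As a loose illustration of the idea (not the actual bundling code), the sketch below writes a hundred small, unchanging records into a single file instead of one file each. The helper names save_bundle and load_bundle, the JSON format, and the file name are all assumptions chosen for the example:

```python
import json
import os
import tempfile


def save_bundle(path, objects):
    """Write many small objects with a single file-system write."""
    with open(path, "w") as f:
        json.dump(objects, f)


def load_bundle(path):
    """Read the whole bundle back in one go."""
    with open(path) as f:
        return json.load(f)


# One hundred tiny records that will never change after creation.
records = {"rec%d" % i: {"value": i} for i in range(100)}

with tempfile.TemporaryDirectory() as tmp:
    bundle_path = os.path.join(tmp, "bundle.json")
    save_bundle(bundle_path, records)
    files_written = len(os.listdir(tmp))
    restored = load_bundle(bundle_path)
```

If any one of the bundled objects could change later, the whole bundle would have to be rewritten (or the object pulled out of it), which is why this only pays off for objects that are guaranteed not to change.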

> but would mean not
> having to remember whether or not an object is write-once and thus any
> changes would discarded.

Yes. An "automated" system would lessen the programmer's
responsibilities. If you want the rapid prototyping capability, then
you should skip using WriteOnceObjects and get guaranteed correctness
via PersistentObjects until you are ready to optimize ;-).

> It would also mean that ANY object could be
> not-saved/updated if it didn't need it.

Yupper. :-) Your proposal is a general optimization that could be
applied to all objects, including arbitrary write-many objects.

> I see this as distinct from
> whether or not it should be persistently stored. If I ask for a ref to
> a persistent object that is already in the store, and during my run I
> don't change it, I shouldn't have to save it at the end, right?

Note that it gets "saved" automagically. If the machinations were in
place, an unchanged (deep sense) object would not go through the
entire save process.

I think you can probably see that what you are saying is "right," but
it is hard to figure out (on the fly) if an object is changing. There
are *many* situations where the design of a system can *clearly*
identify a number of objects as "write-once," and using this info
simplifies the implementation a *great* deal. :-)

I also mentioned early on that there was an interesting facet of
WriteOnceObjects that appears when multiple threads are accessing
objects asynchronously. In general, when a PersistentObject is loaded
into RAM, the underlying persistent representation should be "locked."
When the process that has loaded the object is "finished" then the
data should (if needed) be written back to the persistent store, and
the object should be unlocked. Note that using this model, at most
one process can have an object loaded into RAM at any one time.
Extreme care must be taken in this regard to prevent deadlocks.

In contrast, when a WriteOnceObject is loaded into RAM, there is an
effective contract that indicates that it will *NOT* be modified in
RAM, and hence it is not necessary to lock the persistent
representation :-). This feature will tend to motivate designs to use
WriteOnceObjects, as they can avoid deadlock contention problems, and
allow multiple simultaneous accesses to the same object.
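The contrast between the two locking models can be sketched as follows. This is an assumption-laden toy, not the real implementation: LockingStore, put(), load() and release() are hypothetical names, the "persistent store" is just a dictionary, and load() returns None rather than blocking when a write-many object is already loaded elsewhere:

```python
import threading


class LockingStore:
    """Toy store: write-many objects are locked while loaded,
    write-once objects are handed out without any lock."""

    def __init__(self):
        self._data = {}
        self._locks = {}
        self._write_once = set()

    def put(self, oid, value, write_once=False):
        self._data[oid] = value
        if write_once:
            self._write_once.add(oid)
        else:
            self._locks[oid] = threading.Lock()

    def load(self, oid):
        """Return the value, or None if a write-many object is
        currently loaded (locked) by someone else."""
        if oid in self._write_once:
            return self._data[oid]      # contract: never modified
        if not self._locks[oid].acquire(blocking=False):
            return None
        return self._data[oid]

    def release(self, oid, value):
        """Write a write-many object back and unlock it."""
        self._data[oid] = value
        self._locks[oid].release()


store = LockingStore()
store.put("config", (1, 2, 3), write_once=True)
store.put("counter", [0])

reader_a = store.load("config")
reader_b = store.load("config")     # second simultaneous access is fine

holder = store.load("counter")      # takes the lock
blocked = store.load("counter")     # None: already loaded elsewhere
holder[0] += 1
store.release("counter", holder)    # write back and unlock
after_release = store.load("counter")
```

A real system would block or queue instead of returning None, and that waiting is exactly where the deadlock risk comes from. The write-once path never waits on anything.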

Jim

Jim Roskind
voice: 408.982.4469
fax: 408.986.1889
jar@infoseek.com