Flattening Python Objects

(Guido van Rossum, 14-17 November 1994)

(Revised: 1 December 1994)

Origin of the name 'flattening'

Because I want to leave the original 'marshal' module alone, and Jim complained that 'serialization' also means something totally different that's actually relevant in the context of concurrent access to persistent objects, I'll use the term 'flattening' from now on. Monty Python fans can associate this with 16 ton weights, while snake lovers can think of a Python's favorite way to kill its dinner. (The Modula-3 system uses the term 'pickled' data for this concept. They have probably solved all problems already, and in a type-safe manner :-)

File formats

While I intend to rewrite this code in C for efficiency, I will create a compatible prototype in Python first. With the efficient version in mind, I would like to use a binary file format. However, for debugging, it is much easier if the file format is ASCII. I'm not sure what I'll do yet. I expect that in either case it will be possible to gain a large factor in space by applying a general data compression algorithm such as gzip to the binary files -- just not as much as for flattening algorithms that generate Python source code (since source code is more voluminous than a specially-designed encoding).

Security

The unflattener should not use Python's eval() function or exec statement. If it does, it is vulnerable to attacks that present it with executable Python code that does something undesired. A second kind of attack: if (like the marshal module) the flatten module could transfer code objects, and if (unlike the marshal module) it could also transfer function or method objects, an attacker might be able to smuggle an object into an unflattening program that has a method which is called implicitly by the interpreter, such as __repr__ or __getattr__. This may be prevented by not flattening code, function or method objects at all.

Abstract flatten/unflatten operations

The lowest level interface to the flatten module will support the following operations:

The flattening and unflattening algorithms are designed together since they must be each other's logical mirror image. The flattening algorithm uses a dictionary mapping object id's to objects. It uses recursion too. The unflattening algorithm uses a similar dictionary and a stack. This doesn't contradict the goal that they be mirror images: using a stack is equivalent to recursion.

Instruction stream

The flattened data stream contains single-byte 'instructions' and multi-byte 'data items'. In the debugging (prototype) version of the flattener, instructions and data items are printable ASCII strings and are delimited by newline characters. A more efficient version may use binary encodings. We will have to see whether this saves enough time and space to be worth the effort (and worth the resulting lack of debuggability of flattened objects).

The virtual machine interpreting the instruction stream uses two data structures: a stack and a dictionary. The stack is used as intermediate storage; many instructions push data on the stack or pop it off. When the instruction stream ends, the top of the stack contains the object of interest. (There shouldn't be anything else left on the stack.) The dictionary is used as a way to refer to objects by their 'local name'.

Local names are integers. They are meaningful only within one flattened data stream. Not all objects need to have a local name: numbers, for instance, may remain anonymous. Objects that occur only once in the flattened object also needn't have a local name. It is up to the flattener to decide whether to assign local names to objects -- it may optimize for space by assigning local names sparingly. It is also up to the flattener to choose a name for an object. The prototype uses the object's id() but for the meaning of the flattened stream this is not important, as long as local names are unique. Note that without local names, data sharing cannot be represented -- and a special case of data sharing is recursion.

We also define a way to refer to objects with global names. These are not defined by the flatten module but by its caller. Global names are strings. The interpretation of global names is up to the caller. I propose that the following instructions be defined. I will give them symbolic names, which makes it easier to reason about them. Note that there are no flow control instructions -- the instruction stream is interpreted strictly sequentially. This is done so that the instruction stream can be read off a network connection, for instance.

PUT local_name
Store the object on the stack top as 'local_name'; leave it on the stack.
GET local_name
Push the object 'local_name' onto the stack.
PERSID persistent_id
Push the object 'persistent_id' onto the stack.
POP
Discard the object on the stack top.
DUP
Push the object on the stack top onto the stack.
NONE
Push the object 'None' onto the stack.
INT value
Push the integer 'value' onto the stack.
LONG value
Push the long integer 'value' onto the stack.
FLOAT value
Push the floating point number 'value' onto the stack.
STRING value
Push the string 'value' onto the stack.
MARK
Push a unique 'marker' onto the stack. Markers are used to create tuples, lists etc. without having to specify a count.
TUPLE
Create a tuple from the stack entries until the topmost marker (the deepest stack entry becomes the first tuple item). Everything up to and including the marker is removed from the stack and the new tuple is pushed onto it.
LIST
Create a list from the stack entries until the topmost marker (the deepest stack entry becomes the first list item). Everything up to and including the marker is removed from the stack and the new list is pushed onto it.
DICT
Create a dictionary from pairs of stack entries until the topmost marker (the deepest stack entry becomes the first dictionary key, the entry immediately above it the corresponding value, and so on). Everything up to and including the marker is removed from the stack and the new dictionary is pushed onto it.
INST module_name class_name
Create an argument list by taking the stack entries until the topmost marker. Then instantiate class 'class_name' defined in module 'module_name', passing these arguments to the constructore. Everything up to and including the marker is removed from the stack and the new instance is pushed onto it.
Classes can define what arguments should be used by defining a discipline __getinitargs__ which should return a tuple. The default argument list is an empty tuple.
APPEND
Append the object on the stack top to the list object immediately underneath it. The stack top is popped, leaving the list object on top of the stack.
SETITEM
The object on the stack top is a value, the object immediately underneath it is a key, and the object underneath that is a dictionary. Store the value in the dictionary under the key, popping the value and the key from the stack, leaving the dictionary object on top of the stack.
BUILD
The object on the stack top is an argument, the object immediately underneath it is a class instance. If the instance's class defines a __setstate__ discipline, it is called with the argument as parameter. if not, a default 'setstate' operation is invoked. After the 'setstate' operation, the argument is popped off the stack, leaving the class instance on top of the stack.
The default 'setstate' operation requires a dictionary as argument. It takes the keys of the dictionary as slot names and for each key sets the instance slot to the correponding value in the dictionary. I'm not sure if this should use setattr(inst, key, value) or inst.__dict__[key] = value (there is a difference if the class has a __setattr__ discipline!).
Classes can define an __getstate__ discipline which should return an argument for the __setstate__ discipline. The default 'setstate' operation returns the instance's __dict__ dictionary.

Questions

Prototype

The following is a tentative implementation, the files 'flatten.py' and 'pos.py'. It doesn't do anything about persistent id's yet. It finds a class's module name by scanning all modules for classes (and caching the results).