Invertible id(), or maybe hash the world, or ?

Jim Roskind (jar@infoseek.com)
Mon, 4 Apr 1994 23:41:49 +0800

I have a problem that I've now had to face in two programs I've
written. Since I'm new at this language, I thought perhaps someone
could tell me how I *should* be doing what I'm doing. Doing it my
current (possibly wrong) way really makes we want to have a function
that is the inverse of id(). Can someone tell me what I'm missing? I
*could* write an invertible function in C to do the trick, but I feel
I should try to work within the (Python) system.

The problem first appeared while I was writing a profiler. I needed a
way to associate values with specific functions. Each function was
identified with a code object instance, that can be gotten from a
frame instance. Specifically, the code object is "frame.f_code". The
natural thing (I thought) was to use associative arrays, and store
data:

func_data[frame.f_code] = d1, d2, d3

The perfect use of associative arrays (I was telling myself :-) ).
Alas, the bad news came when I tried to execute the above code, and I
found out that "code objects are not hashable." Hence, I *cannot*
index an associative array with a code object (oh poo!). :-(

One way to index the associative array is to use the id() of the code
object, and simply do:

func_data[id(frame.f_code)] = d1, d2, d3

but alas, when I'm done with my profiling, I'd have no way to figure
out what the func_data.keys() actually related to (in terms of
function name etc). I could save the relationship:

code = frame.f_code
func_data[id(code)] = d1, d2, d3
inverse_id[id(code)] = code

but I would have wasted a *lot* of time building up this silly
inverter relationship. (This section of code is *very* critical to
the profiler, as it is part of handling each event). I could bundle
the inverter relation in with the basic array:

func_data[id(frame.f_code)] = d1, d2, d3, frame.f_code

but then every unpacking/packing of this tuple will have an extra
assignment/re-assignment pair. :-(. Yes, even this time costs.

In the end, the fastest method I could find was to use the repr() of
the code object:

func_data[`frame.f_code`] = d1, d2, d3

I knew I was "hacking," but it worked (it gives me all the info I'll
need for later), and it ran ----fast-----. If only I had the
inverse_id(), ....

The second time I hit this problem was in creating persistent objects.
The problem was that I needed to save away the state of objects. I
needed (wanted?) a way to remember the fact that I had already encoded
an object in order to save time when asked to encode it again. It
was actually more than a need, because I can handle recursively
defined data. Hence when encoding data, which is a part of other
lists etc., it is critical to identify the recursive requests. The
natural way to do this was to store some info on each object as it was
translated:

Translation_In_Progress[obj] = 1

Alas, I soon was reminded that not every object is hashable!
Specifically, things as simple as lists are not hashable, and hence
cannot be used as the index in an associative array (oh, double poo!).
In this case, I had *no* alternative but to go the full (slow) route:

Translation_In_Progress[id(obj)] = 1, obj

It turned out that I needed to recover the actual object if I did hit
a recursive problem, and this seemed about the only way to make it
available (gracefully). If you don't know what I mean by the
"recursive" problem, try executing the following three lines of code
(which will crash python, as might be guessed by the warning in the
reference manual):

L = []
L.append(L)
`L`

It really pays to remember what you are trying to represent ;-).

If only I had an inverse_id(), I could have made this whole process
simpler. (It turns out that it takes about as much time to encode
persistent data as it takes to write it to disk, so this is actually a
fairly expensive endeavor). I admit that this section is not the
dominant portion of the code, but the cost is *very* measurable.

So what's the "right" answer?

Note that the standard "id()" function can't really be inverted,
because it does *not* bump the refcount on the object it processes.
As a result, the object can "go away," while the id number persists
(the ref manual hints that id() is really the address of the object.)
I could write an "unsafe" id inverter in C, but I'd be guilt ridden
and unable to sleep for hours ;-). I think IF I had an invertible id
function, it would really produce a "integer like" object that
effectively included a reference to the "real" class, with the
ref-count bumped accordingly. The critical functionality is that it
would compare properly under equality testing, and hence could be used
as an index into an associative array :-).

Suggestions??

Jim

Jim Roskind
408-982-4469
jar@infoseek.com