Re: Unique string storage, CONS cell lists

Mark Lutz (lutz@KaPRE.COM)
Tue, 5 Apr 94 13:57:23 MDT


> > But wouldn't extending strings be more simple, semantically, than
> > adding yet another data type?
>
> Simpler yes, better no. I think a few good points were already made
> about why it would be a bad idea to uniqify all strings.
>
> > If you made class attribute access fast enough, I'd be satisfied
> > with implementing symbols as a Python-coded class.

Ok; tell you what, guys. I'd be content with a built-in
'intern()' function that implemented the following sort of
functionality (this is a summary of code in earlier mails):

--equivalent Python code--

class Symbol:
    def __init__(self, str):
        self.name = str
    def __repr__(self):
        return self.name

_tab = {}

def intern(str):
    try:
        return _tab[str]
    except KeyError:
        _tab[str] = Symbol(str)
        return _tab[str]

--functionality---

x = intern(raw_input('enter a string:'))
x = intern('abc')
y = intern('a' + 'bc')

x is y -> yes: same class instance
print y -> 'abc'

# per tim's mail
y.name -> 'abc'
y.value = 99 -> same as setattr(y, 'value', 99)

# allow for dynamic attribute names too
getattr(y, 'name') -> 'abc'
setattr(y, 'value', 99)

I don't even care if it's implemented in C or Python, so long as
it's an advertised part of the language. This has a number of
advantages, which some of you have pointed out:

1) No real change to the language semantics (just a new built-in),
and hence no major surgery for Guido to do (like making strings
unique globally might be).

2) No backward-compatibility problems for already-compiled code
(as Tim pointed out).

3) It's sufficient to support general symbol tables (running your
tokens through intern() isn't a lot to ask).

4) It leaves the door open to making strings mutable (if Guido is
ever so inclined); this was the genesis of this whole topic,
remember?

5) It's probably not much slower than an internal implementation:
fetching class attributes does the same stuff, roughly.

The only real drawbacks I can see:

1) Class attribute fetching is currently slow. But so is...

2) It probably would break if Guido ever decides to do static
analysis of class attribute references to speed access (index
into a table, instead of hashing into a dictionary, like what
was done for local function variables).

This last point is a can-of-worms I'd rather not open...

So, make intern() built-in, and we can advertise Python as a
symbol-processing language. The only real reason I can see for
also tying this to uniquifying strings is if doing so *also*
brought some significant performance gains internally (as you
discuss below). If so, we could get rid of intern(), but that's
an open issue which is pointless to debate without Guido's input.

> I agree. And, I think the way to make attribute access fast enough is
> to use symbols internally. I have not tried profiling Python yet, but
> I would bet that strcmp() is in the top 10%. An internal symbol
> management system could be used to eliminate many of these calls. The
> compiler and code generator would be able to generate more efficient
> code faster. Since much of the uniquification could be done when
> compiling Python code, I see very few cases where a runtime penalty
> would be imposed. For example, in 'obj.attr', attr could be interned
> during compilation.
>
> Now, assuming that Python has an internal mechanism for symbol
> management, why not export that functionality to the user? I would
> suggest adding a new lexical convention for symbol literals, like
> $symbol for example, and a couple of functions for converting symbols
> to strings and strings to symbols.
>
> I think that lisp has shown us how useful this concept really is, but
> the real benefit, IMHO, would be the speed increase in the python
> implementation.

Hmmm; if strings get uniquified internally for speed in the interpreter,
why not open it up to the user _always_ (strings would be symbols, and
'is' would always work for string tests)? And as you point out, string
uniquification could be done at compile-time in a lot of cases. If strings
are unique internally, why would we need to support _both_ symbols and
the currently non-unique 'string'?
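
To spell out what I mean by 'is' working for string tests, here's a
quick sketch using the intern() code above; the plain-string identity
result is deliberately hedged, since it's exactly the thing that isn't
guaranteed today:

# Plain strings: equal in value, but not necessarily the same object,
# so 'is' can't be used as a cheap equality test today.
a = 'ab' + 'c'
b = 'abc'
print a == b        # 1: the values compare equal
print a is b        # may well be 0: object identity isn't guaranteed

# Run both spellings through the intern()/Symbol code above, and
# identity tests become reliable: one shared object per spelling.
x = intern('ab' + 'c')
y = intern('abc')
print x is y        # 1: same Symbol instance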

About cons-cell lists:

> Speaking of lisp, is anyone interested in a cons cell object?

I would be. I'd be particularly interested in how it performs
for large lists against Python-coded binary trees of the form:

'(head, tail)'

for example:

list = None
list = (1, list) -> cons(1, list)
list = (2, list)
list = (3, list) -> (3, (2, (1, None)))

while list:
    head, list = list
    print head

or equivalently:

list = None
list = (list, 1)
list = (list, 2)
list = (list, 3) -> (((None, 1), 2), 3)

while list:
    list, tail = list
    print tail

I've found this representation to be fast for large lists, since
it avoids the realloc/copy time incurred by append() (the tail/cdr
doesn't get copied on each cons; we just make a new tuple/cons-cell).
This performs similarly to Lisp cons-cells, and also maps pretty
closely to Prolog's '[X|Y]' list tree representation.

This tree list representation has been measured to be roughly 2 times
faster than native Python lists when appending/pushing nodes. Python
lists are really variable-sized arrays, and don't perform well for pure
list processing. A list '+' does a malloc (for the new list) and 2
block list copies. A list append() does a realloc of the list (which
copies the old list block). This gets slow for larger lists.
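
For anyone who wants to sanity-check that "roughly 2 times" figure on
their own machine, something like the throwaway timing loop below is
what I've been using; the size N is arbitrary, and the exact ratio will
of course vary with machine and Python version.

import time

N = 20000                       # arbitrary size; adjust to taste

# native list: grow by appending one element at a time
t = time.time()
l = []
for i in range(N):
    l.append(i)
print 'append:', time.time() - t

# cons-cell style: each push builds a new 2-tuple that shares the
# old tail, so nothing gets copied or realloc'd
t = time.time()
tree = None
for i in range(N):
    tree = (i, tree)
print 'cons:  ', time.time() - t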

The problem I had with this was that a class-based implementation:

def cons(a, b): return (a, b)

class ConsTree:
    def __init__(self):
        self.data = None                # start with an empty tree

class ConsStack(ConsTree):
    def push(self, node):
        self.data = (node, self.data)

    def pop(self):
        node, self.data = self.data
        return node

    def empty(self):
        return (self.data == None)

    def __len__(self):                  # or keep length as data
        len, tree = 0, self.data
        while tree:
            len, tree = len+1, tree[1]
        return len

    def __getitem__(self, index):
        len, tree = 0, self.data
        while len < index and tree:
            len, tree = len+1, tree[1]
        if not tree:
            raise IndexError, index     # off the end of the list
        return tree[0]

is not usable: the overheads of calling the methods more than
negate the savings from the list representation. So I have to
either use the '(x,y)' lists in-line, which is less than
satisfactory, or implement it all in C.
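
For concreteness, the in-line version of the same push/pop looks
something like this (just a toy example, not code from the real
application):

# open-coded cons-cell stack: no method-call overhead, but the
# representation leaks into every caller
stack = None

stack = ('a', stack)        # push 'a'
stack = ('b', stack)        # push 'b'

top, stack = stack          # pop -> 'b'
print top

while stack:                # drain the rest
    top, stack = stack
    print top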

An experiment at implementing this in C failed for us (it wasn't
any faster), so I'm curious to see how you did it.

Mark Lutz