Re: Unique string storage

Tim Peters (tim@ksr.com)
Tue, 05 Apr 94 01:16:21 -0400

I can't claim that my programs are typical, but after looking them over I
didn't find any that I think uniquifying strings would speed up, and
found many I'm sure would slow down. Jim mentioned short-lived temps
during string construction as one source of special pain, & I'll add to
that the billions & billions (well ...) of lines of Python that crunch
over files. If Python started computing a hash on every input line, and
another N hashes for each string.split, and another for each str[i:j],
..., well, combine that with Guido's penchant for hashing via prime
modulus and KSR's lack of HW support for integer division, and it would
probably put this company out of business <snort>.

So I think Don is right: If Lisp is the model, and this is really needed,
then Python should follow the lead of (most) Lisps by _not_ uniquifying
strings, but introducing a new symbol type instead. I don't know that it
needs to have much to do with strings; in old-time Lisp the only clear
connection was that a symbol's "printname" property had as value the
symbol's name as a string (for such symbols as had names) -- and a
function like "intern" took a string and returned the symbol whose
printname was that string. OTOH, symbols play a crucial role in Lisp,
and currently play no role in Python -- for that reason I'm not sure that
adding a symbol type per se is actually attractive (in Lisp "it fits"; in
Python it's likely to look&feel like the late addition it would be).

> ...
> or you could get rid of getprop, putprop as well:
>
>
> Module Symbol4.py---
>
> class Symbol:
> def __init__(self, str):
> self.name = str
> def __repr__(self):
> return self.name
>
> _tab = {}
>
> def intern(str):
> try:
> return _tab[str]
> except KeyError:
> _tab[str] = Symbol(str)
> return _tab[str]
>
>
> New-and-re-improved example use---
>
> x = intern('abc')
> y = intern('a' + 'bc')
> x is y -> yes: same class instance
> print y -> 'abc'
> y.getattr('name') -> 'abc'
> y.setattr('value', 99)
>
>
> which is very simple code, but will still run much slower than
> built-in symbol table support.

Ah, but _why_ must that run much slower than built-in symbol table
support? As you said, in one sense a Python class instance is really a
dynamically-created symbol table ("namespace"). Fact is, if you took
that one more step and rewrote the last two lines as the even simpler
(but still legit Python):

y.name -> 'abc'
y.value = 99 -> same as y.setattr('value', 99)

then that _is_ running at the speed of Python's built-in symbol table
support <wink>. (Well, today I believe that between "y.name" and
"getattr(y, arbitrary_string_variable_with_value_name)", the latter takes
a fixed & small amount of time longer than the former, so there's not
much difference between them.)

So does this come back to what Don said? I.e., find a way to speed up
Python's own symbol tables, and in the absence of really bad luck that
would make this technique run at the same speed (& speeding "intern"
could, if needed, presumably be done by writing Symbol4Module in C). Don
is surely right that there's gotta be some faster way of dealing with
direct attribute references than compiling them into string objects and
strcmp'ing them at runtime.

A month or so ago, a scheme for reworking attribute lookup was kicked
around briefly on the list, and Guido seemed interested. If attribute
reference got _really_ fast, maybe we'd all find a lot more uses for
attributes. OTOH, that would tick off the OO purists even more.

sounds-more-attractive-by-the-minute<grin>-ly y'rs - tim

Tim Peters tim@ksr.com
not speaking for Kendall Square Research Corp