Re: Unique string storage

Mark Lutz (lutz@KaPRE.COM)
Tue, 5 Apr 94 10:40:14 MDT


> I can't claim that my programs are typical, but after looking them over I
> didn't find any that I think uniquifying strings would speed up, and
> found many I'm sure would slow down. Jim mentioned short-lived temps
> during string construction as one source of special pain, & I'll add to
> that the billions & billions (well ...) of lines of Python that crunch
> over files. If Python started computing a hash on every input line, and
> another N hashes for each string.split, and another for each str[i:j],
> ..., well, combine that with Guido's penchant for hashing via prime
> modulus and KSR's lack of HW support for integer division, and it would
> probably put this company out of business <snort>.

See other email; I think we all need to do some profiling before
making claims one way or the other about performance; clearly, there
would be extra work going on, but that extra cost may or may not be
significant.


> So I think Don is right: If Lisp is the model, and this is really needed,
> then Python should follow the lead of (most) Lisps by _not_ uniquifying
> strings, but introducing a new symbol type instead.

But wouldn't extending strings be more simple, semantically, than
adding yet another data type?

I have no religious affiliation with Lisp... but my comparison was
with Lisp 'atoms', not strings.

> OTOH, symbols play a crucial role in Lisp,
> and currently play no role in Python -- for that reason I'm not sure that
> adding a symbol type per se is actually attractive (in Lisp "it fits"; in
> Python it's likely to look&feel like the late addition it would be).

But wouldn't it be Good Thing if adding a minor extension to the
language could make Python useful as an AI language too?

And it's not just AI; as a copmpiler writer, I've seen dozens of
applications that make use of symbol tables.

> > Module Symbol4.py---
> >
> > class Symbol:
> > def __init__(self, str):
> > self.name = str
> > def __repr__(self):
> > return self.name
> >
> > _tab = {}
> >
> > def intern(str):
> > try:
> > return _tab[str]
> > except KeyError:
> > _tab[str] = Symbol(str)
> > return _tab[str]
> >
> > New-and-re-improved example use---
> >
> > x = intern('abc')
> > y = intern('a' + 'bc')
> > x is y -> yes: same class instance
> > print y -> 'abc'
> > y.getattr('name') -> 'abc'
> > y.setattr('value', 99)
> >
> > which is very simple code, but will still run much slower than
> > built-in symbol table support.
>
> Ah, but _why_ must that run much slower than built-in symbol table
> support? As you said, in one sense a Python class instance is really a
> dynamically-created symbol table ("namespace"). Fact is, if you took
> that one more step and rewrote the last two lines as the even simpler
> (but still legit Python):
>
> y.name -> 'abc'
> y.value = 99 -> same as y.setattr('value', 99)
>
> then that _is_ running at the speed of Python's built-in symbol table
> support <wink>.

Neat; I would agree with you completely and crawl back into the
electronic hole from which I sprang, except for the following:

-- My experience with classes suggests that they are almost *always* so
slow, that you wind up hard-coding the feature if you're interested in
efficiency; in a symbol-manipulation program, you would, since attribute
access would be central.

-- Extending Python to support symbol-tables as *built-in* functionality
may attract more interest in the language. You can implement anything
by writing classes (Python's a turing-machine, after all), but making
it built-in makes it easier to use.

> So does this come back to what Don said? I.e., find a way to speed up
> Python's own symbol tables, and in the absence of really bad luck that
> would make this technique run at the same speed (& speeding "intern"
> could, if needed, presumably be done by writing Symbol4Module in C). Don
> is surely right that there's gotta be some faster way of dealing with
> direct attribute references than compiling them into string objects and
> strcmp'ing them at runtime.
>
> A month or so ago, a scheme for reworking attribute lookup was kicked
> around briefly on the list, and Guido seemed interested. If attribute
> reference got _really_ fast, maybe we'd all find a lot more uses for
> attributes. OTOH, that would tick off the OO purists even more.

If you made class attribute access fast enough, I'd be satisfied
with implementing symbols as a Python-coded class.

Mark Lutz