Unique string storage (was Re: Why don't stringobjects have methods?)

Tim Peters (tim@ksr.com)
Mon, 04 Apr 94 15:42:13 -0400

Mark, the implementation would probably be difficult at this point. E.g,
code objects (.pyc files & such) already have (many!) string objects
buried in them, and it's unclear that those could be "uniquified" across
modules and functions cheaply on the fly.

Regardless of implementation, though, I suspect it would be a speed loss
overall, for most programs. The only operations it would clearly speed
up are string equality & inequality testing. Against that, every time a
string object is created it would have to be uniquified, whether or not
it was ever going to be used in one of the few operations uniquification
speeds.

Whether that's a win overall depends on how often string (in)equality
testing is done, compared to how often strings are created. I don't
know, but suspect that _most_ programs have an unfavorable ratio.

SNOBOL4 had a quite different reason for doing this: that language (like
tcl) allowed any string to be a variable name (e.g., <blank><tab><return>),
and wanted to store strings uniquely specifically to support that strange
language feature.

I also worry that if Python were changed to uniquify strings, a future
implementation of a mutable string class would be awful painful, in that
how to implement
mutable_str[i] = 'b'
would become a real baffler if mutable strings and regular (uniquified)
strings are to live together happily.

> ...
> (the time overhead is incurred at read-time, which may or may not
> justify the match-time improvement).

Right! And if Python implemented it behind the covers,

the time overhead is incurred at string-object creation time, which
may or may not justify the match-time improvement

except that the overhead suddenly extends to _all_ string creations
(modulo currently unimplemented optimization techniques).

How about trying this? Here's module "symbol":

_tab = {}

def intern(str):
try:
return _tab[str]
except KeyError:
_tab[str] = str
return str

>>> from symbol import intern
>>> a = intern('abc')
>>> b = intern('a' + 'bc')
>>> a is b
1
>>>

Using something like that is syntactically easy, and allows you to apply
uniquification only to those strings where you know that equality testing
is a very important operation. And I'm not sure uniquification would go
a heck of a lot faster even if it were built in to the language (since
the major expense would be some form of hash-table overhead anyway, and
"intern" is a cheap wrapper).

strings-for-thought-ly y'rs - tim

Tim Peters tim@ksr.com
not speaking for Kendall Square Research Corp