Regardless of implementation, though, I suspect it would be a speed loss
overall, for most programs. The only operations it would clearly speed
up are string equality & inequality testing. Against that, every time a
string object is created it would have to be uniquified, whether or not
it was ever going to be used in one of the few operations uniquification
speeds.
Whether that's a win overall depends on how often string (in)equality
testing is done, compared to how often strings are created. I don't
know, but suspect that _most_ programs have an unfavorable ratio.
SNOBOL4 had a quite different reason for doing this: that language (like
tcl) allowed any string to be a variable name (e.g., <blank><tab><return>),
and wanted to store strings uniquely specifically to support that strange
language feature.
I also worry that if Python were changed to uniquify strings, a future
implementation of a mutable string class would be awful painful, in that
how to implement
mutable_str[i] = 'b'
would become a real baffler if mutable strings and regular (uniquified)
strings are to live together happily.
> ...
> (the time overhead is incurred at read-time, which may or may not
> justify the match-time improvement).
Right! And if Python implemented it behind the covers,
the time overhead is incurred at string-object creation time, which
may or may not justify the match-time improvement
except that the overhead suddenly extends to _all_ string creations
(modulo currently unimplemented optimization techniques).
How about trying this? Here's module "symbol":
_tab = {}
def intern(str):
try:
return _tab[str]
except KeyError:
_tab[str] = str
return str
>>> from symbol import intern
>>> a = intern('abc')
>>> b = intern('a' + 'bc')
>>> a is b
1
>>>
Using something like that is syntactically easy, and allows you to apply
uniquification only to those strings where you know that equality testing
is a very important operation. And I'm not sure uniquification would go
a heck of a lot faster even if it were built in to the language (since
the major expense would be some form of hash-table overhead anyway, and
"intern" is a cheap wrapper).
strings-for-thought-ly y'rs - tim
Tim Peters tim@ksr.com
not speaking for Kendall Square Research Corp