Re: Unique string storage

Mark Lutz (lutz@KaPRE.COM)
Tue, 5 Apr 94 10:19:53 MDT


> Why do you want Python to do this for you?

See earlier email; it wouldn't be worth it *just* for string
comparison speed (probably), but extending the idea to support
general symbol-tables, as data associated with globally mapped
strings, might be worth it because it buys alot of extra
*functionailty*, not just efficiency.


> The two reasons I came up with are to a) increase speed of string
> compare operations; and b) decrease storage of duplicate strings.
>
> With that answer I realized a very fundamental question (which I like
> to ask whenever an optimization is suggested):
>
> What is the provable benefit that you will derive from such a
> optimization?
>
> This gets back to the basic question of how to optimize: Only optimize
> the sections where the time is being spent :-). Since you can almost
> never guess where this place is, use a profiler :-o

EXACTLY! I've got a few responses from people claiming that this
extension would absolutely slow down all Python programs, and so
shouldn't be explored. I don't wish to be rude, but IMHO the
burdon-of-proof rests on *both* sides of the issue.

In my experience, Python almost *never* performs the way you think it
will, after performance tuning. A few months ago, a co-worker of mine
spent some time writing a linear list data structure in C, as an extension
for Python (based on Lisp 'cons-cell' trees); to both our surprise, it
wasn't any faster than coding it as '(head,tail)' tuples in Python code,
and we abondoned the idea.

I've had similar 'surprises' about performance, when trying to hand-tune
sections of Python code [example: slicing (l[0], l[1:]) is faster for
traversing lists recursively, than indexing (l[i], i = i+1)]. It's also
been my experience that using classes always (as a general rule) is
slower than not; the '(head,tail)' list representation was slow enough
as a class to justify chard-coding it instead. Which is why I claimed
that the 'Symbol' class in my earlier mail is probably slower than
internal support (but I could be wrong).

When considering a Python extension, you need 2 pieces of information:

1) How much real extra time would this incur, in typical
programs?

2) Is that extra time really significant, given all the
other time/space overheads Python already imposes?

In Python (and other very-high-level languages), (2) usually dominates:
running a Python program already costs so much, that the *additional*
cost of hashing strings (for example) may not be significant.

So yes, absolutely!: run the profiler before hacking up the system.
But don't dismiss potentially useful extensions on the grounds of
performance without sufficient information either. In Python,
*functionality* is the main goal, not necessarily performance.
We're not trying to compete with C, after all. And as always, it's
Guido's job to determine the proper functionality/performance mix.
I'm just generating ideas.

> Hence instead of having Tim scan his code to see that this activity
> would not speed up his code, I think the burden of proof falls on the
> proponent to show at least some applications where this change would
> make a significant positive difference (gut feel doesn't count,
> profile data does). The more I thought about it, the more I came to
> expect that on most machine architectures, a string compare operation
> really flies. Hence I'm a bit doubtful that you'll really save that
> much by optimizing away the string compares.

You suggest using a profiler, and I concur. But I'm puzzled about why
you then go on to make unfounded claims about the performance of the
extension. I won't if you won't... :-)

> If you have an
> application where this is *really* all you are doing, then I'd bet
> that lex/flex was really the sort of tool you were looking for, and
> even the unique-ification stuff won't help you.

Well, maybe; if this extension would be useful for all applications
that require some sort of symbol table, then it would help open up
Python as a tool for such applications, if internalizing it performed
better than a class-based implementation.


> Then again, maybe you're after the space savings, hmmmmmmm.... I guess
> I'm not on a DOS box today, and I'm less sympathetic ;-) If this if
> what you're after, you'd also have to show a sample app that really
> filled up memory with strings (for a good reason!). I'd then note
> that what you're really after is a space time trade-off
> (unique-ification takes time, but will save memory), and you'd have a
> harder time selling us a slower python so that that it could shoe-horn
> itself into some minimal amount of RAM.

If it's really a significantly slower Python...

I'm not concerned about space performance much; the *extra* space
probably isn't significant.

If and when I have some time to play, I'll experiment with timing
some of this stuff. If the class-based technique in my earlier email
(module Symbol4.py) is as fast as a C-coded implementation, I'll
abandon the issue. However, performance wasn't all I was after;
functionality counts too.

Mark Lutz

we now return you to your regularly scheduled language politics :-)