Re: Timing of Python constructs

Tim Peters (tim@ksr.com)
Sat, 02 Apr 94 19:59:00 EST

Jim (and anyone else interested in timing), Steve Majewski posted a
"disassembler class" not long ago, which builds on the std library's
disassembler to display a really nice dump of Python's "compiled" code
(the virtual machine instructions such as you find in .pyc files).
Steve, think you should repost that, or send it to the newsgroup? If
not, I'll happily mail it to anyone who asks.

It makes it, e.g., very easy to _see_ why sticking os.times() in a loop
is slower than binding os.times to a local name before the loop. Even if
you don't know what Python's virtual machine instructions do (although
most are pretty clear), you can still pick up a lot by just counting how
many of them get generated as you fiddle your source code.
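
For the curious, here's a minimal sketch of the kind of thing I mean,
using nothing but the std dis module (the function names below are made
up for illustration, not from anyone's real code):

    import dis, os

    def in_loop(n):
        while n > 0:
            t = os.times()      # global lookup of os + attribute fetch, every pass
            n = n - 1

    def bound_local(n):
        times = os.times        # bind the function to a local name, once
        while n > 0:
            t = times()         # one LOAD_FAST of a local, every pass
            n = n - 1

    dis.dis(in_loop)            # compare the loop bodies in the two dumps
    dis.dis(bound_local)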

When you want to learn more, you can look up the name of a symbolic
instruction in Python/Python/ceval.c's evaluation loop (they show up in
the switch statement, like

    case LOAD_FAST: ...
    case BUILD_LIST: ...

etc).
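
You can also poke at the instruction set from Python itself; a small
sketch, assuming the std opcode module that ships with current
distributions:

    import opcode

    # The symbolic names you see in a disassembly are the same names
    # used in the C evaluation loop; opmap gives each one's opcode number.
    for name in ('LOAD_FAST', 'BUILD_LIST'):
        print(name, opcode.opmap[name])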

> ...
> This is one of many tricks that I noted as I developed my python
> profiler.

Just noting that people obsessed enough to _want_ to find these tricks
(yes, I'm one) will find the disassembler to be an invaluable aid.

> ...
> This sort of optimization is not suggested for general code, but
> when time is of the essence (such as when benchmarking times, or in a
> critical inner function as identified by a working profiler) the above
> two factoids can be very helpful.

Absolutely!

> > while n > 0:
> >     temp = n - 1
> >     n = temp
> >
> > ...
> >
> > while n > 0:
> >     n = n - 1
>
> Actually, when running tests, you should be very careful to isolate
> the essence of what you are looking for. I was careful to use
> identical lengths in my function names, and short local variable names
> (to prevent such issues from having an impact on the results). In
> addition, your test has unwittingly contaminated the local dictionary
> that is used in the second while loop with the addition of another
> local variable to its context (the "temp" from the first loop).

Unless you're using a very old version of Python, local variables are
resolved at compile time. That is, each local variable is assigned a
unique index into a vanilla C array _at compile time_, and, e.g., a
Python local load just becomes a constant-time indexed load from that C
array. So, e.g., the code generated for those loops is not affected by
how many local variables there are, or how long their names are, or where
in the function they first appear.
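
A quick way to convince yourself of that, again with nothing but the
std dis module (a sketch; the function names are mine, not Jim's):

    import dis

    def with_temp(n):
        while n > 0:
            temp = n - 1
            n = temp

    def without_temp(n):
        while n > 0:
            n = n - 1

    # Both loop bodies come out as LOAD_FAST/STORE_FAST operating on slot
    # indices fixed at compile time; the extra "temp" means one more slot,
    # but no dictionary lookup appears in either dump.
    dis.dis(with_temp)
    dis.dis(without_temp)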

Those things _can_ have effects, but they're at worst bizarre second-
order effects. E.g., lengthening (or shortening!) a name may move memory
just enough to provoke an unfortunate cache conflict. And in rare cases
Python needs to materialize the local namespace into a dict at runtime,
and that takes time proportional to the number of local names.
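
If you want to see that last case, here's one way to force it (a
sketch; calling locals() is just one of the things that makes Python
build the dict):

    def f():
        a, b, c, unused = 1, 2, 3, None
        # locals() builds a fresh dict from the "fast" local slots every
        # time it's called, so its cost grows with the number of local
        # names -- unlike an ordinary local load, which stays constant-time.
        return locals()

    print(f())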

> Designing tests is not easy, and shortcuts often invalidate the focus
> of the tests.

Definitely, but some shortcuts really are pretty safe <wink>.

> The tests I presented in my example are much more focused.

'Twas indeed a nice timing structure! E.g., most people would have
started with

    for i in range(n): ...

I looked at yours carefully, & don't think you missed any trick I know
about.
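
For anyone who wants to play along, the general shape is something like
the sketch below; the helper name and the subtract-the-empty-loop trick
are mine, not a reconstruction of Jim's harness:

    import time

    def best_of(func, arg, repeat=5):
        # Hypothetical helper, not Jim's code: report the smallest of
        # several runs, since the minimum is the least polluted by
        # whatever else the machine happened to be doing.
        clock = time.time              # bound to a local name, once
        best = None
        for _ in range(repeat):
            start = clock()
            func(arg)
            elapsed = clock() - start
            if best is None or elapsed < best:
                best = elapsed
        return best

    def empty(n):                      # same loop shape, no extra work
        while n > 0:
            n = n - 1

    def tested(n):                     # the construct being measured
        while n > 0:
            temp = n - 1
            n = temp

    print(best_of(tested, 100000) - best_of(empty, 100000))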

> IF I wanted to *really* worry about such issues, I'd even have several
> functions with different names to call, just to make sure that a
> hashing collision in the function name lookup was not biasing the
> results.

For that matter, you should also rerun the tests with random amounts of
padding between functions, to make sure the results aren't illusions
caused by rare I-cache conflicts <wink>. Seriously, the more tricks we
build into our HW, the more nightmarish it is to factor out random
interactions among them.

"but-it-runs-20x-faster-if-i-add-two-unreferenced-variables!"-ly y'rs
- tim

Tim Peters tim@ksr.com
not speaking for Kendall Square Research Corp