Re: threads on Solaris 2.3: do they work?

Guido.van.Rossum@cwi.nl
Mon, 28 Feb 1994 16:00:54 +0100

> I constructed a little function that should take 2 seconds to finish.
> With 7 threads running simultaneously, I expected it to take less than
> 4 seconds of real time (since I have 4 CPUs). This is what I got with
> fork. Worst case, if it didn't use the multi-tasking, should be 14
> seconds. It actually took 76 seconds of cpu time for all the threads
> to finish!!!
>
> Note the high number of context switches in the vmstat listing: over
> 10,000 per second!! (for the fork version of this, the number of
> context switches remains very low). Is this the problem? Is there
> any way to change this? The program follows below.

I've investigated this a bit and think I can offer an explanation.

The Python interpreter was originally designed for single-threaded
operation. When I added threads I had to add a global mutex that the
interpreter holds most of the time, protecting the interpreter's
global variables (not that there are many, but enough to worry about
parallel accesses) and the reference counts of all objects. (The
accesses are so fine grained that a mutex per object, e.g. in the
INCREF and DECREF macros, would be prohibitively slow.) This mutex is
released and re-acquired every ten "Python virtual instructions"
(i.e. iterations through the interpreter's main loop in ceval.c:
eval_code()). The mutex is also released by all (or almost all
anyway) built-in functions and operations that may block for I/O, so
that if multiple threads are used to multiplex I/O streams, they work
as expected.
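In modern terms the pattern looks roughly like the following sketch. (This is a hypothetical illustration using the `threading` module, not the real ceval.c loop; `CHECK_INTERVAL` and `run_instructions` are made-up names.)

```python
import threading

# Hypothetical sketch of the pattern, NOT the actual interpreter loop.
# One global mutex guards all interpreter state; each thread releases
# and immediately re-acquires it every CHECK_INTERVAL "instructions",
# giving other threads a window in which to grab the mutex.
CHECK_INTERVAL = 10
gil = threading.Lock()

def run_instructions(instructions):
    """Run a list of zero-argument callables under the global mutex."""
    results = []
    gil.acquire()
    try:
        for count, op in enumerate(instructions, 1):
            results.append(op())
            if count % CHECK_INTERVAL == 0:
                gil.release()   # window for another thread to run
                gil.acquire()
    finally:
        gil.release()
    return results
```

With several CPU-bound threads doing this, each release immediately wakes a waiting thread on another CPU, which explains the enormous context-switch counts in the vmstat listing.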

I have strong evidence that the frequent releasing and re-acquiring of
this mutex is causing the *slowdown* that Steve noticed, and which I
can also measure (with the program below) on our 2-CPU Sparcstation 10
running Solaris 2.3. My evidence: when I reduce the frequency to
every 100 or every 1000 instructions, the slowdown decreases to a much
more reasonable value. (See the three tables below.)

I ran the same test (once every 10 instructions) on a 2-CPU SGI Onyx
and found comparable slowdown values.

What to do about it? Releasing the lock less often has the
disadvantage that one thread can monopolize the CPU for a long time --
not all Python instructions take the same processing time, and while
some instructions may execute in a microsecond, others may take a
millisecond or more.  Also, checking for keyboard interrupts is done
with the same frequency -- thus a program caught in an infinite loop
might be hard to stop.
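The raw cost of those handoffs can be estimated with a small sketch (`handoff_cost` is a hypothetical helper; timings are machine-dependent, and the contended multi-CPU case is far worse than this uncontended measurement):

```python
import threading
import time

def handoff_cost(pairs=100000):
    """Rough per-pair cost, in seconds, of a release+acquire cycle
    on an uncontended lock."""
    lock = threading.Lock()
    lock.acquire()
    t0 = time.time()
    for _ in range(pairs):
        lock.release()  # one handoff opportunity...
        lock.acquire()  # ...immediately taken back
    t1 = time.time()
    lock.release()
    return (t1 - t0) / pairs
```

Multiply the per-pair cost by one handoff every ten virtual instructions and the overhead per instruction becomes visible, even before contention enters the picture.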

Could a compromise be to reduce the checking frequency when threads
are used?

--Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>
URL: <http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>

======================================================================
Release+acquire every 10 Python instructions
  n    total t/thread slowdown
  1    0.426    0.426    1.000
  1    0.426    0.426    1.001
  1    0.428    0.428    1.005
  1    0.431    0.431    1.012
  2    1.513    0.756    1.776
  2    1.880    0.940    2.207
  3    2.894    0.965    2.265
  3    2.955    0.985    2.313
  4    4.478    1.120    2.629
  4    3.722    0.930    2.185
  5    3.737    0.747    1.755
  5    4.092    0.818    1.922
  6    6.701    1.117    2.623
  6    8.391    1.398    3.284
  7    7.209    1.030    2.419
  7    5.304    0.758    1.779
  8   15.337    1.917    4.502
  8   13.533    1.692    3.973
 10   11.470    1.147    2.694
 10   14.190    1.419    3.333
======================================================================
Release+acquire every 100 Python instructions
  n    total t/thread slowdown
  1    0.386    0.386    1.000
  1    0.386    0.386    1.000
  1    0.386    0.386    1.001
  1    0.391    0.391    1.013
  2    0.923    0.462    1.196
  2    0.902    0.451    1.169
  3    1.490    0.497    1.287
  3    1.597    0.532    1.379
  4    2.130    0.532    1.380
  4    2.199    0.550    1.425
  5    2.905    0.581    1.506
  5    2.974    0.595    1.541
  6    3.738    0.623    1.614
  6    3.711    0.619    1.603
  7    3.978    0.568    1.473
  7    4.309    0.616    1.595
  8    4.859    0.607    1.574
  8    5.280    0.660    1.710
 10    7.039    0.704    1.824
 10    6.851    0.685    1.775
======================================================================
Release+acquire every 1000 Python instructions
  n    total t/thread slowdown
  1    0.382    0.382    1.023
  1    0.382    0.382    1.023
  1    0.382    0.382    1.025
  1    0.387    0.387    1.036
  2    0.778    0.389    1.043
  2    0.782    0.391    1.048
  3    1.175    0.392    1.050
  3    1.181    0.394    1.055
  4    1.508    0.377    1.010
  4    1.499    0.375    1.004
  5    1.866    0.373    1.000
  5    1.887    0.377    1.011
  6    2.310    0.385    1.032
  6    2.264    0.377    1.011
  7    2.737    0.391    1.048
  7    2.865    0.409    1.097
  8    3.274    0.409    1.097
  8    3.297    0.412    1.105
 10    4.115    0.412    1.103
 10    4.048    0.405    1.085
======================================================================
# thread test

import os
import sys
import time
import getopt
import thread

mutex = thread.allocate_lock()
wait = thread.allocate_lock()
wait.acquire()

gettime = time.time

def run(tid):
    global active
    for i in xrange(LOOP): pass
    mutex.acquire()
    active = active-1
    ## print 'thread', tid, 'done,', active, 'left'
    if active == 0: wait.release()
    mutex.release()
    thread.exit_thread()

def main():
    global active, LOOP

    LOOP = 10000
    N = 10

    opts, args = getopt.getopt(sys.argv[1:], 'i:')
    for o, a in opts:
        if o == '-i': LOOP = int(eval(a))

    if not args: args = ['1', '10']
    iargs = []
    for arg in args: iargs.append(int(eval(arg)))
    if 1 not in iargs: iargs.append(1)
    iargs.sort()

    results = []

    fastest = 1000000.0

    for N in iargs:
        if N < 1:
            print 'Ignoring weird value for N:', N
            continue
        T0 = gettime()
        active = N
        for i in range(N):
            thread.start_new_thread(run, (i,))
        wait.acquire()
        T1 = gettime()

        T = T1-T0
        print N, 'threads,', LOOP, 'iterations',
        print 'in', round(T, 3), 'seconds',
        print 'i.e.', round(T/N, 3), 'seconds/thread'
        results.append((N, T))
        fastest = min(fastest, T/N)

    print
    print '  n    total t/thread slowdown'
    for n, t in results:
        print '%3d %8.3f %8.3f %8.3f' % (n, t, t/n, t/n/fastest)

main()
======================================================================