This white paper discusses the problem of building and administering large
systems written in Python. It focuses on the adequacy of the current import
statement and module search path mechanism, and proposes some improvements.
Introduction
------------
Okay, here's the scenario. You have built a large software system using
Python, and you are maintaining it. The software system consists of many
independently written Python packages that you have obtained from various
internet sites, plus your own code. All told, there are almost a thousand
Python modules in the system. The hardware consists of a heterogeneous
network, with Macs, PCs and various types of Unix machines, all running your
Python code. The software is located on an NFS file system that is exported
to all of the machines (including the PCs and Macs).
One of the problems in configuring a large system like this is configuring
the import statement so that it always maps a module name onto the correct
file name. Your job as system administrator is much easier if you can keep
this configuration information separate from the source code. This is
important for two reasons:
1) If you have to rename modules and rewrite import statements in
third-party packages that you have obtained, then you are creating a lot
of extra work for yourself: work that will have to be redone each time
you receive a new release of a package.
2) If the configuration information is separate from the source code, then
it is much easier to change. A configuration change means changing only
one file, instead of hundreds.
Right now, the PYTHONPATH variable is the sole mechanism for configuring the
import statement. I intend to show that this mechanism is inadequate for
large systems, and I will suggest improvements.
A Hierarchical Name Space for Modules
-------------------------------------
Python has a flat namespace for modules. This will eventually lead to the
same namespace congestion problems that are currently seen with C's flat
namespace for global objects. Hypothetical example: suppose that I were
to release a portable GUI toolkit called Grass, consisting of 50 or more
Python modules. These modules have names like Button, Window, ScrollBar, etc.
With the current module namespace, there is a high probability of conflicts
between my module names and the names of modules in the standard Python
library, or in Python packages written by other people.
I could try to avoid these problems by using a naming convention: I could
name my modules Grass_Button, Grass_Window, etc. But there is a better
solution: change Python to support a hierarchical name space for modules.
Under this scheme, the import statement is extended to support hierarchical
module names, which are sequences of identifiers separated by dots.
For example,
import Grass.Window
The imported module is bound to the last identifier in the module name.
E.g., the module 'Grass.Window' is bound to the name 'Window'.
Hierarchical module names are analogous to Unix path names. The modules
in a module hierarchy are stored in a directory tree. The root name of
a module hierarchy (Grass, in the example) is conventionally the name of
a package.
The components of a module hierarchy need not all exist in a single directory
tree; they can be spread out across the directories in the PYTHONPATH search
path. For example, suppose that my PYTHONPATH is:
/u/doug/lib/python:/usr/local/lib/python/sun4:/usr/local/lib/python
When I execute the statement
import Grass.Games.Minesweeper
then the import statement will search the following list of pathnames, until
it finds one that exists:
/u/doug/lib/python/Grass/Games/Minesweeper.py
/usr/local/lib/python/sun4/Grass/Games/Minesweeper.py
/usr/local/lib/python/Grass/Games/Minesweeper.py
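The mapping from a hierarchical module name to candidate file names can be
sketched in a few lines of Python. This is only an illustration of the
proposed rule, not existing interpreter code; the function name
candidate_files is an assumption made for the example.

```python
def candidate_files(module_name, search_path):
    """Return the files the proposed import would try, in search order."""
    # 'Grass.Games.Minesweeper' becomes 'Grass/Games/Minesweeper.py'
    relative = "/".join(module_name.split(".")) + ".py"
    return [directory.rstrip("/") + "/" + relative for directory in search_path]


# The PYTHONPATH from the example above, split on ':' as on Unix.
pythonpath = "/u/doug/lib/python:/usr/local/lib/python/sun4:/usr/local/lib/python"
for filename in candidate_files("Grass.Games.Minesweeper", pythonpath.split(":")):
    print(filename)
```

The import statement would load the first file in this list that exists.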
Spreading the components of a package like 'Grass' across several directory
trees is done for a couple of reasons:
1) Portable modules are put in
/usr/local/lib/python/Grass/...
while machine dependent modules are placed in
/usr/local/lib/python/<machine-name>/Grass/...
The latter take precedence over the former. I can have implementations
of Grass for several different machine types on the same network file
system, and just vary the search path for each system.
2) The search path mechanism can be used to good effect by a team of
programmers who are developing a large Python application. The current
tested and working version of the application is stored in the 'master
tree'. A programmer who wishes to modify one or more of the modules in
this tree will create a 'shadow tree' which has the same directory
structure as the master tree. Into the shadow tree he puts copies of just
those modules he is working on. He sets his PYTHONPATH variable to
put <shadowTree> ahead of <masterTree>. Once he has made his changes and
tested them, the changes can be incorporated into the master tree. Since
each programmer has their own shadow tree, they can each work on different
parts of the system without interfering with one another.
You can do these things already with the existing Python import and search
mechanism. What I'm demonstrating here is that the switch to a hierarchical
name space preserves these capabilities.
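Here is a minimal sketch of the shadow tree idea under the hierarchical
scheme. It builds two throwaway directory trees and shows that a module
present in the shadow tree hides the copy in the master tree, while all
other modules still come from the master tree. The helper names
find_module_file and make_tree, and the module Grass.Button used in the
demo, are assumptions made for the example.

```python
import os
import tempfile


def find_module_file(module_name, search_path):
    """Return the first existing file for module_name, or None."""
    relative = os.path.join(*module_name.split(".")) + ".py"
    for directory in search_path:
        filename = os.path.join(directory, relative)
        if os.path.isfile(filename):
            return filename
    return None


def make_tree(root, module_names):
    """Create empty module files under root (demo helper only)."""
    for name in module_names:
        filename = os.path.join(root, *name.split(".")) + ".py"
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        open(filename, "w").close()


with tempfile.TemporaryDirectory() as tmp:
    master = os.path.join(tmp, "master")
    shadow = os.path.join(tmp, "shadow")
    make_tree(master, ["Grass.Window", "Grass.Button"])
    make_tree(shadow, ["Grass.Window"])  # only the module being modified

    search_path = [shadow, master]  # shadow tree ahead of master tree
    print(find_module_file("Grass.Window", search_path))  # found in shadow
    print(find_module_file("Grass.Button", search_path))  # found in master
```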
Implementing this proposal requires only minor changes to Python, in
the grammar (for the syntax of a module name in an import statement), and
in the code that traverses the search path. $PYTHONPATH and sys.path remain
unchanged. sys.modules is still a dictionary mapping module names to modules,
except now the module names may contain dots.
Having a hierarchical name space for modules is not quite the same thing as
having nested modules. With the implementation I have outlined above, you
cannot write:
import Grass
Grass.Games.Minesweeper.main()
This will fail: "Grass" names only a directory in the search path, not a
module file, so the module "Grass" will not be found.
Multiple Search Paths
---------------------
A single search path that is shared by all packages becomes a problem if
you have multiple versions of some packages. Suppose that there are versions
of packages A and B in both /u/doug/lib/python and /usr/local/lib/python.
Suppose that I want to use the version of A in /u/doug/lib, but I want to
use the version of B in /usr/local/lib. There is no way to set my search
path to obtain this effect. The solution is to allow each package to have
its own search path.
Here is a further extension to Python which implements this.
The variable sys.paths is a dictionary mapping package names onto search paths
(lists of strings). When import looks up a module, it splits the module name
into two parts: the initial identifier is called the package name, and the
remaining part is the tail (which is empty if the module name consists of a
single identifier). The package name is used as an index into sys.paths.
If this succeeds, then the tail is combined with each component of the
resulting search path to generate a sequence of filenames to search.
If the package name is not contained in sys.paths, then the general
search mechanism using sys.path is used.
How does sys.paths get initialized? I first thought of encoding the
information in an environment variable, similar to how sys.path is initialized
from PYTHONPATH. But I think the following idea is better.
If the environment variable PYTHONINIT exists and begins with a /,
then it is the pathname of a Python module which is executed during startup
by the Python interpreter. If it doesn't begin with a /, then it contains
Python code which is evaluated during startup.
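The PYTHONINIT rule could be sketched as follows. This is an illustration
of the proposal only; PYTHONINIT is not an existing interpreter feature,
and the function name run_python_init and its arguments are assumptions
made for the example.

```python
import os


def run_python_init(namespace, environ=os.environ):
    """Run the startup code named by PYTHONINIT inside namespace."""
    value = environ.get("PYTHONINIT")
    if value is None:
        return  # nothing to do
    if value.startswith("/"):
        # The value is the pathname of a file of Python code.
        with open(value) as init_file:
            source = init_file.read()
        exec(compile(source, value, "exec"), namespace)
    else:
        # The value is itself Python code.
        exec(value, namespace)


# Demo: pass a fake environment so the real one is left alone.
namespace = {"paths": {}}
fake_environ = {"PYTHONINIT": "paths['algebra'] = ['/u/fred/python/algebra']"}
run_python_init(namespace, fake_environ)
print(namespace["paths"])
```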
The search paths in sys.paths are treated a bit differently from the list of
directories in sys.path. This is best illustrated by an example.
Suppose that my init file contains the following lines:
from sys import paths
paths['algebra'] = ['/u/fred/python/algebra']
I've done this because I want to use fred's algebra package without
getting all of his other code in my search path. Now, if I import the
module 'algebra', then import will try to load the file:
/u/fred/python/algebra.py
If I import 'algebra.groups', then import will try to load:
/u/fred/python/algebra/groups.py
If I already have my own, quite different 'algebra' package, then I can
access fred's algebra package under the name 'falgebra' by using the following:
paths['falgebra'] = ['/u/fred/python/algebra']
In other words, the sys.paths mechanism lets me rename packages to eliminate
conflicts.
[Well, almost. This renaming won't work if fred's algebra package has multiple
modules, and if one of those modules imports another using the package name
'algebra'. Read ahead for a solution.]
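The lookup rule of this section can be sketched as a small function. It is
only an illustration of the proposal: sys.paths does not exist in Python,
so the dictionary is passed in as an ordinary argument, and the name
files_to_search is an assumption made for the example.

```python
def files_to_search(module_name, paths, default_path):
    """List the files import would try for module_name, in order."""
    # Split into the package name and the (possibly empty) tail.
    package, _, tail = module_name.partition(".")
    if package in paths:
        # Named search path: combine the tail with each entry.
        suffix = "/" + tail.replace(".", "/") if tail else ""
        return [entry + suffix + ".py" for entry in paths[package]]
    # Otherwise fall back to the general mechanism (sys.path), which
    # uses the whole module name.
    relative = module_name.replace(".", "/") + ".py"
    return [directory + "/" + relative for directory in default_path]


paths = {"falgebra": ["/u/fred/python/algebra"]}
default_path = ["/u/doug/lib/python"]

print(files_to_search("falgebra", paths, default_path))
print(files_to_search("falgebra.groups", paths, default_path))
print(files_to_search("algebra.groups", paths, default_path))
```

The first two lookups go to fred's directory under the new name, while the
last one is resolved through the ordinary search path.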
Patching A Module
-----------------
Here's something else you can do with sys.paths. Suppose that you have
a distribution of a large package Pkg written in Python, and you find a bug.
There is a single function in one of the modules that doesn't work, and you
know how to reimplement that function to fix it. For some reason, you don't
want to patch the distribution itself; instead, you want to create a patch
that is *external* to the distribution. [Perhaps you don't have source for
the offending module; perhaps you don't want your change to conflict with
other users of the package.] Here's how to do it.
Let's assume that you are patching the function F in the module Pkg.M.
First, add these lines to your init file:
paths['Pkg'] = ['/u/you/python/Pkg', '/usr/Pkg']
paths['OrigPkg'] = ['/usr/Pkg']
This maps the package name Pkg to the patched version, while the package name
OrigPkg now refers to the unpatched version. Your patch file will be called
/u/you/python/Pkg/M.py, and it looks like this:
from OrigPkg.M import *
oldF = F
def F(): ... oldF() ...
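To make the pattern concrete, here is a self-contained sketch that
simulates the two modules in memory instead of relying on sys.paths, which
does not exist in Python. The bodies of F and G, and the names orig_m and
patch_m, are hypothetical and chosen only for the demonstration.

```python
import types

# Stand-in for the unpatched module OrigPkg.M (contents are hypothetical).
orig_m = types.ModuleType("OrigPkg.M")
exec(
    "def F(x):\n"
    "    return 100 // x      # the bug: fails when x is 0\n"
    "def G(x):\n"
    "    return x * 2\n",
    orig_m.__dict__,
)

# Stand-in for the patch file /u/you/python/Pkg/M.py.
patch_m = types.ModuleType("Pkg.M")

# Equivalent of 'from OrigPkg.M import *': copy the public names.
for name, value in vars(orig_m).items():
    if not name.startswith("_"):
        setattr(patch_m, name, value)

oldF = patch_m.F  # keep a reference to the original function


def F(x):
    """Replacement for F: guard the failing case, then defer to oldF."""
    if x == 0:
        return 0
    return oldF(x)


patch_m.F = F

print(patch_m.F(0), patch_m.F(4), patch_m.G(3))
```

Callers that import Pkg.M see the repaired F and the untouched G, and the
original distribution is never modified.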
You can patch built-in modules just as easily. For example, to patch the
posix module, you might use the following lines in your init file:
paths['builtin'] = ['*builtin']
paths['posix'] = ['/u/you/python/posix']
Within your patch file, you will import the real posix module using the
name "builtin.posix". (The special prefix "*builtin" is magic, and causes
the internal list of builtin modules to be searched when it is encountered
in a named search path.)
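One way the "*builtin" entry might be handled is sketched below. The
function name locate and the shape of its result are assumptions made for
the example; sys.builtin_module_names is the interpreter's real list of
built-in module names.

```python
import sys

BUILTIN_MARKER = "*builtin"


def locate(module_name, paths):
    """Return the places to look for module_name via its named search path.

    Each place is either ('builtin', name) or ('file', filename).
    """
    package, _, tail = module_name.partition(".")
    places = []
    for entry in paths.get(package, []):
        if entry == BUILTIN_MARKER:
            # Magic entry: consult the internal list of built-in modules.
            if tail in sys.builtin_module_names:
                places.append(("builtin", tail))
        else:
            suffix = "/" + tail.replace(".", "/") if tail else ""
            places.append(("file", entry + suffix + ".py"))
    return places


paths = {"builtin": ["*builtin"], "posix": ["/u/you/python/posix"]}
print(locate("posix", paths))        # the patch file
print(locate("builtin.sys", paths))  # a real built-in module
```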
Context Sensitive Import
------------------------
You have two large Python packages, A and B. Both packages come with an
auxiliary package called 'util', which they use internally. While these
two 'util' packages are not the same, they both contain modules called
'util.parse'.
You would like to build an application that imports from both A and B.
The problem is that, no matter how you set your search paths, packages A
and B are both going to get the same module when they import 'util.parse'.
You could fix the problem by renaming modules and changing import statements
within either A or B, but you don't relish the idea of having to reapply
these changes each time you get a new distribution of A or B.
The solution is to define another new mechanism, which permits the mapping
from module names to file names to depend on which package is doing the
importing. This mechanism is sys.rename, a dictionary which maps pairs of
package names onto module name prefixes. For example, this:
from sys import rename
rename[('A', 'util')] = 'A_util'
means that whenever a module in package A imports from the util package,
the name of the module being imported will be rewritten so that the prefix
'util' is replaced by 'A_util'. To solve the problem given above, you
might put the following lines in your init file:
paths['A_util'] = ['/usr/local/A/util']
paths['B_util'] = ['/usr/local/B/util']
rename[('A', 'util')] = 'A_util'
rename[('B', 'util')] = 'B_util'
Here's another situation where you might use this mechanism. You have just
received version 2.5 of package P, which promises better performance and
new features. You already have version 2.2 installed; it's stable, and many
of your existing applications are using it. You decide to install version
2.5, but to minimize disruption from possible incompatibilities, you want
to switch over only a few applications. You use a single global init file
to configure all of the Python software on your system. Here is an excerpt
from that init file:
# P2.2 is the default version of P
paths['P'] = ['/usr/local/p2.2']
paths['P2_5'] = ['/usr/local/p2.5']
In order to switch the 'frog' application over to using version 2.5 of P,
all you do is add the following line to your init file:
rename[('frog','P')] = 'P2_5'
If this turns out to cause problems, you delete the line, and frog now uses
P2.2 again.
Implementing the mechanism is straightforward. The import statement needs to
know the name of the module which called it; it extracts the package names
from the calling module name and from the to-be-imported module name, uses
these two names as an index into sys.rename, and rewrites the target module
name if the lookup succeeds. The rewritten module name is then looked up
using the search algorithm described earlier.
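The rewriting step can be sketched as follows. sys.rename is part of this
proposal and does not exist in Python, so the dictionary is passed in as an
argument; the function names and the importing module names 'A.lexer' and
'B.main' are assumptions made for the example.

```python
def package_of(module_name):
    """Return the initial identifier of a dotted module name."""
    return module_name.split(".", 1)[0]


def rewrite_import(importer, target, rename):
    """Rewrite target according to rename, given the importing module."""
    key = (package_of(importer), package_of(target))
    if key not in rename:
        return target  # no rule applies; leave the name alone
    # Replace only the package prefix and keep the tail as it is.
    _, dot, tail = target.partition(".")
    return rename[key] + dot + tail


rename = {
    ("A", "util"): "A_util",
    ("B", "util"): "B_util",
    ("frog", "P"): "P2_5",
}

print(rewrite_import("A.lexer", "util.parse", rename))
print(rewrite_import("B.main", "util.parse", rename))
print(rewrite_import("frog", "P", rename))
```

A module outside A, B and frog that imports 'util.parse' or 'P' gets the
name back unchanged, so it keeps using the default configuration.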
Dealing with DOS
----------------
In order to write Python systems that are portable to DOS, one of the things
you have to worry about is making sure that all of your module names are
unique in the first 8 characters (independent of case). That's because of
the way that module names are mapped to file names. Since Python is used
to a large extent by Unix people in an academic environment, I'm sure that
many Python programmers will not spend any time worrying about this; and
therefore, problems *will* occur when porting Python software to DOS.
My proposal does not completely eliminate the 8 characters of significance
problem, but it does help. First, because it provides a hierarchical name
space, developers are more likely to name a collection of modules like this:
foonly.breadbasket
foonly.bakedgoods
than like this:
foonly_breadbasket
foonly_bakedgoods
The former pair of module names are distinct under DOS; the latter are not,
because both are cut down to 'foonly_b'.
Second, my system allows you to rename the first component of a module name.
So, if the second pair of module names were to appear in a program,
they could be mapped to distinct DOS filenames.
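The effect of the 8 character limit can be sketched as follows. The sketch
assumes the simplest possible mapping: each component of the module name is
cut to 8 characters and upper-cased, and dots become directory separators.
The function name dos_filename is an assumption made for the example.

```python
def dos_filename(module_name):
    """Map a dotted module name to a DOS-style file name (simplified)."""
    # DOS keeps only 8 characters per name and ignores case.
    components = [part[:8].upper() for part in module_name.split(".")]
    return "\\".join(components) + ".PY"


for name in ("foonly.breadbasket", "foonly.bakedgoods",
             "foonly_breadbasket", "foonly_bakedgoods"):
    print(name, "->", dos_filename(name))
```

The two hierarchical names map to different files, while the two flat
names collide.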
Conclusion
----------
This paper was partly inspired by a discussion of the inadequacies of the
import statement and PYTHONPATH mechanism that occurred on the python-list
a few weeks ago. At that time, it was suggested that the problems with the
import statement could be remedied by adding an option to specify a filename.
I'd like to argue against this idea.
The problem with putting filenames into import statements is that they are
not portable across operating systems, and may not even be portable across
different machines on the same network. Consider an NFS file system that
is exported to Unix, DOS and Macintosh machines. The file that is called
/net/frog/foo.py on the Unix machines is n:\frog\foo.py on the DOS boxes,
and net:frog:foo.py on the Macs. Now imagine installing a large Python
system whose author has decided to make extensive use of import statements
containing host-syntax pathnames. Sounds like a nightmare, right?
By contrast, my system provides a kind of abstract path name that is machine
independent, and which can be flexibly retargeted using a configuration file
that is separate from the software being configured. In fact, my system
guarantees that you can reconfigure any collection of Python modules without
changing the code. This guarantee would go out the window if you could put
absolute pathnames, using host OS syntax, into import statements.
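As a closing illustration, here is a sketch of how one abstract module name
could be retargeted to the three host syntaxes mentioned above. The HOSTS
table stands in for a per-machine configuration file; its layout and the
function name host_filename are assumptions made for the example.

```python
# Per-host configuration: where the shared file system is mounted and
# which separator the host uses. Values follow the example above.
HOSTS = {
    "unix": ("/net", "/"),
    "dos": ("n:", "\\"),
    "mac": ("net", ":"),
}


def host_filename(module_name, host):
    """Turn a machine independent module name into a host file name."""
    root, separator = HOSTS[host]
    return separator.join([root] + module_name.split(".")) + ".py"


for host in ("unix", "dos", "mac"):
    print(host, host_filename("frog.foo", host))
```

The source code only ever mentions 'frog.foo'; everything that depends on
the host lives in the configuration table.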