How about an ARRAY type in Python

Stefan Esser (se@risc3.mi.Uni-Koeln.DE)
Fri, 22 Jan 1993 00:32:40 +0100

[Posting for Stefan Esser who is having trouble with mailing to the
list. Anyone else have a complaint about the list recently? --Guido]

Hi everybody !

Having been a long time user of Perl, I've given the just announced
Python 0.9.8 a try and was surprised of the clean design and easy
expandabilty of the language.

But there is one thing that I'd like to see changed for performance
reasons, and that is the fact, that the only array type supported is
the array of characters (that is strings).

I'm new to this list, so I don't know if there has been much discussion
on that topic and what the results (if any) have been...

Although the Python interpreter isn't fast compared to compiled languages,
I'd like to use Python for a purpose where Megabytes of data have to be
processed: the examination of nuclear physics spectroscopy data.

There is a library I've written some time ago, which performs conversion
and compression of our experimental data, but it has proven to be an
annoying task to combine these access functions into useable programs
performing tasks like matrix transpose or projections with typical
matrix sizes of 4096*4096 elements.

Python makes it (due to its exception concept) very easy to write programs
with reasonable behavior in the case of runtime or operator errors,
and catching exceptions in a user friendly way is the main reason for my
C programs' complexity now.

Now what I'd like to be able to do in Python:

1) Open existing matrix files in any of the supported formats
(little/big endian 2/4 byte integer, floats, text, compressed,...)
and getting information on file type and matrix dimensions
[ got that already to work fine ... ]

2) Access data from these files using the slice syntax of Python.
[ Quite simple if I'm going to unpack the arrays, the low level
routines operate on, into lists. But reading a typical matrix file
once would then mean unpacking 64 Mbyte of data into 16Mio int objects.
And that's why I (think that I) need an array object class. ]

3) Assign a 1-dim array (or a 1-dim slice of a 2-dim array) to a line or
column of a 2-dim array.
This ought to make it possible to write a matrix transpose this way:

for i in range(src.lines):
dst[i][:] = src[:][i]

The semantics of that operation have to be quite different from what
they are on lists now!
(Eg. src[:][i] is equivalent to src[i] now, if src is a list.)
This may require changes to the interpreter to work (I'm not sure
whether it can be easily implemented by __getitem__/__getslice__ and
corresponding __setXX__ methods ? If array_ass_slice() can be made to do
the right thing, then there was no need for any change to python ??).

4) Make this work not only on my special matrix file objects, but also on
a general Python object that represents an array in RAM by indexing
into a contigous large array, instead of as a list/tree structure.
The special object of 3) can now be made a special case of this more
general object.

5) Have a way to create array object instances, eg. using the following syntax:

a = 0[10] # 1-dim array of 10 integers, initialized to 0
f = 1.0[10][20] # 2-dim array of 10 lines of 20 floats each, init=1.0
o = None[10] # 1-dim array of 10 objects, initialized to <None>
s = ' '[100] # 100 byte string initialized to blank space
s = ''[100] # ... special case: as above, but init to '\0'

This array object ought (IMHO) to be represented by the following C struct:

typedef struct {
int mult; /* horner scheme multiplicator */
int limit; /* index range is [0 .. (limit-1)] */
}

typedef struct {
OB_HEAD
void *data;
char elemtype; /* elem_char, elem_int, elem_float, elem_obj */
char status; /* controls behaviour of slicing and assign. */
int dims; /* 0 = scalar, 1 = vector, 2 = matrix, ... */
dim_type dim[1]; /* actually repeated 'dims' times ... */
} arrayobject;

A 2-dim array A of 10 lines of 20 columns of integers might then be
represented by the following actual data:

data = <array base>
elemtype = elem_int
status = ( to be specified :)
dims = 2
dim[0].mult = 4 /* sizeof(int) */
dim[0].limit = 20
dim[1].mult = 20
dim[1].limit = 10

Slice and index operations only modify (a copy) of struct arrayobject,
eg. A[0:2][0:4] would correspond to the same data array, but the following
descriptor:

data = <array base>
elemtype = elem_int
status = ( to be specified :)
dims = 2
dim[0].mult = 4 /* sizeof(int) */
dim[0].limit = 4
dim[1].mult = 20
dim[1].limit = 2

A[1:3][2:6] would be identical, except that the data pointer would have
been adjusted to point to the address of A[1][2] instead of to the base
of the array.

data = <array base> + (( 1 ) * dim[1].mult + 2) * dim[0].mult
= <array base> + 88
elemtype = elem_int
...

Further ideas on array object semantics:

(This is what I need in my applications, it isn't neccessarily what the
way one would like an general array object to behave. That's the reason
I want to discuss this array object type ...)

Assignment compatibility requires same number of (free) dimensions and
same index range in each dimension. Each index (as opposed to slice)
operator reduces the dimension of the resulting object by one.

Given 3-dim arrays A,B of 10*20*30 elements (that is 30 per 'line'):

High dimension slices containing the full range of indizes can be
omitted: A[0] is interpreted as A[:][:][0] in the case of a 3-dim array.

Assignment compatibility:

B[:] = A[0:10][0:20][0:30] # same index ranges on both sides
B[:] = A[:] # same as above in this particular case

B[0:20][0] = A[0][0:20] # both got one index fixed and 20
# elements at the lowest non fixed
# index

B[0:2][0][0:2] = A[0][0:2][0:2] # same as above, except different
# number of elements

A[0][0:2][0:2] = ((1,2),(3,4)) # assignment between arrays and lists
# ought to work in both directions

Methods of array instances:

A.__min__() min(A[0:3]) # element with min. value
A.__max__() max(A{:]) # element with max. value
A.__len__() len(A[:]) # number of elements
A.__sum__() A[:] + B[:] # add two matrizes
A.__mult__() A[:] * n # mult each elem with scalar
A.__mult__() A[:] * B[:} # scalar product
...

Attributes:

A.dims # RO dimensions
A.dim # list of limits,
# highest dim first (or last ??)
??? A.elemtype # type object corresp. to elements

One think I'm considering useful is, to treat (given the above definition of
array A)

A[0] equiv. A[0][0][0],
but A[0:1] equiv. A[:][:][0:1] equiv A[:][:][0]

This would make access to the undelying element object easy without need
to have an index expression corresponding to the dimension of the array.

(Else in case of a 4-dim array A: A[0][0][0] equiv. A[:][0][0][0] ...
... that is: an 1-dim array as opposed to the element type (eg. integer)
in case of 'A' being a 3-dim array, given the above index expression).

(A flag, whether there has at least one slice op. [:] been applied to the
array, ought to be held in the '.status' element of the arrayobject struct.
Or perhaps the number of slice ops done ought to be kept in that field ...)

Hmm, that's enough for the moment I think (anybody still reading this ???).
Hope this wasn't to obfuscated (my thoughts and my english, that is ...).

Are their any design problems with my proposed array type ?

Anything to consider, to make it fit better into the Python frame ?

Is it impossible to do, what I'm trying to, because of language
limitations or design decisions in the Python compiler/interpreter ?

Any ideas ???

Thanks in advance,

STefan

--
 Stefan Esser,  Institute of Nuclear Physics,  University of Cologne,  Germany
 se@IKP.Uni-Koeln.DE                                           [134.95.192.50]
 se@MI.Uni-Koeln.DE                                            [134.95.208.16]