unevaluated sequences of initially indefinite length and 'filter'

Steven D. Majewski (sdm7g@elvis.med.virginia.edu)
Thu, 17 Feb 1994 14:24:00 -0500

First, some background and motivation, before we get into the surgery!

Lance got me to dig out my initial stab at a tar-file class, which got
me to thinking about sequences again. "pseudo-sequence" is the wrong
name: they DO always present the Abstract Data Type of a sequence, and
they often really ARE sequences, ( a tar-file certainly IS a sequence )
though they are sometimes not represented internally as a sequence, or
at least, not as the SAME sequence that they present to the outside
world.

Often, the primary reason for casting them as a sequence is so that
they can easily be processed sequentially by 'for'.

Some time ago, I posted an example of how to implement a sequence of
indefinite length by "promising" one more element that the current
length, until the end. ( And if the class reads-ahead one token, then
it doesn't even have to lie about it! ) This produces a stream like
sequence with initially one element, which attempts to produce or
evaluate more elements on demand.

For extremely large tar-files, I was considering a further
modification of this technique, where only an N-deep cache is kept.
Rather than growing full size when it had been processed to the end,
it would expand to a limit and stay there, throwing away previously
used values.

Using this in conjunction with 'filter' seemed like a good idea for
my tar-file class: the complete tar-file might be tens or hundreds of
megabytes, but a particular "interesting" subset might easily fit into
memory. And 'filter' could be used to produce a (real) list from the
sequence class.

So, I dusted off my old example, "flines.py" and appended a couple of
lines at the end to test it with 'filter' :
( I also had to change "builtin" to "__builtin__" . )

# Class Flines
#
# This is a demonstration/test of a class that "lazily" coerces
# the lines of a file into a sequence.
# It was written mostly to test the technique of lying about
# the length of the sequence by +1, except at the end.
# It has the feature that it will continue to try to read
# file when it gets to eof, so even if file continues to grow,
# so that:
# for x in flines_obj: print x[:-1]
# will always print whatever is currently available.
# ( But that also means that flines_obj[-1] is very context sensitive! )
#
# <sdm7g@Virginia.EDU>
#
import posix
import __builtin__

def rdopen( fname ): # open file or pipe for reading
if fname[-1] == '|' :
open = posix.popen
fname = fname[:-1]
else: open = __builtin__.open
return open( fname, 'r' )

class Flines:
def _lie( self ): return self._len + 1
def _get1( self ):
if self._cache[-1]:
self._cache.append( self._file.readline() )
self._len = self._len + 1
else:
self._cache[-1] = self._file.readline()
def __init__( self, fname ):
self._file = rdopen( fname )
self._cache = [ self._file.readline() ]
self._len = 1
def __len__( self ):
if self._cache[-1]:
return self._lie()
else:
self._get1()
if self._cache[-1]: return self._lie()
else: return self._len
def __getitem__( self, j ):
if j < 0 :
self._get1()
elif j >= self._len :
for index in range( self._len, j+1 ): self._get1()
return self._cache[j]
def __del__(self):
self._file.close()
def __getslice__( self, i, j ):
return self._cache[i:j]

F=Flines('nl -ba flines.py|' )
print '# Test me once...'
for x in F : print len(F),F._len, x[:-1]
print '# Test me twice ... '
for x in F : print len(F),F._len, x[:-1]

# test 'filter' --
# find all of the lines containing either "def" or "class" :
#
from string import find
print "--- test 3"
for x in filter( lambda s: (find(s, 'def') > -1 ) or (find(s,'class') > -1), F ) :
print x[:-1]

print "--- test 4"
for x in filter( lambda s: (find(s, 'def') > -1 ) or (find(s,'class') > -1), Flines( 'nl -ba flines2.py|') ) :
print x[:-1]

--------------------------

Test-3, using an already fully evaluated sequence, produced the
correct output, but Test-4 produced NOTHING.

A look at bltinmodule.c gives the answer: Unlike 'for', which evaluates
the length each time thru the loop ( which fact is what suggested the
trick to me in the first place! ), filter gets the length once and
then uses that length in a C 'for' loop to process the entire sequence.

A change of:

< for (i = j = 0; i < len; ++i) {

to:

> for (i = j = 0; i < (len = (*sqf->sq_length)(seq)) ; ++i) {

Almost did the trick, but produced an attribute error. The output
sequence is initially allocated to the size of the input sequence,
and then at the end, any excess elements are set to nulls.

I made to other changes:
(1) If setlistitem() fails, filter now tries addlistitem() to append
to the list. The fixes the attribute error problem.

(2) I preceeded the update of the len by a "( i < len ) ||" test,
so that filtering a sequence of fixed length will not incur the
excess overhead of evaluating the length again.

# --- diffs for bltinmodule.c:

110c110
< for (i = j = 0; i < len; ++i) {

---
> 	for (i = j = 0; (i < len) || (i < (len = (*sqf->sq_length)(seq))) ; ++i) {
134c134,135
< 			if (setlistitem(result, j++, item) < 0)
---
> 			if (setlistitem(result, j++, item) < 0) {
> 				if ( addlistitem( result, item ) < 0 )
135a137
> 			}

With these mods, filter of an indefinite length sequence will work, and test-4 in flines.py (above) will produce the same output as test-3. ( And when I rewrite the tarfile class to take advantage of it, you will be able to scan thru a tape with: MyFiles = filter( Predicate, TarFile( '/dev/rmt0' ) where Predicate is a function that returns True on the desired combination of attributes. )

- Steve Majewski (804-982-0831) <sdm7g@Virginia.EDU> - UVA Department of Molecular Physiology and Biological Physics

[ I wonder whether 'for' can be optimized to NOT always evaluate the length ? ]