Re: Ideas about enhancements to fileobjects

Tracy Tims (tracy@gold.sni.ca)
Tue, 23 Nov 1993 17:03:36 -0500

Text with >> is by Guido, responding to me.
Text with > is John Redford, jredford@lehman.com, responding to Guido.
Text with no prefix is me, responding to John Redford. Most of my response
follows the included message text.

>> filename: good. It's saved already so should be accessible.

>There are no guarantees that this file has not been moved, or removed,
>replaced with another file or link, or otherwise modified.

>This is data you had when you made the file. I think if you want it
>around you should keep it from then. Pass around a tuple with the file
>name & the file object, don't try to put the name & other application
>data into the object. Next someone will want the address & port of a
>socket to be part of the object.

Actually, the address and port ARE part of a socket "object".
If you look at the file-descriptor/socket/TCP/IP connection
implementation, you will find something very much like
object-oriented programming, with methods in the derived
classes calling methods in their superclasses. The address
and port are in the "instance" of a socket object. And
certainly you must agree that the information associated
with the address and port is public, even though their
storage is not. -TT

>> lineno: useful, but has one problem: it can't always be correct.
>> Keeping it up-to-date after read() is possible but may slow read() of
>> large files down a bit; keeping it up-to-date after seek() is
>> (realistically speaking) impossible. And a minor detail: should it
>> represent the number of lines read so far or the number of the next
>> line?
>>
>> Suggestion: make it a writable attribute, initialized to 0; set to -1
>> by seek(); if it's -1, it's left unchanged by read() and readline();
>> if >= 0, readline() bumps it by 1, read() bumps it by the number of \n
>> characters in the string read. Well, why not do the same for
>> writeline() and write()... Finally, initialize it to -1 when the file
>> is opened with mode 'rb' or 'wb'. I suggest that the filename be made
>> a writable attribute as well -- might be useful to cheat etc.

>This is completely untrustable. If you want to count the number of
>'\n's you have read, that's fine, but that doesn't prevent someone from
>inserting more into the top of the file. If you want a number that
>equals the number of times you call readline(), that's easy enough to
>keep on your own.

>> peek functions: I'm less convinced that this is worth the additional
>> complexity -- and I've a feeling that it might encourage bad style (oh
>> there he goes again I hear some of you thinking :-). On the other
>> hand it might be a good idea. I've a suggestion for a slightly
>> different style of interface: f.peekline() would return the next
>> unpeeked line and f.peekline(n) would return the n'th line (counting
>> from 0, obviously). I don't see when f.peekreset() would be necessary
>> -- for definiteness, code should always use f.peekline(n) if there may
>> be different pieces of code peeking in the same file. Maybe
>> f.peekline() should mean f.peekline(n+1) when called after
>> f.peekline(n) if I understand correctly how you would use this most of
>> the time.

>I don't think this has any redeeming aspect. 'peek' semantics are not
>guaranteed past 1 character. Peeking a regular file makes no sense. If
>you want to read the next 2 lines then seek back to where you are, do
>that. Or open 2 file descriptors & use one for read ahead. Using these
>peek function on a file that represented a socket would be a minor
>nightmare, as it would break any other dup'd readers of the socket.

Actually, the peek function isn't what causes the problem. Any
buffered i/o causes the problem. Another strawman burns merrily. -TT

>Oh, and this would definitely encourage bad style. _using_ it is bad
>style.

Wrong. (I've always wanted to say that.) -TT

>This mostly looks like cruft that would slow down files just to make
>some applications minorly easier. Parsers aren't really the kind of
>thing one expects to write more than once, if that, and it isn't
>supposed to be trivial even then.

I have proposed a natural, useful enrichment of an existing (de facto) python
abstract data type. You have objected to my proposal by saying:

a) since my implementation can be subverted, it is wrong
to do it.
b) enriching the line-oriented abstraction will reduce
python i/o performance.
c) instead of using an appropriate abstract data-type
to build parsers, I should use implementation-specific
features of files. Part of your reasoning seems to be that
parsers are supposed to be difficult to build.

I'll take these points in order.

So what if the implementation can be subverted by calling related
functions that are not part of the abstraction? There is no
guarantee that a private data-attribute of a python instance or
module won't be changed by a programmer. Does this mean that classes
and modules should be removed from python? It is also possible to
subvert stdio by calling system i/o calls. Does this mean that
buffering should be removed from stdio?

You were worried that a file's name could change, invalidating the
saved copy of the name. But most programs don't have files open
long enough to make it worth designing them for expected concurrent
access to their files. Let's take a poll: how many people write
text-processing scripts that check to see if a file has been
renamed, so that they can always issue a correct error message if
an error occurs? Furthermore there are more important properties
(like the data) of a file which can be changed concurrently with
a program that is using the file. This whole argument is completely
irrelevant.

As for a performance disadvantage in i/o, I doubt you will be able
to find any significant disadvantage within the existing overhead
of a python program. I have been doing some python profiling, so
I'm pretty confident about this. Calling a method in python swamps
many other performance issues. Checking an int for -1 on each i/o
call is not something you'll notice. If and when methods get
faster, we can make the i/o system faster (one way is to have python
incorporate its own buffering, so that we can guarantee that we
only check for -1 once per buffer fill).

Now the meat of your argument seems to be your disagreement with how
I want to program, and I must admit that it took me a little while to
figure out that you think that good style is to program by using
operating-system specific features of Unix files, sockets and pipes.

Let me see if I can understand how you would have me write a parser.
I should use seek() or two file descriptors if I want to read ahead.
I assume I am out of luck if I can't seek the fd, because you don't
like readahead mechanisms ("cruft", remember). So that means that
I can only write parsers for seekable fd's, unless I want to keep
all of the state internally. And I have to write my parser with
an OS-dependent mechanism. Not only that, but Unix doesn't let me
find out if an fd is seekable, so I can't even have the parser
check its arguments for correct type.

This is total nonsense.

As for placing the filename and the fileobject into a tuple and
passing that around--a tuple is NOT a structure. You seem to be
arguing that unnamed aggregates of different data types are a good
idea. This runs counter to my own experience and most of modern
programming practice. I think that the performance advantage of
python tuples over class instances is one of python's greatest
weaknesses: it makes people comfortable with this kind of programming
style and argument. (Hey Guido, I've just given you advance notice
of a gripe I was going to make one of these days!)

So here we have this de facto abstract data-type in python: the
sequential, line-oriented input source. It has some operations
already implemented (f.readline(), f.readlines()), and what's more
important, it's an ADT which harmonizes with the python view of the
world. Python lists, python strings, and python pattern-matching
encourage a line-oriented view of file-scanning. So it makes complete
and utter sense to recognise it as a crucial python ADT, and to add
important functionality to it. (I recognise that this ADT could in the
future be superseded by a wonderful generalization of sequences, but
that won't happen immediately.)

The interesting questions are "what should the data type do?" and "where
should it live?"

A file handle, and the name used to create it, are (I would argue)
sufficiently closely related that they belong in the same object. If
it helps you, think of the file name as a printable comment on the
origin of the file handle. That certainly belongs in the same object.

The line number can be computed more easily, and more reusably, in
the sequential line object than anywhere else. It is the line object's
equivalent of a file pointer. A clue: programmers all over the world
keep rewriting the same code for tracking the line number of an input
file. They shouldn't have to.
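
To make the idea concrete, here is a rough sketch (not a proposed
implementation) of an object that carries the file, its name, and a
line counter together. The class name LineSource and the helper
where() are hypothetical; the lineno rules follow the suggestion
quoted above (start at 0, set to -1 by seek(), left alone while it
is -1). One assumption of mine: readline() does not bump the counter
when it returns the empty string at end of file.

```python
import io


class LineSource:
    """Hypothetical wrapper: a file object plus its name and a line counter."""

    def __init__(self, fileobj, name="<unknown>"):
        self.file = fileobj
        self.name = name   # printable comment on where the handle came from
        self.lineno = 0    # lines read so far; -1 means "unknown"

    def readline(self):
        line = self.file.readline()
        # Assumption: an empty string (end of file) is not counted as a line.
        if line and self.lineno >= 0:
            self.lineno += 1
        return line

    def read(self, size=-1):
        data = self.file.read(size)
        if self.lineno >= 0:
            self.lineno += data.count("\n")
        return data

    def seek(self, offset, whence=0):
        # After a seek the line number can no longer be trusted.
        self.lineno = -1
        return self.file.seek(offset, whence)

    def where(self):
        """Return 'name:lineno' for use in error messages."""
        return "%s:%d" % (self.name, self.lineno)


src = LineSource(io.StringIO("alpha\nbeta\ngamma\n"), name="example.txt")
src.readline()
src.readline()
print(src.where())
```

The only per-call cost beyond the underlying read is one integer
comparison, and error messages can use src.where() instead of each
script tracking the name and count by hand.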

As for pushback/readahead, I always thought it was pretty
non-controversial, until now. Parsers need to backtrack. The
question is, do they do it on their input, or do they do it with
their own data structures? And here I disagree with the thesis
that parsers are supposed to be hard. Let me produce a concrete
example. With an appropriate API and with lookahead over the input
stream, it is possible to write recursive-descent parsers in which
you can use other people's parsers to parse some of your non-
terminals. The nice thing about a pushback or readahead model is
that a function doesn't have to consume input unless it understands
it. I fail to see how writing a program where the state of the
input stream is easily characterised (only consume what you recognise)
and where the backtrack state is easily understood (it's your input
language) is "bad style". Has there been a revolution in programming
that has passed me by?
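
Here is a small sketch of the kind of thing I mean, under stated
assumptions. PeekableLines, parse_header and parse_message are
hypothetical names invented for illustration. peekline(n) returns
the n'th unconsumed line counting from 0, as in the quoted
suggestion; for simplicity, peekline() with no argument always means
line 0 rather than "the next unpeeked line". parse_header stands in
for somebody else's parser: it consumes only the lines it
recognises, so the caller can carry on from exactly where it stopped.

```python
import io
from collections import deque


class PeekableLines:
    """Hypothetical line source with lookahead; only readline() consumes."""

    def __init__(self, fileobj):
        self.file = fileobj
        self.pending = deque()  # lines read from the file but not yet consumed

    def peekline(self, n=0):
        """Return the n'th unconsumed line (from 0), or '' past end of input."""
        while len(self.pending) <= n:
            line = self.file.readline()
            if not line:
                return ""
            self.pending.append(line)
        return self.pending[n]

    def readline(self):
        if self.pending:
            return self.pending.popleft()
        return self.file.readline()


def parse_header(src):
    """Someone else's parser: consumes 'key: value' lines and nothing more."""
    fields = {}
    while ":" in src.peekline():
        key, value = src.readline().split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def parse_message(src):
    """My parser: delegates the header non-terminal, then reads the body."""
    header = parse_header(src)
    if src.peekline() == "\n":
        src.readline()  # consume the blank separator line
    body = []
    while src.peekline():
        body.append(src.readline())
    return header, "".join(body)


text = "From: tracy\nSubject: files\n\nhello\nworld\n"
header, body = parse_message(PeekableLines(io.StringIO(text)))
print(header)
print(body)
```

Note that nothing here needs seek(), so the same code would work on
a pipe or a socket as well as on a regular file.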

Tracy Tims
tracy@sni.ca