|Title:||Adding % formatting to bytes and bytearray|
|Last-Modified:||2014-03-02 16:41:09 +1000 (Sun, 02 Mar 2014)|
|Author:||Ethan Furman <ethan at stoneleaf.us>|
|Post-History:||2014-01-14, 2014-01-15, 2014-01-17, 2014-02-22|
While interpolation is usually thought of as a string operation, there are cases where interpolation on bytes or bytearrays makes sense, and the work needed to make up for this missing functionality detracts from the overall readability of the code.
With Python 3 and the split between str and bytes, one small but important area of programming became slightly more difficult, and much more painful -- wire format protocols.
This area of programming is characterized by a mixture of binary data and ASCII compatible segments of text (aka ASCII-encoded text). Bringing back a restricted %-interpolation for bytes and bytearray will aid both in writing new wire format code, and in porting Python 2 wire format code.
All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers.
    >>> b'%4x' % 10
    b'   a'
    >>> '%#4x' % 10
    ' 0xa'
    >>> '%04X' % 10
    '000A'
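As a rough illustration of "will work as they do for str", the sketch below compares bytes interpolation with the ASCII-encoded result of the equivalent str interpolation. It assumes an interpreter that implements this proposal (CPython 3.5 or later); the helper name `same_as_str` and the list of sample formats are made up for the example.

```python
def same_as_str(fmt, value):
    """Return True if bytes %-formatting matches the ASCII-encoded str result.

    Hypothetical helper, not part of the proposal.
    """
    bytes_result = fmt.encode('ascii') % value
    str_result = (fmt % value).encode('ascii')
    return bytes_result == str_result


# A few numeric codes with width, padding, and justification modifiers.
checks = [
    ('%4x', 10),
    ('%#4x', 10),
    ('%04X', 10),
    ('%o', 8),
    ('%.2f', 3.14159),
    ('%-6d|', 42),
    ('%e', 1234.5),
]

results = {fmt: same_as_str(fmt, value) for fmt, value in checks}
```

If the numeric codes behave as described, every entry in `results` should be true.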
%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1, not from a str.
    >>> b'%c' % 48
    b'0'
    >>> b'%c' % b'a'
    b'a'
%s is restricted in what it will accept:
- input type supports ``Py_buffer``? use it to collect the necessary bytes
- input type is something else? use its ``__bytes__`` method; if there isn't one, raise a ``TypeError``
    >>> b'%s' % b'abc'
    b'abc'
    >>> b'%s' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: 3.14 has no __bytes__ method, use a numeric code instead
    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: 'hello world!' has no __bytes__ method, perhaps you need to encode it?
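The two rules for %s can be sketched as follows, assuming an interpreter that implements this proposal (CPython 3.5 or later). The `Packet` class and the `rejects` helper are hypothetical names invented for the example, and the exact wording of the error messages may differ from the ones shown in this PEP.

```python
import array


class Packet:
    """Hypothetical type that knows its own wire representation."""

    def __init__(self, payload):
        self.payload = payload

    def __bytes__(self):
        return b'PKT:' + self.payload


# Objects supporting the buffer protocol are copied in directly.
from_bytes = b'%s' % (b'abc',)
from_view = b'%s' % (memoryview(b'abcdef')[2:4],)
from_array = b'%s' % (array.array('B', [65, 66, 67]),)

# Anything else must provide a __bytes__ method.
from_dunder = b'%s' % (Packet(b'\x00\x01'),)


def rejects(value):
    """Return True if %s refuses the value with a TypeError."""
    try:
        b'%s' % (value,)
    except TypeError:
        return True
    return False
```

Under these assumptions, `rejects(3.14)` and `rejects('hello world!')` should both be true, since neither float nor str offers a buffer or a `__bytes__` method.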
Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use strings they must be encoded or otherwise transformed into a bytes sequence.
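A minimal sketch of encoding a str before interpolation, assuming an interpreter that implements this proposal (CPython 3.5 or later). The header name and body are invented sample data, and the choice of the ASCII codec is an assumption; the caller picks whichever encoding the wire format requires.

```python
header_name = 'Content-Length'      # text held as str (sample data)
body = b'\x00\x01\x02payload'       # binary payload (sample data)

# Encode the str explicitly, then interpolate it together with a number.
line = b'%s: %d\r\n' % (header_name.encode('ascii'), len(body))
message = line + b'\r\n' + body
```

Here `line` mixes an encoded text segment with a numeric code, which is the kind of ASCII-compatible framing around binary data that wire formats typically need.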
%a will call ascii() on the interpolated value's repr(). This is intended as a debugging aid, rather than something that should be used in production. Non-ASCII values will be encoded to either \xnn or \unnnn representation.
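The effect of %a can be illustrated with a short sketch. It shows the behaviour as implemented in CPython 3.5 and later, which is assumed here to match the description above; the sample values are arbitrary.

```python
# Non-ASCII characters are escaped, so the result is always pure ASCII.
as_text = b'%a' % ('caf\xe9',)      # code point below 0x100, escaped as \xnn
as_wide = b'%a' % ('\u20ac',)       # code point above 0xff, escaped as \unnnn
as_number = b'%a' % (3.14,)

# The result matches the built-in ascii() output, encoded as ASCII.
matches_ascii = as_text == ascii('caf\xe9').encode('ascii')
```

Because the output includes the quotes and escapes of the repr, it is useful for inspecting values in logs but is rarely what a wire format expects.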
%r (which calls __repr__ and returns a 'str') is not supported.
It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded.
It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").
- Rejected as this would lead to hard-to-debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.
Originally this PEP also proposed adding format-style formatting, but it was decided that format and its related machinery were all strictly text (aka str) based, and it was dropped.
Various new special methods were proposed, such as __ascii__, __format_bytes__, etc.; such methods are not needed at this time, but can be visited again later if real-world use shows deficiencies with this solution.
The objections raised against this PEP were mainly variations on two themes:
- the ``bytes`` and ``bytearray`` types are for pure binary data, with no assumptions about encodings
- offering %-interpolation that assumes an ASCII encoding will be an attractive nuisance and lead us back to the problems of the Python 2 ``str``/``unicode`` text model
As was seen during the discussion, bytes and bytearray are also used for mixed binary data and ASCII-compatible segments: file formats such as dbf and pdf, network protocols such as ftp and email, etc.
bytes and bytearray already have several methods which assume an ASCII-compatible encoding: upper(), isalpha(), and expandtabs(), to name just a few. %-interpolation, with its very restricted mini-language, will not be any more of a nuisance than the already existing methods.
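The ASCII assumption in the existing methods can be seen in a short sketch. The sample values are arbitrary, and Latin-1 is used only as an example of a non-ASCII encoding.

```python
ascii_word = b'abc'
latin1_word = 'caf\xe9'.encode('latin-1')   # last byte is 0xe9, outside ASCII

# upper() only maps the ASCII letters a-z; other bytes pass through unchanged.
upper_ascii = ascii_word.upper()
upper_latin1 = latin1_word.upper()

# isalpha() only recognises ASCII letters, so the 0xe9 byte makes it False.
alpha_ascii = ascii_word.isalpha()
alpha_latin1 = latin1_word.isalpha()

# expandtabs() treats byte 0x09 as an ASCII tab character.
expanded = b'a\tb'.expandtabs(4)
```

In each case the method interprets the bytes as ASCII text and leaves anything outside that range alone, which is the same stance the restricted %-interpolation takes.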
It has been suggested to use %b for bytes as well as %s.
- Pro: clearly says 'this is bytes'; should be used for new code.
- Con: does not exist in Python 2.x, so we would have two ways of doing the same thing, %s and %b, with no difference between them.
|||neither string.Template, format, nor str.format are under consideration|
|||to use a str object in a bytes interpolation, encode it first|
|||%c is not an exception as neither of its possible arguments are str|
|||http://docs.python.org/3/c-api/buffer.html examples: memoryview, array.array, bytearray, bytes|
|||mainly implicit encode/decode, with intermittent errors when the data was not ASCII compatible|
This document has been placed in the public domain.