Re: Is this a regex bug or just me?

Guido.van.Rossum@cwi.nl
Tue, 07 Feb 1995 14:59:29 +0100

> Basically, you are right. It is a difference; whether it is a bug is
> a matter of opinion. The problem is that the value left in a register
> is not well defined in my code if the parenthesized expression is
> backtracked away. I have seen the same "problem" previously with
> regexps like \([a-zA-Z0-9]\)+ where people expect the register to
> contain the whole identifier. The fix there is to write the
> expression as \([a-zA-Z0-9]+\). The same thing happens with \(xyz\)?,
> where the workaround is to write it as \(\(xyz\)?\).

But that fix doesn't directly translate to Skip's problem -- he had
\(...\)? which was correctly empty, but a \(...\) pair *inside* still
had its old value.
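
For reference, here is a minimal sketch of the two rewrites quoted above.
It uses the present-day Python `re` module as a stand-in, which is an
assumption: it is not the regexp package under discussion, and it writes
groups as ( ) rather than \( \). The variable names are illustrative only.

```python
import re

# Group repeated: the register keeps only the last iteration.
last_only = re.match(r"([a-zA-Z0-9])+", "abc123").group(1)

# Repetition moved inside the group: the register holds the whole identifier.
whole = re.match(r"([a-zA-Z0-9]+)", "abc123").group(1)

# Optional group that does not take part in the match: register is unset (None).
unset = re.match(r"(xyz)?", "abc").group(1)

# Extra outer pair: register 1 always takes part, so it is empty, not unset.
empty = re.match(r"((xyz)?)", "abc").group(1)

print(repr(last_only), repr(whole), repr(unset), repr(empty))
```

This only shows what one engine does with the quoted patterns; it does not
reproduce the stale inner register that Skip saw.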

> I have been planning to change the semantics of parentheses in
> connection with backtracking to be the same as in GNU regexp for
> some time, but I haven't gotten around to it yet. Maybe now would be
> the time? I am rather pressed with a deadline right now, but should
> have time after about Feb 20th.

Good!

> In general it is not very clear what should be in the register in an
> expression such as \([a-z]\)+. There are several possible semantics:
> - make it the first thing that the expression matched
> - make it the last thing that the expression matched
> - some intermediate thing it matched
> - empty because the last time it did not match
> - concatenation of all the things it ever matched
> - all characters from beginning of first match to end of last match
>
> Remember, the regexp can also be like \(\([a-z]\)[0-9]\)+.
> What should now be in register 2? Suppose you match it against
> "d7f7a9f6g8sdd". Should \2 be "7", "9", "6", "8", "", or "7f7a9f6g8"?
> Should \1 be "d7", "f7", "g8", "d7f7a9f6g8", "", or something else?
>
> I don't have a clear opinion what should be in either \1 or \2.
> Currently, in my regexp package the value of a register is not
> well-defined if it is inside '+', '*', '?', or '|'.

Doesn't the GNU regex package define the semantics for this?
Otherwise I would vote for always making the contents of the registers
*including the contained registers* empty after backtracking; this is
safest. For ? and | this is the only way anyway, and it's clear that
for + and * there are too many problems with other interpretations.
Especially the argument about what \2 should be in \(\([a-z]\)[0-9]\)+
...
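
To make the nested case concrete, here is a small sketch that runs the
quoted pattern against the quoted string with the present-day Python `re`
module (again an assumption: a different engine from the one discussed
here, with ( ) instead of \( \)). Note that \2 wraps [a-z], so whatever it
holds is a letter. This engine happens to implement the "last thing that
the expression matched" choice from the list above; it says nothing about
what GNU regex or the package in question does.

```python
import re

# The nested pattern from the quoted message, in Python syntax.
m = re.match(r"(([a-z])[0-9])+", "d7f7a9f6g8sdd")

matched = m.group(0)   # everything the repeated group consumed
outer = m.group(1)     # register 1: the last letter+digit pair
inner = m.group(2)     # register 2: the letter from that same last pair

print(repr(matched), repr(outer), repr(inner))
```

The failed attempt on the trailing "sd" leaves no trace in either register:
both still hold the values from the last iteration that succeeded.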

--Guido van Rossum, CWI, Amsterdam <mailto:Guido.van.Rossum@cwi.nl>
<http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>