But that fix doesn't directly translate to Skip's problem -- he had
\(...\)? which was correctly empty, but a \(...\) pair *inside* still
had its old value.
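
For illustration only, here is a minimal sketch of that situation. It
assumes the present-day Python `re` module, not the regexp package
under discussion, so groups are written ( ) rather than \( \), and the
pattern and subject string are made up for the example. The optional
group first matches "ab" (setting the inner group), then fails on "c",
so the matcher backtracks and skips the optional group entirely:

```python
import re

# Hypothetical pattern: an optional outer group containing an inner group.
# Against "ab", the outer group matches "a" and "b" (so the inner group is
# set to "b"), then fails on "c". The matcher backtracks, skips the optional
# group, and matches the trailing literal "ab" instead.
m = re.match(r"(a(b)c)?ab", "ab")

whole = m.group(0)
outer, inner = m.groups()

# Under "empty after backtracking" semantics, both registers are unset.
print(repr(whole), repr(outer), repr(inner))
```

With this module both groups come back unset (None): the value the
inner group picked up before backtracking is discarded along with the
outer one.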
> I have been planning to change the semantics of parentheses in
> connection with backtracking to be the same as in GNU regexp for
> some time, but I haven't gotten around to it yet. Maybe now would be
> the time? I am rather pressed with a deadline right now, but should
> have time after about Feb 20th.
Good!
> In general it is not very clear what should be in the register in an
> expression such as \([a-z]\)+. There are several possible semantics:
> - make it the first thing that the expression matched
> - make it the last thing that the expression matched
> - some intermediate thing it matched
> - empty because the last time it did not match
> - concatenation of all the things it ever matched
> - all characters from beginning of first match to end of last match
>
> Remember, the regexp can also be like \(\([a-z]\)[0-9]\)+.
> What should now be in register 2? Suppose you match it against
> "d7f7a9f6g8sdd". Should \2 be "7", "9", "6", "8", "", or "7f7a9f6g8"?
> Should \1 be "d7", "f7", "g8", "d7f7a9f6g8", "", or something else?
>
> I don't have a clear opinion what should be in either \1 or \2.
> Currently, in my regexp package the value of a register is not
> well-defined if it is inside '+', '*', '?', or '|'.
Doesn't the GNU regex package define the semantics for this?
Otherwise I would vote for always making the contents of the registers,
*including the contained registers*, empty after backtracking; this is
safest. For ? and | this is the only way anyway, and it's clear that
for + and * there are too many problems with other interpretations.
Especially the argument about what \2 should be in \(\([a-z]\)[0-9]\)+
...
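
As a point of comparison only, the nested example quoted above can be
tried against one concrete implementation. This sketch assumes the
present-day Python `re` module (again, not the package being
discussed), which happens to keep whatever the last successful
iteration matched. In this pattern register 2 wraps [a-z], so it holds
a letter.

```python
import re

# Python spelling of \(\([a-z]\)[0-9]\)+ applied to the string from the
# quoted message.
m = re.match(r"(([a-z])[0-9])+", "d7f7a9f6g8sdd")

span = m.group(0)  # everything the repeated group consumed
g1 = m.group(1)    # outer register after the final iteration
g2 = m.group(2)    # inner register after the final iteration

print(span, g1, g2)  # d7f7a9f6g8 g8 g
```

This is just one of the candidate semantics from the list ("the last
thing that the expression matched"); it says nothing about which
choice is right for the package under discussion.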
--Guido van Rossum, CWI, Amsterdam <mailto:Guido.van.Rossum@cwi.nl>
<http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>