Re: Is this a regex bug or just me?

Tatu Ylonen (ylo@cs.hut.fi)
Tue, 7 Feb 1995 15:45:47 +0200

> > \([0-9]+\)/\([0-9]+\)\([ ]*-[ ]*\(\([0-9]+\)/\)?\([0-9]+\)\)?

> It may be a bug in Tatu Ylonen's code. I tried this in Emacs 19 and
> it seems to correctly make group 5 empty. I haven't looked at the
> code but can imagine that the register values (indicating where the
> parentheses match) are filled in while making partial matches and that
> nested register values are not erased when backtracking at a higher
> level. I'm not sufficiently versed in the code to be able to find a
> fix. Tatu?

Basically, you are right. It is a difference; whether it is a bug is
a matter of opinion. The problem is that the value left in a register
is not well defined in my code if the parenthesized expression is
backtracked away. I have seen the same "problem" previously with
regexps like \([a-zA-Z0-9]\)+ where people expect the register to
contain the whole identifier. The fix there is to write the
expression as \([a-zA-Z0-9]+\). The same thing happens with \(xyz\)?,
where the workaround is to write it as \(\(xyz\)?\).
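
For comparison, here is a minimal sketch of the two workarounds in
Python's re module, which is not the package discussed in this thread.
Python writes groups with unescaped parentheses, keeps the last
iteration of a repeated group, and reports None for a group that took
no part in the match; other engines may behave differently. The sample
strings are made up for illustration.

```python
import re

# Group inside the repetition: only one iteration is kept
# (Python's re keeps the last one).
inside = re.match(r'([a-zA-Z0-9])+', 'foo42').group(1)

# Repetition inside the group: the register holds the whole identifier.
whole = re.match(r'([a-zA-Z0-9]+)', 'foo42').group(1)

# Optional group that does not participate: Python reports None.
optional = re.match(r'(xyz)?', 'abc').group(1)

# Wrapped optional group: the outer group always participates,
# so it is an empty string rather than an unset register.
wrapped = re.match(r'((xyz)?)', 'abc').group(1)

print(repr(inside), repr(whole), repr(optional), repr(wrapped))
```

With the rewritten forms, the register of interest no longer sits
directly under '+' or '?', so its value does not depend on how the
engine treats iterations that were repeated or skipped.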

For some time I have been planning to change the semantics of
parentheses in connection with backtracking to match those of GNU
regexp, but I haven't gotten around to it yet. Maybe now would be
the time? I am rather pressed with a deadline right now, but should
have time after about Feb 20th.

In general it is not very clear what should be in the register in an
expression such as \([a-z]\)+. There are several possible semantics:
- make it the first thing that the expression matched
- make it the last thing that the expression matched
- make it some intermediate thing that the expression matched
- make it empty, because the final attempted iteration did not match
- concatenation of all the things it ever matched
- all characters from beginning of first match to end of last match

Remember, the regexp can also be like \(\([a-z]\)[0-9]\)+.
What should now be in register 2? Suppose you match it against
"d7f7a9f6g8sdd". Should \2 be "d", "a", "g", "", "dfafg", or "d7f7a9f6g"?
Should \1 be "d7", "f7", "g8", "d7f7a9f6g8", "", or something else?
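
The candidates can be enumerated mechanically. The following is a
rough sketch in Python (not the regexp package under discussion) that
steps through the repetition one iteration at a time and records where
each group matched. The helper name `candidates` and the dictionary
keys are made up for illustration.

```python
import re

text = 'd7f7a9f6g8sdd'

# Step through the repetition by hand: each iteration of
# \(\([a-z]\)[0-9]\) is matched where the previous one ended.
iteration = re.compile(r'(([a-z])[0-9])')
spans1, spans2 = [], []
pos = 0
while True:
    m = iteration.match(text, pos)
    if m is None:
        break
    spans1.append(m.span(1))
    spans2.append(m.span(2))
    pos = m.end()

def candidates(spans):
    """Return some candidate register values for one group."""
    pieces = [text[a:b] for a, b in spans]
    return {
        'first': pieces[0],
        'last': pieces[-1],
        'concatenation': ''.join(pieces),
        'first start to last end': text[spans[0][0]:spans[-1][1]],
    }

reg1 = candidates(spans1)
reg2 = candidates(spans2)
print(reg1)
print(reg2)
```

For register 1 the concatenation and the span from first start to last
end coincide, because its iterations are adjacent. For register 2 they
differ, since the digits lie between the letters it matched.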

I don't have a clear opinion on what should be in either \1 or \2.
Currently, in my regexp package the value of a register is not
well-defined if it is inside '+', '*', '?', or '|'.
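
As one point of comparison, this sketch shows how Python's re module
(again, not my package) treats registers under '?' and '|'. The
expression is the one quoted at the top of this message, rewritten in
Python syntax; the input "1/2 - 3" is a hypothetical example, not
taken from the original report.

```python
import re

# The expression from the top of the thread, in Python syntax.
pattern = re.compile(
    r'([0-9]+)/([0-9]+)([ ]*-[ ]*(([0-9]+)/)?([0-9]+))?'
)

# Hypothetical input: the digit after the dash is first tried as
# group 5, then that attempt is backtracked because no '/' follows.
m = pattern.match('1/2 - 3')
groups = m.groups()

# Alternation: only one branch participates in the match.
alt = re.match(r'(a)|(b)', 'b').groups()

print(groups)
print(alt)
```

Here groups 4 and 5 are reported as unset (None) after the backtrack,
which corresponds to the Emacs 19 behaviour described in the quoted
message, where group 5 comes out empty.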

I am open to suggestions.

Tatu