Is this a regex bug or just me?

Skip Montanaro (skip@automatrix.com)
Mon, 6 Feb 1995 14:26:47 -0500

I have a Python script that takes a slightly higher level (and a bit more
restrictive) specification for a regular expression, generates the
corresponding regexp, then uses it to match a bunch of lines in a file.
When it finds a match, it generates output in a different format using the
various groups that were matched. For instance, the following simple spec

%{month}/%{day}

generates the regular expression

\([0-9]+\)/\([0-9]+\)

and the dictionary:

{'month': 1, 'day': 2}

The dictionary maps field names to numeric arguments for the Python regex
module's group() function.
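
For anyone who wants to try this at home, here is a minimal sketch of how the regexp and the dictionary work together. It assumes the modern `re` module rather than the regex module I am actually using, so the grouping parentheses are written without backslashes; the helper name `extract` is made up for illustration.

```python
import re

# Modern-syntax equivalent of \([0-9]+\)/\([0-9]+\); in `re`, groups are
# written with bare parentheses.
pattern = re.compile(r'([0-9]+)/([0-9]+)')

# Field name -> group number, as in the dictionary above.
fields = {'month': 1, 'day': 2}

def extract(line):
    """Return {field: matched text} for a matching line, else None."""
    m = pattern.match(line)
    if m is None:
        return None
    return {name: m.group(index) for name, index in fields.items()}

print(extract('2/6'))
```

The point of the dictionary is that the output stage never needs to know group numbers, only field names.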

I have a somewhat more complicated specifier that works fine:

%{smonth}/%{sday}%{? - %{eday}}

The leading '?' says that particular chunk of the pattern is optional.

It generates

\([0-9]+\)/\([0-9]+\)\([ ]*-[ ]*\([0-9]+\)\)?

and

{'sday': 2, 'smonth': 1, 'eday': 4}

It successfully matches lines like

1/25-26
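
The translation step itself can be sketched roughly as follows. This is not my actual script, just a hypothetical `translate` function reconstructed from the examples in this message. It assumes every named field matches `[0-9]+`, every literal space becomes `[ ]*`, and every other character is copied through unchanged. Groups are numbered in the order their opening parentheses appear, which is why an optional chunk uses up a group number of its own.

```python
def translate(spec):
    """Translate a %{...} spec into (Emacs-style regexp, {field: group})."""
    fields = {}
    count = 0  # number of groups opened so far

    def parse(i, nested):
        nonlocal count
        out = []
        while i < len(spec):
            if spec.startswith('%{', i):
                count += 1
                if spec.startswith('?', i + 2):
                    # Optional chunk: translate its contents recursively.
                    body, i = parse(i + 3, True)
                    out.append('\\(' + body + '\\)?')
                else:
                    # Named field: record its group number.
                    end = spec.index('}', i)
                    fields[spec[i + 2:end]] = count
                    out.append('\\([0-9]+\\)')
                    i = end + 1
            elif spec[i] == '}' and nested:
                # End of the enclosing optional chunk.
                return ''.join(out), i + 1
            elif spec[i] == ' ':
                out.append('[ ]*')
                i += 1
            else:
                out.append(spec[i])
                i += 1
        return ''.join(out), i

    pattern, _ = parse(0, False)
    return pattern, fields

pattern, fields = translate('%{smonth}/%{sday}%{? - %{eday}}')
print(pattern)
print(fields)
```

The sketch does no error checking, so a spec with unbalanced braces will not be diagnosed sensibly.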

I have a multi-day string that crosses the end of a month:

4/30-5/1

so I built the following pattern:

%{smonth}/%{sday}%{? - %{?%{emonth}/}%{eday}}

which generated

\([0-9]+\)/\([0-9]+\)\([ ]*-[ ]*\(\([0-9]+\)/\)?\([0-9]+\)\)?

and

{'smonth': 1, 'sday': 2, 'emonth': 5, 'eday': 6}

just as I expected. It works fine for date strings of the form 1/25 or
4/30-5/1, but returns incorrect results for dates of the form 1/25-26. The
return value of group(5) is '26' instead of None. This is especially
perplexing since group(4), which encloses group(5), correctly returns None.

(For those with acute regexp-itis, group(4) and group(6) are nested inside
group(3). group(5) is nested inside group(4). Both group(3) and group(4)
are optional. I saw nothing in the Emacs regexp syntax info page that would
suggest optional regexps should not be nested within one another.)
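
To make the expected results concrete, here is a small sketch that runs the three kinds of date string through the nested pattern. It assumes the modern `re` module, with the pattern rewritten in its syntax (bare parentheses), so it shows the behaviour I expected and not necessarily what regexpr.c does. The helper name `parse_range` is made up for illustration.

```python
import re

# Nested-optional pattern in `re` syntax. Group numbers follow the order of
# the opening parentheses: 3 encloses 4 and 6, and 4 encloses 5.
PATTERN = re.compile(r'([0-9]+)/([0-9]+)([ ]*-[ ]*(([0-9]+)/)?([0-9]+))?')
FIELDS = {'smonth': 1, 'sday': 2, 'emonth': 5, 'eday': 6}

def parse_range(text):
    """Return {field: matched text or None}, or None if nothing matches."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return {name: m.group(index) for name, index in FIELDS.items()}

for sample in ('1/25', '4/30-5/1', '1/25-26'):
    print(sample, parse_range(sample))
```

For 1/25-26 the interesting field is emonth: group 5 sits inside the optional group 4, which takes no part in the match, so I would expect it to come back as None.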

I noticed that the version of Tatu Ylonen's regexpr.c code used in Python
did not seem to be the most recent, so I fetched the version that was posted
to comp.sources.misc (in volume 27) and the one patch I found for it (in
volume 29), merged Guido's changes into them, and rebuilt Python (1.1.1), but
saw no improvement.

Can anybody steer me in the right direction? Have I

a. overstepped the bounds of regular expressions (nesting multiple
optional regexps, perhaps)?
b. failed in my understanding of how they work?
c. generated a faulty regular expression?
d. found a bug in regexpr.c?
e. some, all or none of the above? :-)

Thanks,

--
Skip Montanaro		skip@automatrix.com			  (518)372-5583
Automatrix - World-Wide Computing Solutions	     http://www.automatrix.com/