Note: The mercurial server is disabled at the moment while I investigate whether it can run with an acceptably low CPU load – Mike.
A regexp translation scheme
Presented here is a translation from regular expressions to pseudocode for a matching subroutine. The generated code is suitable for use in an implementation of grep: it looks for an instance of the regular expression as a substring at every place within the text, and reports success or failure. For other applications, a different kind of matching is needed that looks for an instance of the regular expression only at the start of the text, possibly finding the longest instance that appears as a prefix of the text. Such applications can be accommodated by modifying the translation scheme shown here.
This translation scheme uses four scalar variables that can be kept in registers:
Char
– the current characterCloc
– the current location in CLISTCmax
– the last element of CLISTNmax
– the last element of NLIST
Cloc
and Cmax
point into an array of addresses that is the current CLIST, and Nmax
points into another array of addresses that is the current NLIST. Other registers will be needed for temporary values, including registers R1
and R2
that are used to hold the arguments of the NNODE
and CNODE
operations.
Global variables clist
and nlist
point to the beginning of these two arrays; they are swapped when the matching machine moves to the next character, with the current NLIST becoming the new CLIST, and the space occupied by the old CLIST being recycled for the new NLIST. The variables nlist
and clist
, and another variable ptr
that points to the next character in the text, are less frequently used and can live in memory.
Both NLIST and CLIST are held as arrays of code addresses, not arrays of instructions as in Thompson's original scheme. This avoids having to generate instructions dynamically during matching, but it forces a bit of overhead from indirection: instead of branching to the current location in CLIST, the FAIL
operation instead fetches an element of CLIST and branches to it indirectly: that is the meaning of the statement
goto *Cloc++;
written in the matching code. The current CLIST always ends (at the address held in register Cmax
) with the the label XCHG
, so that we don't have to test in implementing FAIL
whether the list has been exhausted: if so, control transfers to XCHG
automatically.
The matching routine
The outline of the matching routine is as follows, with a placeholder for the code implementing a specific regular expression.
function match(text): ptr = text; Nmax = nlist; next: // Swap clist and nlist Cmax = Nmax; Cloc = nlist; Nmax = clist; clist = Cloc; nlist = Nmax; *Cmax = xchg; // Fetch next character Char = *ptr++; // ------------------------- [ translate(regexp, found) ] // ------------------------- found: return TRUE; nnode: *Nmax++ = R1; fail: goto *Cloc++; cnode: *Cmax++ = R2; *Cmax = xchg; goto R1; xchg: if (Char != 0) goto next; return FALSE;
Note that after fetching the next character of the text (at label next
), this routine begins again to match the regular expression from the start, in addition to continuing to explore previous partial matches represented by the contents of NLIST from the previous cycle. Also, matching goes on until the input string is exhausted, and doesn't stop when the set of active states becomes empty.
Translating regular expressions
For the specific matching code in the middle of the routine, each regular expression is translated by a recursive routine
translate(regexp e, label ok)
into code that branches to a label ok
on success and to a fixed label FAIL
on failure, and can call the subroutines cnode
or nnode
to generate alternatives to be tried on the current character or on the next character respectively. We write NNODE(lab1)
for the sequence
R1 = lab1; goto nnode
and CNODE(lab1, lab2)
for the sequence
R1 = lab1; R2 = lab2; goto cnode;
The subroutine at nnode
never "returns", but rather saves its argument and then explores the next alternative from CLIST; the subroutine at cnode
saves lab2
as an alternative and returns to lab1
.
Translations of each kind of regexp follow.
- Literal character a
if (Char != a) goto fail; NNODE(ok);
- Sequencing e1 e2
[ translate(e1, lab1) ] lab1: [ translate(e2, ok) ]
- Alternation e1 | e2
CNODE(lab1, lab2); lab1: [ translate(e1, ok) ] lab2: [ translate(e2, ok) ]
- Closure e1 *
- using the equivalence e1 * = eps | e1 e1*
lab1: CNODE(ok, lab2); lab2: [ translate(e1, lab1) ]
- Empty string eps
goto ok;
- Empty language NULL
goto fail;
Refinements
The translation scheme can easily be extended to support other kinds of regular expression, such as the dot .
that matches any single character, classes of characters [a-z]
(conveniently represented by small bitmaps), and beginning and end of line anchors ^
and $
.
Some regular expressions, for example the expression a*a*a*a*a*a*
given in Thompson's paper, can lead to an explosion in the number of states recorded in the NLIST, because the same state can be recorded multiple times. Thompson's solution to this is to scan the NLIST whenever a state is to be added to it, and make the addition only if the state is not already present. A better implementation of this idea uses a data structure for sets that provides operations to add an element, test for membership and (crucially) to clear the set, each in O(1) time. A suitable data structure is a vector of timestamps: to add an element, we set its entry in the vector to the current time; to test membership, we see if the entry shows the current time; and to make the set empty, we need only to increment the current time, and all existing entries then become invalid. It is quite easy to implement this scheme with a third array kept alongside CLIST and NLIST.
Thunder support
Using this translation scheme depends on being able to put the address of any label in a register, and being able to branch to a code address held in a register. Thunder provides specific operations for these two actions: there is an instruction generated by
vm_gen(MOV, r, lab);
where r
is a register (type vmreg
) and lab
is a label (type vmlabel
). The instruction places the address associated with lab
into the register r
. There also an instruction generated by
vm_gen(JUMP, r);
that transfers control to such a label value. These instructions are not generally used for the main purpose of Thunder, namely acting as a target for JIT translators from the code of an abstract machine, but they are available for use in translation schemes like the one described above.