I have a list of (regex, replacement) pairs that take quite a long time to apply to some input text, so I was thinking about optimizing them a bit.

As far as I know, for each regular expression there is a deterministic finite automaton that accepts the same language, and since it's possible to merge DFAs, merging two regexes should be possible too.

My question now is how this interacts with the fact that I want to replace something and not just match it.


Basically: given two pairs of a regex and its replacement in a specific order:

1. /\./    -> "Found #DOT#"
2. /#DOT#/ -> "something"

Is it programmatically possible to create a new pair:

   /\./    -> "Found something"     

that applied to some input creates the same result?


Note: are e.g. Python regexes even regular in the sense of theoretical computer science? (I mean, would I run into problems with e.g. look-arounds?)

My reason for asking: some regex chains in my collection just replace things with other things so that later regexes can replace those in turn. That is only to keep the rules simple enough for a human to understand; a computer doesn't need to comprehend the regexes.

Fabian N.

1 Answer

I will describe a way that you could optimize the operation "find the first pattern that matches the input". Given a way to do that, you can repeatedly apply that operation: i.e., repeatedly find the first pattern that matches the input, apply the substitution, then repeat.
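The loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `rewrite_until_stable` and the `max_rounds` guard are hypothetical, "apply the substitution" is interpreted as replacing every occurrence of the chosen pattern, and the first-match step is done naively with `re.search` rather than with a merged automaton.

```python
import re
from typing import List, Tuple


def rewrite_until_stable(rules: List[Tuple[str, str]], text: str,
                         max_rounds: int = 1000) -> str:
    """Repeatedly apply the first rule whose pattern matches, until none match.

    max_rounds is a hypothetical safety guard: rule sets such as
    ("a" -> "aa") never reach a fixed point.
    """
    compiled = [(re.compile(pattern), repl) for pattern, repl in rules]
    for _ in range(max_rounds):
        for pattern, repl in compiled:
            if pattern.search(text):
                # Assumption: replace all occurrences of the winning pattern.
                text = pattern.sub(repl, text)
                break
        else:
            # No pattern matched, so the text is stable.
            return text
    raise RuntimeError("rules did not reach a fixed point")


rules = [(r"\.", "Found #DOT#"), (r"#DOT#", "something")]
print(rewrite_until_stable(rules, "a.b"))  # aFound somethingb
```

The inner `for` loop is the part that the automaton construction below is meant to speed up; everything else stays the same.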

However, I'm not sure that this will be any better in practice, so this might be useless (unless the number of regexps is quite small).

Union of NFAs

Any (language-theoretic) regular expression can be converted to a nondeterministic finite automaton (NFA) that recognizes the set of strings matched by the regexp (e.g., via Thompson's algorithm).

So, you can build an NFA for each pattern, and there are standard ways to build an NFA that represents the union of two or more NFAs, i.e., an NFA that accepts any input accepted by at least one of them. One such way is the product construction (run all the automata in parallel, after adding a dead state to each so that every transition is defined).

You can modify the product construction to find the first NFA that accepts an input. Let $N_1,\dots,N_k$ be $k$ NFAs. Each state of the new NFA is of the form $(q_1,\dots,q_k)$ where $q_i$ is a state in $N_i$. A state $(q_1,\dots,q_k)$ is accepting if any $q_i$ is an accepting state of $N_i$. Transitions are defined as in the product construction, except that every transition out of an accepting state loops back to that state. If $(q_1,\dots,q_k)$ is accepting, then find the smallest $i$ such that $q_i$ is accepting in $N_i$; this indicates that the $i$th regexp was the first match.
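A minimal sketch of this idea, simplified in two ways that are my assumptions rather than part of the construction: the automata are deterministic (so each component of the tuple is a single state), and the product is simulated on the fly instead of being built explicitly. The names `DFA`, `DEAD`, and `first_match` are hypothetical. Because accepting product states are absorbing, stopping at the first accepting tuple gives the same answer as running to the end of the input. Indices are 0-based here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Set, Tuple

DEAD = -1  # hypothetical sink state used when no transition is defined


@dataclass
class DFA:
    start: int
    accepting: Set[int]
    delta: Dict[Tuple[int, str], int]

    def step(self, state: int, ch: str) -> int:
        # Missing transitions (including any from DEAD) lead to DEAD.
        return self.delta.get((state, ch), DEAD)


def smallest_accepting(automata: Sequence[DFA],
                       states: Tuple[int, ...]) -> Optional[int]:
    """Return the smallest i such that component i is accepting, else None."""
    for i, (automaton, q) in enumerate(zip(automata, states)):
        if q in automaton.accepting:
            return i
    return None


def first_match(automata: Sequence[DFA], text: str) -> Optional[int]:
    """Run all automata in lockstep; report the first one to accept a prefix."""
    states = tuple(a.start for a in automata)
    winner = smallest_accepting(automata, states)
    for ch in text:
        if winner is not None:
            break  # accepting product states are absorbing
        states = tuple(a.step(q, ch) for a, q in zip(automata, states))
        winner = smallest_accepting(automata, states)
    return winner


only_a = DFA(start=0, accepting={1}, delta={(0, "a"): 1})
a_or_b = DFA(start=0, accepting={1}, delta={(0, "a"): 1, (0, "b"): 1})
print(first_match([only_a, a_or_b], "a"))  # 0
print(first_match([only_a, a_or_b], "b"))  # 1
```

Note that this reports an automaton accepting some prefix of the input. To find matches that start anywhere in the text, each pattern would need to be prefixed with "any characters" before building its automaton.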

In principle, you can use standard algorithms to run this NFA on the input, or convert the NFA to a DFA via the powerset construction and then run that DFA on the input. Either approach gives a way to find the first matching pattern. In practice, you might run into problems: running this NFA on the input might be no faster than running all the regexps one by one, and converting to a DFA might exhaust the available memory, since the DFA can have exponentially many states.
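To make the memory concern concrete, here is a small sketch of the powerset construction for an NFA without epsilon transitions. The function name `determinize` and the dictionary-based representation are assumptions chosen for illustration. The example NFA accepts strings over {a, b} whose third symbol from the end is `a`; it has 4 states, while the resulting DFA has 8.

```python
from collections import deque
from typing import Dict, FrozenSet, Set, Tuple

NFADelta = Dict[Tuple[int, str], Set[int]]


def determinize(alphabet: str, start: Set[int], accepting: Set[int],
                delta: NFADelta):
    """Powerset construction; only subsets reachable from the start are built."""
    start_set = frozenset(start)
    dfa_delta: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}
    dfa_accepting: Set[FrozenSet[int]] = set()
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accepting:
            dfa_accepting.add(current)
        for sym in alphabet:
            # Union of all NFA moves from any state in the current subset.
            nxt = frozenset(t for q in current for t in delta.get((q, sym), ()))
            dfa_delta[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start_set, dfa_accepting, dfa_delta, seen


# NFA for "the third symbol from the end is 'a'": state 0 guesses the position.
NFA_DELTA: NFADelta = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
    (2, "a"): {3}, (2, "b"): {3},
}
_, dfa_accepting, _, dfa_states = determinize("ab", {0}, {3}, NFA_DELTA)
print(len(dfa_states))  # 8
```

The same family of languages ("the $n$th symbol from the end is `a`") needs $n+1$ NFA states but $2^n$ DFA states, which is the kind of blow-up that can make the DFA route impractical for a large collection of patterns.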

D.W.