
I have a dictionary-like regular expression, an "or chain" of words,

word1|word2|word3|...

Unfortunately, the chain is too large. I'd like to find the minimal regular expression that is equivalent. How can I do that?

You should think of this as a regex like /^(word1|word2|word3)$/. I have a dictionary of words, and I want a minimal regex that will match if and only if the input string is a word in that dictionary. By minimal, I want the shortest regex that matches all words in the dictionary (and nothing else). I need a regex, not some other representation.

The words come from an SQL query, SELECT DISTINCT word FROM t ORDER BY length DESC, word, so I'm hoping for a solution that is optimized for practical use.

I was able to identify some heuristics that work for some special cases, but not a general algorithm:

  • It's easy to deal with accented variations: mãe|mae becomes m[ãa]e. This works fine.

  • Also, ABCCC|ABCC|ABC|DF can be automatically reduced to ABC{1,3}|DF or AB(?:C|CC|CCC)|DF.
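
The first heuristic can be sketched roughly as follows. This is only an illustration: mergeOneCharDiff is a hypothetical helper name, and it assumes the words contain plain letters, so nothing inside the character class needs escaping.

```javascript
// Hypothetical helper: merge two words that differ in exactly one character
// into a single pattern with a character class. Assumes plain letters only,
// so nothing inside the brackets needs escaping.
function mergeOneCharDiff(a, b) {
  // Normalize so that "ã" is one code point rather than "a" + combining tilde.
  const ca = Array.from(a.normalize("NFC"));
  const cb = Array.from(b.normalize("NFC"));
  if (ca.length !== cb.length) return null;
  const diff = [];
  for (let i = 0; i < ca.length; i++) {
    if (ca[i] !== cb[i]) diff.push(i);
  }
  if (diff.length !== 1) return null; // identical, or too different to merge
  const i = diff[0];
  return ca.slice(0, i).join("") + "[" + ca[i] + cb[i] + "]" + ca.slice(i + 1).join("");
}

console.log(mergeOneCharDiff("mãe", "mae")); // m[ãa]e
console.log(mergeOneCharDiff("mãe", "pai")); // null
```

Applied pairwise, this only covers same-length words; the repeated-character case (ABC{1,3}) needs a separate rule.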

As @Raphael noted, "minimizing regular expressions is hard", so it is important to focus on non-generic solutions.

I found a web page that describes how to convert a regex to NFA or DFA, but that doesn't help me get back a minimal regular expression. I'm willing to use an existing library, such as this PHP one, but how do I use the library for this purpose?

D.W.
Peter Krauss

2 Answers

2

This may not give the exact shortest regex, but since you're looking for a practical solution, you can start with the Aho-Corasick algorithm to obtain a trie-like DFA that can easily be written as a regular expression. The automaton is proven to be minimal for the word set; the regex you get from it is not guaranteed to be minimal, but it should be close to optimal. The optimizations you mention seem possible if you look for consecutive final states on a single branch of the constructed DFA.
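
As a rough sketch of the idea, the code below builds a plain trie and prints it as a regex. It is a simplification, not Aho-Corasick itself: the failure links are left out because they are not needed when the whole string is matched with ^...$ anchors. The names buildTrie, trieToRegex and escapeRegex are made up for this example.

```javascript
// Escape regex metacharacters in a literal string.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Plain character trie; `end` marks nodes where a dictionary word stops.
function buildTrie(words) {
  const root = { children: new Map(), end: false };
  for (const w of words) {
    let node = root;
    for (const ch of w) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), end: false });
      }
      node = node.children.get(ch);
    }
    node.end = true;
  }
  return root;
}

// Returns a pattern for everything that may follow `node`.
function trieToRegex(node) {
  const alts = [];
  for (const [ch, child] of node.children) {
    alts.push(escapeRegex(ch) + trieToRegex(child));
  }
  if (alts.length === 0) return ""; // leaf: nothing more to match
  if (alts.length === 1 && !node.end) return alts[0]; // single branch, no group needed
  const group = "(?:" + alts.join("|") + ")";
  // If a word ends here, the continuation is optional.
  return node.end ? group + "?" : group;
}

const pattern = "^" + trieToRegex(buildTrie(["ABCCC", "ABCC", "ABC", "DF"])) + "$";
console.log(pattern); // ^(?:ABC(?:C(?:C)?)?|DF)$
```

Consecutive final states on one branch show up here as nested optional groups, which is where a post-processing step could substitute a counted repetition such as C{1,3}.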

1

Here is my approach to this problem. Let's consider the following set of input words:

[
    "microsoft.com",
    "microsoft.net",
    "www.microsoft.com",
    "www.mAcrosoft.com",
]

We build 2 compressed tries, one for prefix analysis:

{
  "microsoft.": {
    "com": { "$": null },
    "net": { "$": null }
  },
  "www.m": {
    "icrosoft.com": { "$": null },
    "Acrosoft.com": { "$": null }
  }
}

and another, built from the reversed words, for postfix (suffix) analysis:

{
  "moc.tfosorc": {
    "im": {
      "$": null,
      ".www": { "$": null }
    },
    "Am.www": { "$": null }
  },
  "ten.tfosorcim": { "$": null }
}
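
One possible way to build such compressed tries is sketched below. The function names (commonPrefix, compressedTrie, reverse) are illustrative and not taken from the linked repository, and the sketch assumes that no word contains the "$" end marker.

```javascript
// Longest prefix shared by every word in a non-empty list.
function commonPrefix(words) {
  let p = words[0];
  for (const w of words) {
    let i = 0;
    while (i < p.length && i < w.length && p[i] === w[i]) i++;
    p = p.slice(0, i);
  }
  return p;
}

// Compressed (radix) trie as nested objects; "$" marks the end of a word.
function compressedTrie(words) {
  const node = {};
  const groups = new Map(); // first character -> words starting with it
  for (const w of words) {
    if (w === "") {
      node["$"] = null;
      continue;
    }
    if (!groups.has(w[0])) groups.set(w[0], []);
    groups.get(w[0]).push(w);
  }
  for (const group of groups.values()) {
    const p = commonPrefix(group); // edge label
    node[p] = compressedTrie(group.map((w) => w.slice(p.length)));
  }
  return node;
}

const reverse = (s) => Array.from(s).reverse().join("");

const words = ["microsoft.com", "microsoft.net", "www.microsoft.com", "www.mAcrosoft.com"];
const prefixTrie = compressedTrie(words);
const postfixTrie = compressedTrie(words.map(reverse));
console.log(JSON.stringify(prefixTrie, null, 2));
console.log(JSON.stringify(postfixTrie, null, 2));
```

For the four sample words this reproduces the two structures shown above.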

By tracing the branches, we populate what I call an 'affix table':

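| kind | affix         | count | count * len(affix) | remainders          |
|------|---------------|-------|--------------------|---------------------|
| post | crosoft.com   | 3     | 33                 | mi, www.mi, www.mA  |
| post | microsoft.com | 2     | 26                 | -                   |
| pre  | microsoft.    | 2     | 20                 | net                 |
| pre  | www.m         | 2     | 10                 | -                   |
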
  • Only take affixes that occur more than once
  • Order by count * len(affix) descending
  • Greedily check the input words from top to bottom and assign their remainders (i.e. the words with the affix removed) to the first matching affix
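
The table above can be reproduced with a short sketch like the one below. Instead of walking the two tries, it counts prefixes directly, which gives the same branching nodes for a small input but is quadratic, so treat it as an illustration only. sharedPrefixes and affixTable are hypothetical names, and the words are assumed to be distinct.

```javascript
const reverse = (s) => Array.from(s).reverse().join("");

// Prefixes shared by at least two words that cannot grow by one character
// without losing a word (the branching nodes of a compressed trie).
function sharedPrefixes(words) {
  const counts = new Map();
  for (const w of words) {
    for (let i = 1; i <= w.length; i++) {
      const p = w.slice(0, i);
      counts.set(p, (counts.get(p) || 0) + 1);
    }
  }
  const out = [];
  for (const [p, n] of counts) {
    if (n < 2) continue;
    const grows = [...counts].some(
      ([q, m]) => m === n && q.length === p.length + 1 && q.startsWith(p)
    );
    if (!grows) out.push({ affix: p, count: n });
  }
  return out;
}

function affixTable(words) {
  const rows = [
    ...sharedPrefixes(words).map((r) => ({ kind: "pre", affix: r.affix, count: r.count })),
    ...sharedPrefixes(words.map(reverse)).map((r) => ({
      kind: "post", affix: reverse(r.affix), count: r.count,
    })),
  ];
  for (const r of rows) r.score = r.count * r.affix.length;
  rows.sort((a, b) => b.score - a.score); // best affix first

  // Greedy pass: each word goes to the first row whose affix it carries.
  const unassigned = new Set(words);
  for (const r of rows) {
    r.remainders = [];
    for (const w of [...unassigned]) {
      const hit = r.kind === "pre" ? w.startsWith(r.affix) : w.endsWith(r.affix);
      if (!hit) continue;
      r.remainders.push(
        r.kind === "pre" ? w.slice(r.affix.length) : w.slice(0, w.length - r.affix.length)
      );
      unassigned.delete(w);
    }
  }
  return rows;
}

const table = affixTable(["microsoft.com", "microsoft.net", "www.microsoft.com", "www.mAcrosoft.com"]);
console.table(table);
```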

At this stage, the regex pattern (escaping omitted for readability) is:

(mi|www.mi|www.mA)crosoft.com|microsoft.net

Next, repeat recursively for mi|www.mi|www.mA, isolating the www.m prefix:

(mi|www.m(i|A))crosoft.com|microsoft.net

In general, continue until there are no more common affixes.
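
A compact, self-contained sketch of the whole recursion might look like this. It is a simplified approximation rather than the code from the repository: affixes are found by brute-force prefix counting, a row is only used when at least two unassigned words carry its affix, and the names (sharedPrefixes, buildPattern) are invented for the example. The alternatives may come out in a different order than in the hand-derived pattern, but the resulting regex should accept the same words.

```javascript
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const reverse = (s) => Array.from(s).reverse().join("");

// Prefixes shared by at least two words that cannot grow by one character
// without losing a word (the branching nodes of a compressed trie).
function sharedPrefixes(words) {
  const counts = new Map();
  for (const w of words) {
    for (let i = 1; i <= w.length; i++) {
      const p = w.slice(0, i);
      counts.set(p, (counts.get(p) || 0) + 1);
    }
  }
  const out = [];
  for (const [p, n] of counts) {
    if (n < 2) continue;
    const grows = [...counts].some(
      ([q, m]) => m === n && q.length === p.length + 1 && q.startsWith(p)
    );
    if (!grows) out.push({ affix: p, count: n });
  }
  return out;
}

// Factor out the best-scoring affixes, then recurse on the remainders.
function buildPattern(words) {
  const rows = [
    ...sharedPrefixes(words).map((r) => ({ kind: "pre", affix: r.affix, count: r.count })),
    ...sharedPrefixes(words.map(reverse)).map((r) => ({
      kind: "post", affix: reverse(r.affix), count: r.count,
    })),
  ].sort((a, b) => b.count * b.affix.length - a.count * a.affix.length);

  const unassigned = new Set(words);
  const alternatives = [];
  for (const { kind, affix } of rows) {
    const hits = [...unassigned].filter((w) =>
      kind === "pre" ? w.startsWith(affix) : w.endsWith(affix)
    );
    if (hits.length < 2) continue; // a lone word is emitted as a literal below
    hits.forEach((w) => unassigned.delete(w));
    const rest = hits.map((w) =>
      kind === "pre" ? w.slice(affix.length) : w.slice(0, w.length - affix.length)
    );
    const inner = "(?:" + buildPattern(rest) + ")"; // remainders are shorter, so this terminates
    alternatives.push(kind === "pre" ? escapeRegex(affix) + inner : inner + escapeRegex(affix));
  }
  for (const w of unassigned) alternatives.push(escapeRegex(w));
  return alternatives.join("|");
}

const dictionary = ["microsoft.com", "microsoft.net", "www.microsoft.com", "www.mAcrosoft.com"];
const built = new RegExp("^(?:" + buildPattern(dictionary) + ")$");
console.log(built.source);
console.log(dictionary.every((w) => built.test(w))); // true
```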

The final expression with escaping and modifiers:

const re = /^(?:(?:mi|www\.m(?:i|A))crosoft\.com|microsoft\.net)$/i;

console.log(re.test("microsoft.com"));     // true
console.log(re.test("microsoft.net"));     // true
console.log(re.test("www.microsoft.com")); // true
console.log(re.test("www.mAcrosoft.com")); // true

console.log(re.test("nonsense"));          // false


Real-world example: a regex for the top 100 domains from radar.cloudflare.com, built with a minimum common affix length of 3:

/^(?:(?:google(?:|apis|video|usercontent|syndication|adservices|-analytics|tagmanager)
|tiktok(?:cdn(?:|-us)|row-cdn|v)|cloudflare(?:|-dns)|amazon(?:aws||-adsystem)
|microsoft(?:|online)|(?:|cdn)instagram|app-analytics-services
|app(?:le|lovin|-measurement|sflyersdk)|(?:rocket-|ttlive|rbx)cdn|windows(?:|update)
|office(?:|365)|cdn-apple|(?:gst|pubm)atic|gvt(?:2|1)|vungle|(?:aapl|yt)img
|(?:spot|doublever)ify|facebook|youtube|icloud|live|netflix|bing|yahoo
|bytefcdn-oversea|taobao|samsung|snapchat|criteo|msftncsi|ui|unity3d|azure
|roblox|msn|ggpht|skype|linkedin|baidu|a2z|rubiconproject|adnxs|capcutapi
|digicert|xiaomi|gmail|taboola|android|qq|whatsapp|sharepoint|miui|mikrotik)\.com
|(?:akamai(?:|edge)|akadns|apple-dns|root-servers|fbcdn|whatsapp|doubleclick|cloudfront|fastly|trafficmanager|steamserver|office|windows)\.net
|(?:ntp|cdn77|wikipedia|3gppnetwork)\.org
|one\.one|dns\.google|sentry\.io)$/i

Full code: https://github.com/AlekseyMartynov/minimal-regex

amartynov