TRE: A Regex Engine with Approximate Matching

chaoxu · on Nov 6, 2018

An interesting theoretical problem that arises: Given a regular language L, define LD(L,k) to be the set of strings with Levenshtein distance at most k from some string in L.

1. Is LD(L,k) regular? (yes)

2. How hard is it to construct a regular expression(or DFA, NFA) of LD(L,k) given a regular expression(or DFA, NFA) of L? (probably it is something k times the original size in terms of NFA)

I believe a constant k here does not make much sense. since we might match super long strings, then having more errors seems ok. Maybe we can ask for the following:

LDM(L,ε) is the set of strings, such that if x in LDM(L,ε), then there is a y in L, such that distance between x and y is at most ε|y|.

LDM(L,ε) does not seem to be a regular language.

poorcedure · on Nov 7, 2018

If memory serves, it returns the number of deletes/inserts/edits. So what you mention can still be handled.

poorcedure · on Nov 7, 2018

Nevermind, you can set the cost/max of each:

https://laurikari.net/tre/documentation/regaexec/

tyingq · on Nov 6, 2018

Looks pretty handy, but wish it could capture the actual edit distance somewhere. You can set a max, but many use cases call for knowing how close the match was.

vram22 · on Nov 6, 2018

Apropos of regex engines: PCRE [1] is an interesting and somewhat widely used [2] C library for Perl-Compatible Regular Expressions.

I was once on a product team where we used PCRE, and one of my tasks was building it from its C source on many Unix variants (such as Linux, Solaris, HP-UX, AIX, Tru64 UNIX, etc.) and Windows (x 32/64-bit OS versions for some of them). I learned something about the differences between the C compiler toolchains on all those OSes in the process. There was quite a bit of variation between them, in the commands and steps needed to compile and link programs.

[1] https://www.pcre.org/

https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expres...

https://en.wikipedia.org/wiki/Philip_Hazel (PCRE creator)

[2] From the Wikipedia article about PCRE:

>A number of prominent open-source programs, such as the Apache and Nginx HTTP Servers, and the PHP and R scripting languages, incorporate the PCRE library; proprietary software can do likewise, as the library is BSD licensed. As of Perl 5.10, PCRE is also available as a replacement for Perl's default regular expression engine through the re::engine::PCRE module.

>The library can be built on Unix, Windows, and several other environments. PCRE is distributed with a POSIX C wrapper, a native C++ wrapper,[a] several test programs, and the utility program pcregrep built in tandem with the library.

poorcedure · on Nov 6, 2018

Years ago in a pinch, I used this library to isolate data in OCR results. Looking for phone numbers or tax IDs, it would return values with a B in it and you could correct to an 8 or other similar transforms. Probably much better tools for this now, but at the time it was extremely useful.

gitgud · on Nov 6, 2018

Interesting, would help with [1] ReDOS attacks? It would seem that fuzzy matching could put a upper limit on execution time right?

[1] https://en.m.wikipedia.org/wiki/ReDoS

onlydnaq · on Nov 6, 2018

I believe the correct way to protect against ReDOS is to use one of the many regular expression implementations that do not have exponential worst cases.

Nowadays there are good DFA or Tree based matching algorithms that are O(mn) where m is the length of the regular expression and n is the length of the input.

jey · on Nov 6, 2018

No, seems like it would result in a much bigger state machine that includes states and transitions for all those non-exact matches.

https://en.wikipedia.org/wiki/Finite-state_machine#Acceptors...

laumars · on Nov 6, 2018

This looks pretty interesting. Anyone know if this library has been ported to any other languages aside Perl?

tyingq · on Nov 6, 2018

C, Perl, Python, and Haskell: https://laurikari.net/tre/faq/