Skip to content

add multi-regex matching #156

@BurntSushi

Description

@BurntSushi
Member

Lately I've been thinking a lot about providing a "multi regex" similar to RE2's "regex set" functionality. The problem they solve is, "I have multiple regexes that I want to run over some large search text once, and I want to see every match." The poor man's way of doing this is to combine them in a single regex of alternations, e.g., re1|re2|re3|.... Two problems with that though:

  1. The current search machinery reports non-overlapping matches. That is, it's impossible for one alternation in a regex to share a match with another alternation in the same regex.
  2. To check which expressions matched, one adds capture groups and then inspect them after a match. Requiring capture groups for this functionality is bad because it incurs a performance penalty and simply isn't needed.

We can start relatively simple by providing an API that answers these three questions:

  1. Do any of the given regexes match anywhere? (analogous to is_match)
  2. If so, which of those regexes match? Where do they match? (analogous to find)
  3. Can you show me all matches? (analogous to find_iter)

Adding capture groups to this API seems possible, but is tricky, so I suggest doing that after an initial implementation is done.

Activity

added this to the 1.0 milestone on Jan 31, 2016
self-assigned this
on Feb 15, 2016
removed this from the 1.0 milestone on Feb 20, 2016
BurntSushi

BurntSushi commented on Feb 20, 2016

@BurntSushi
MemberAuthor

I managed to get something half-working in the multi branch. It should be able to correctly detect which regexes in a particular set match in some search text using a single pass, but if you ask for anything more than that (like where they matched, subsequent matches or sub-captures), then the results are not quite correct. For example, consider matching the regexes .* and a. What should the behavior be? In my opinion, it should report multiple matches of a (if they exist) before possibly reporting one match of .*. I contend that any other behavior is counter intuitive. The problem is that the correct behavior implies a significant change to the way the matching engines are currently written. It would require being able to use the matching engines in an iterator like fashion such that its internal state was carried over on subsequent searches.

Even though I wound up with something less powerful than what I wanted, I do think being able to tell which regexes match some text in a single pass is useful. At least one important use case I can think of is a URL router, which could contain possibly many regexes but usually only has one URL to search. It would be very hard to build something like that out-of-crate (and is also fast), which to me suggests it has a place in this library. If a caller needs more detailed info about the match (like captures), then they can re-run only the matched regexes on the search text one at a time.

added a commit that references this issue on Feb 22, 2016
73341d1
added a commit that references this issue on Feb 22, 2016
94d0ad4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

    Development

    No branches or pull requests

      Participants

      @BurntSushi

      Issue actions

        add multi-regex matching · Issue #156 · rust-lang/regex