Closed
Description
Lately I've been thinking a lot about providing a "multi regex" similar to RE2's "regex set" functionality. The problem they solve is, "I have multiple regexes that I want to run over some large search text once, and I want to see every match." The poor man's way of doing this is to combine them in a single regex of alternations, e.g., `re1|re2|re3|...`. Two problems with that though:
- The current search machinery reports non-overlapping matches. That is, it's impossible for one alternation in a regex to share a match with another alternation in the same regex.
- To check which expressions matched, one adds capture groups and then inspects them after a match. Requiring capture groups for this functionality is bad because it incurs a performance penalty and simply isn't needed.
We can start relatively simple by providing an API that answers these three questions:
- Do any of the given regexes match anywhere? (analogous to `is_match`)
- If so, which of those regexes match? Where do they match? (analogous to `find`)
- Can you show me all matches? (analogous to `find_iter`)
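To make the shape of those three questions concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `LiteralSet`, `SetMatch`, and `find_all` are names invented for illustration, and plain literal strings stand in for regexes. It also scans the text once per pattern, which is exactly the cost a real implementation would want to avoid, so it shows the API surface rather than the matching strategy. Ordering matches by end offset is an assumption of the sketch.

```rust
/// One reported match: which pattern matched and the byte range it covers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SetMatch {
    pub pattern: usize,
    pub start: usize,
    pub end: usize,
}

/// Hypothetical stand-in for a regex set. Each pattern is a literal string,
/// so the sketch needs nothing beyond the standard library.
pub struct LiteralSet {
    patterns: Vec<String>,
}

impl LiteralSet {
    pub fn new(patterns: &[&str]) -> Self {
        LiteralSet {
            patterns: patterns.iter().map(|p| p.to_string()).collect(),
        }
    }

    /// Question 1: does any pattern match anywhere? (analogous to `is_match`)
    pub fn is_match(&self, text: &str) -> bool {
        self.patterns.iter().any(|p| text.contains(p.as_str()))
    }

    /// Question 3: all matches of all patterns (analogous to `find_iter`).
    /// Matches of different patterns may overlap each other; matches of a
    /// single pattern do not, because `str::match_indices` is non-overlapping.
    /// Results are ordered by end offset, then start offset, then pattern index.
    pub fn find_all(&self, text: &str) -> Vec<SetMatch> {
        let mut out = Vec::new();
        for (pattern, p) in self.patterns.iter().enumerate() {
            for (start, _) in text.match_indices(p.as_str()) {
                out.push(SetMatch { pattern, start, end: start + p.len() });
            }
        }
        out.sort_by_key(|m| (m.end, m.start, m.pattern));
        out
    }

    /// Question 2: which pattern matches first, and where? (analogous to `find`)
    pub fn find(&self, text: &str) -> Option<SetMatch> {
        self.find_all(text).into_iter().next()
    }
}
```

With the patterns `ab` and `bc` over the text `xabcx`, `find_all` reports both matches even though they share the byte `b`, which a single alternation `ab|bc` would not do.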
Adding capture groups to this API seems possible, but is tricky, so I suggest doing that after an initial implementation is done.
BurntSushi commented on Feb 20, 2016
I managed to get something half-working in the `multi` branch. It should be able to correctly detect which regexes in a particular set match in some search text using a single pass, but if you ask for anything more than that (like where they matched, subsequent matches or sub-captures), then the results are not quite correct. For example, consider matching the regexes `.*` and `a`. What should the behavior be? In my opinion, it should report multiple matches of `a` (if they exist) before possibly reporting one match of `.*`. I contend that any other behavior is counterintuitive. The problem is that the correct behavior implies a significant change to the way the matching engines are currently written. It would require being able to use the matching engines in an iterator-like fashion, such that their internal state is carried over on subsequent searches.

Even though I wound up with something less powerful than what I wanted, I do think being able to tell which regexes match some text in a single pass is useful. At least one important use case I can think of is a URL router, which could contain possibly many regexes but usually only has one URL to search. It would be very hard to build something like that out-of-crate and still have it be fast, which to me suggests it has a place in this library. If a caller needs more detailed info about the match (like captures), then they can re-run only the matched regexes on the search text one at a time.
Add regex sets.