Closed
Description
In Mercurial, we have to use `^(?:<patterns>)`, where `<patterns>` is the combination of all patterns (transformed to regex) in the user's `.hgignore`, to remove the additional `.*` on each end. The user could input a regex like `)(`, which, while invalid on its own, would work with the workaround and would create a useless capturing group.
Adding an option to `RegexBuilder` and its `bytes` cousin seems like a good solution to me.
Activity
BurntSushi commented on May 8, 2020
OK, so I have some varied thoughts on this.
I think at a high level, adding an option as described here is probably the wrong path to take. In particular, your problem doesn't really have anything to do with `^` specifically, but rather, it's a problem of composition. ripgrep actually does similar things with regex composition in order to implement its `-w/--word-regexp` flag, but `rg ')(' -w` correctly exits with a syntax error. This is because it first attempts to parse the regex as given before trying to compose them. If you're doing regex composition, then I think this is the only correct way to do it. More specifically, you could depend on `regex-syntax` to run just the parser to check for syntax validity. It's a little extra work, but should overall be pretty cheap compared to the entire regex compilation process.
With that said, an "anchored search" is indeed kind of a special case. I think my plan at the moment is not to surface this as a compile-time option, but rather a search-time option. That is, perhaps in addition to `Regex::find` there will also be `Regex::find_anchored` (or whatever name). But that needs at least some API design work and won't happen until at least #656 is done.
Title changed from "Add option in `RegexBuilder` to not add `.*` around pattern" to "add anchored search APIs"
Alphare commented on May 11, 2020
You're right that checking the pattern first with `regex-syntax` should be pretty inconsequential in terms of runtime compared to building the DFA and even more so compared to the rest of the program. I have to thank whoever decided to split `regex` into modular crates to make this so easy. ;)
We will be using this "workaround" (if that's really the term) until `Regex::find_anchored` becomes part of the API, thanks.
BurntSushi commented on May 11, 2020
No problem. And yeah, it's kind of a work-around for this specific case, but for general composition, I think it's right.
It is plausible that some kind of API for this should/could be exposed in `regex` proper. Maybe not full syntax parsing, but, for example, a `parse_regex(&str) -> Result<(), Error>` that just checked whether the regex was valid or not without compiling would, I think, be sufficient for composition. Then folks wouldn't need to depend on `regex-syntax` explicitly. (Which, while convenient, is still primarily supposed to be an implementation detail of `regex`.)
Just to make sure your mental model is right here, the `regex` crate currently never builds a full DFA ahead of time. It builds an NFA first, and depending on which matching engine is selected, will either execute the search directly with the NFA or will build the DFA lazily one state at a time during a search. This is the same execution model as RE2.
(In the future, I expect there will be some cases where building the DFA ahead of time is done, but only when doing so would be very cheap and use very little space.)
BurntSushi commented on Mar 6, 2023
I think once #656 lands, it will be possible to achieve this using `regex-automata`'s "meta" regex engine. It will support this sort of flexibility with a richer set of search options.
I'm not sure if it will ever make it into `regex` proper though, unfortunately, since it would seem to me to require duplicating a lot of the methods.
So for now, I think I'm going to request that folks who need this try out the meta regex engine once `regex-automata 0.3` is out. If you run into troubles there, then please file an issue.
changelog: 1.9.0
Update Rust crate regex to 1.9.1 (#1957)