Closed
Description
Here's a reproduction:
use regex::bytes::Regex;

fn main() {
    let hay = "I have 12, he has 2!";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}
Actual output:
"I "
"12"
Expected output:
"I "
"12"
", "
"he"
" 2"
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=55914c890dfb6a68fc72b9c6fd986298
The same bug is present even if we use ASCII word boundaries: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eef23f309c9f608eb683aac982648301
Here's a smaller reproduction:
use regex::bytes::Regex;

fn main() {
    let hay = "az,,b";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c7507e4d095141004909f9deb1c6cdd7
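For cross-checking the expected output without going through the regex engine, here is a minimal brute-force sketch that uses only the standard library. It assumes ASCII haystacks with no newlines (so `.` is exactly one byte) and ASCII word boundaries. The names `is_word_byte`, `is_boundary`, and `reference_matches` are made up for this illustration and are not part of the regex crate.

```rust
/// ASCII word byte as used by an ASCII-only `\b`: [0-9A-Za-z_].
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if position `i` (0..=hay.len()) is a word boundary,
/// judged against the FULL haystack rather than a substring.
fn is_boundary(hay: &[u8], i: usize) -> bool {
    let before = i > 0 && is_word_byte(hay[i - 1]);
    let after = i < hay.len() && is_word_byte(hay[i]);
    before != after
}

/// Leftmost, non-overlapping matches of `\b..\b`, found by brute force.
/// Assumption: ASCII input without newlines, so `.` consumes one byte.
fn reference_matches(hay: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut i = 0;
    while i + 2 <= hay.len() {
        if is_boundary(hay, i) && is_boundary(hay, i + 2) {
            out.push(&hay[i..i + 2]);
            // Resume the search at the end of the match.
            i += 2;
        } else {
            i += 1;
        }
    }
    out
}

fn main() {
    for hay in &["I have 12, he has 2!", "az,,b"] {
        let found: Vec<_> = reference_matches(hay.as_bytes())
            .into_iter()
            .map(String::from_utf8_lossy)
            .collect();
        println!("{:?} -> {:?}", hay, found);
    }
}
```

Under these assumptions the scanner reports the five matches listed under "Expected output" for the first haystack, and "az" followed by ",," for the smaller one.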
Originally reported against ripgrep: BurntSushi/ripgrep#1275
Activity
pcpthm commented on Jun 10, 2019
I figured out how this bug occurs.
When a match is found on a forward DFA, reverse DFA matching is done on &text[start..] (regex/src/exec.rs, lines 704 to 716 in 9921922).
However, a word boundary check cannot be done only with the substring. For example, when text = "a," and start = 1, then &text[start..] = ",", but its index 0 (EOF for reverse matching) should be treated as a word boundary.
sergeevabc commented on Feb 27, 2020
@pcpthm, so how should we proceed?
pcpthm commented on Feb 27, 2020
@sergeevabc I was not completely sure how to fix it, which is why I only commented on this issue and didn't write a fix.
An idea is to modify the Byte struct (regex/src/dfa.rs, line 1725 in a0f541b) to have two kinds of eof Bytes, where one returns true to .is_ascii_word() and the other returns false. Then, here (regex/src/dfa.rs, line 834 in a0f541b), we pick one of the two eof bytes depending on whether "text[-1]" is a word byte or not (of course a slice cannot be indexed at a negative position, so we have to pass it as an additional argument). We have to account for the two kinds of EOF bytes in the DFA table layouts, e.g. regex/src/dfa.rs, lines 1522 to 1525 in a0f541b.
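To make the "pass text[-1] as an additional argument" idea concrete, here is a small standalone sketch. It is not the crate's DFA code: `boundary_at_slice_start` and its `prev` parameter are hypothetical names used only to show how the boundary decision at index 0 of `&text[start..]` changes once the byte before the slice is known.

```rust
/// ASCII word byte: [0-9A-Za-z_].
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Word boundary test at index 0 of `slice`.
/// `prev` is the byte that precedes the slice in the full haystack,
/// or None when the slice really starts at the beginning of the input.
/// (Hypothetical helper for illustration; not a regex crate API.)
fn boundary_at_slice_start(slice: &[u8], prev: Option<u8>) -> bool {
    let before = prev.map_or(false, is_word_byte);
    let after = slice.first().map_or(false, |&b| is_word_byte(b));
    before != after
}

fn main() {
    let text = b"a,";
    let start = 1;
    let sub = &text[start..];

    // Looking only at the substring: index 0 looks like the start of input,
    // so no word byte is seen on either side and no boundary is reported.
    let without_context = boundary_at_slice_start(sub, None);

    // Passing "text[-1]" explicitly: 'a' precedes ',' so this is a boundary.
    let prev = if start > 0 { Some(text[start - 1]) } else { None };
    let with_context = boundary_at_slice_start(sub, prev);

    println!("without context: {}", without_context);
    println!("with context:    {}", with_context);
}
```

In the DFA, the same information would presumably be carried by the choice between the two eof bytes rather than by an Option argument; the sketch only shows why the extra byte of context is needed.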
BurntSushi commented on Mar 29, 2020
This will be fixed as part of #656.
malaire commented on Jan 4, 2021
I found a problem with \b - is this the same issue as discussed here?
This returns None even though there are two possible matches at locations 3 and 4.
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3892e7b72507f13b32a1110e94aee066
changelog: 1.9.0
Update Rust crate regex to 1.9.1 (#1957)