Closed
Description
Rust version:
(15:14:21) jamie@woody $ rustc --version
rustc 1.32.0-nightly (00e03ee57 2018-11-22)
(15:15:24) jamie@woody $ cargo --version
cargo 1.32.0-nightly (b3d0b2e54 2018-11-15)
This might be a duplicate of some other already-open bugs, but it's hard to guess at root causes from symptoms.
Consider the regex /(aa$)?/
and the input "aaz", using a partial match.
The docs say that $
should match $ the end of text (or end-of-line with multi-line mode)
.
The final ?
means that the regex engine can partial-match any string --- either strings that end in "aa" (capturing "aa") or any string (capturing nothing).
Since the input "aaz" does not end in "aa":
- I expect this regex to match and capture nothing.
- The regex actually matches and captures "aa".
Here's the kernel of my test program:
match Regex::new(&query.pattern) {
Ok(re) => {
queryResult.validPattern = true;
for i in 0..query.inputs.len() {
let input = query.inputs.get(i).unwrap();
eprintln!("Input: {}", input);
let mut matched = false;
let mut matchedString = "".to_string();
let mut captureGroups: Vec<String> = Vec::new();
// Partial-match semantics
match re.captures(&input) {
Some(caps) => {
matched = true;
matchedString = caps.get(0).unwrap().as_str().to_string();
captureGroups = Vec::new();
for i in 1..caps.len() {
match caps.get(i) {
Some(m) => {
captureGroups.push(m.as_str().to_string());
},
None => {
captureGroups.push("".to_string()); // Interpret unused capture group as ""
}
}
}
},
None => {
matched = false;
}
}
let mr: MatchResult = MatchResult{
input: input.to_string(),
matched: matched,
matchContents: MatchContents{
matchedString: matchedString,
captureGroups: captureGroups,
},
};
queryResult.results.push(mr);
}
},
Err(error) => {
// Could not build.
queryResult.validPattern = false;
}
};
This is the behavior on the regex and input described above:
{"pattern": "(aa$)?", "inputs": ["aaz"]}
The pattern is: (aa$)?
Input: aaz
{
"pattern": "(aa$)?",
"inputs": [
"aaz"
],
"validPattern": true,
"results": [
{
"input": "aaz",
"matched": true,
"matchContents": {
"matchedString": "aa",
"captureGroups": [
"aa"
]
}
}
]
}
In this case, Rust is unique among the 8 languages I tried. Perl, PHP, Java, Ruby, Go, JavaScript (Node-V8), and Python all match with an empty/null capture.
Activity
BurntSushi commentedon Jan 31, 2019
I believe your analysis on what's expected is correct. Here's a much smaller reproduction: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=30dfc1d0a4d9c158c2dfe55fed32331b
The first and third results are correct, but the second one is not. I do believe there is a duplicate bug for this, but I'd want to confirm the root cause first. It could be a while before this gets fixed.
davisjam commentedon Jan 31, 2019
To your test cases I would add:
which does not match. So the presence of the trailing
?
appears to be important.hikotq commentedon Feb 13, 2019
Hi.
I have investigated about this issue.
The problem seems to come from the following code:
regex/src/exec.rs
Lines 873 to 887 in 60d087a
which in exec.rs
As an example, Execute the following sample code:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=1fef3de82985c493655ba57ffa6a7ba0
In case of example, capture of the submatch is performed for the range of the matching result after performing matching in the DFA.
In this case submatches are captured by NFA after matching by DFA. The match result by DFA matches the null character so submatch capture range is (s, e) = (0, 0).
regex/src/exec.rs
Lines 563 to 575 in 60d087a
captures_nfa_with_match
seems to be a function to get the start position and end position of the matching result and to get a submatch within that range, but there are cases that the text two characters added to mached text specified as the submatch capture range.In the code shown in the example, submatch capture range is (0, 0 + 2) = (0, 2) range, so "aa" is the submatch capture target.
Matching against $ is done when capture the submatches, but this process is to judge whether it is the end of the character string passed to
captures_nfa
in the current code. And the position at the time of finishing reading "aa" (at.pos() == 2
) and the length of aa (self.len() == 2
) are compared and matched. This makes/aa$/
match for "aa". But the condition of a match by EndText should be whenat.pos() == 3
in case of example.regex/src/input.rs
Line 188 in 60d087a
Also, #334 has been reported as an issue related to the relevant part. This issue is already closed, but it can be confirmed that even if the regular expression is slightly changed like this:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fc92a5a10c24a89c300c974ae9296373
the problem is reproduced even now.
This seems to be the same cause as the issue of this issue.
To solve this problem, I feel that it is necessary to use the length of the original text when processing EndLine and EndText
Fix a bug in capture with match
Fix a bug in capture with match
Fix a bug in capture with match
Fix a bug in capture with match
Fix a bug in capture with match
exec: fix a bug in capture with match
BurntSushi commentedon Mar 30, 2019
@Pipopa Thanks so much for your investigation into this issue and subsequent fix. I've merged it in #567. :-)
hikotq commentedon Mar 30, 2019
@BurntSushi Thank you for reviewing and merging!
input: fix infinite loop regression
exec: fix a bug in capture with match