Incorrect match behavior on $ #557

Closed
@davisjam
Contributor

Rust version:

(15:14:21) jamie@woody $ rustc --version
rustc 1.32.0-nightly (00e03ee57 2018-11-22)

(15:15:24) jamie@woody $ cargo --version
cargo 1.32.0-nightly (b3d0b2e54 2018-11-15)

This might be a duplicate of some other already-open bugs, but it's hard to guess at root causes from symptoms.

Consider the regex /(aa$)?/ and the input "aaz", using a partial match.

The docs say that $ should match the end of text (or end-of-line with multi-line mode).

The final ? means that the regex engine can partial-match any string: either strings that end in "aa" (capturing "aa") or any string (capturing nothing).

Since the input "aaz" does not end in "aa":

  • I expect this regex to match and capture nothing.
  • The regex actually matches and captures "aa".

Here's the kernel of my test program:

  match Regex::new(&query.pattern) {
    Ok(re) => {
      queryResult.validPattern = true;

      for i in 0..query.inputs.len() {
        let input = query.inputs.get(i).unwrap();
        eprintln!("Input: {}", input);

        let mut matched = false;
        let mut matchedString = "".to_string();
        let mut captureGroups: Vec<String> = Vec::new();

        // Partial-match semantics
        match re.captures(&input) {
          Some(caps) => {
            matched = true;

            matchedString = caps.get(0).unwrap().as_str().to_string();
            captureGroups = Vec::new();
            for i in 1..caps.len() {
              match caps.get(i) {
                Some(m) => {
                  captureGroups.push(m.as_str().to_string());
                },
                None => {
                  captureGroups.push("".to_string()); // Interpret unused capture group as ""
                }
              }
            }
          },
          None => {
            matched = false;
          }
        }

        let mr: MatchResult = MatchResult{
          input: input.to_string(),
          matched: matched,
          matchContents: MatchContents{
            matchedString: matchedString,
            captureGroups: captureGroups,
          },
        };

        queryResult.results.push(mr);
      }
    },
    Err(error) => {
      // Could not build.
      queryResult.validPattern = false;
    }
  };

This is the behavior on the regex and input described above:

{"pattern": "(aa$)?", "inputs": ["aaz"]}

The pattern is: (aa$)?
Input: aaz
{
  "pattern": "(aa$)?",
  "inputs": [
    "aaz"
  ],
  "validPattern": true,
  "results": [
    {
      "input": "aaz",
      "matched": true,
      "matchContents": {
        "matchedString": "aa",
        "captureGroups": [
          "aa"
        ]
      }
    }
  ]
}

In this case, Rust is unique among the 8 languages I tried. Perl, PHP, Java, Ruby, Go, JavaScript (Node-V8), and Python all match with an empty/null capture.

BurntSushi commented on Jan 31, 2019

Member

I believe your analysis on what's expected is correct. Here's a much smaller reproduction: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=30dfc1d0a4d9c158c2dfe55fed32331b

The first and third results are correct, but the second one is not. I do believe there is a duplicate bug for this, but I'd want to confirm the root cause first. It could be a while before this gets fixed.

davisjam commented on Jan 31, 2019

Contributor, Author

To your test cases I would add:

   let re = Regex::new(r"(aa$)").unwrap();
   println!("{:?}", re.captures("aaz"));

which does not match. So the presence of the trailing ? appears to be important.

hikotq commented on Feb 13, 2019

Contributor

Hi.
I have investigated this issue.
The problem seems to come from the following code:

regex/src/exec.rs

Lines 873 to 887 in 60d087a

  fn captures_nfa_with_match(
      &self,
      slots: &mut [Slot],
      text: &[u8],
      match_start: usize,
      match_end: usize,
  ) -> Option<(usize, usize)> {
      // We can't use match_end directly, because we may need to examine one
      // "character" after the end of a match for lookahead operators. We
      // need to move two characters beyond the end, since some look-around
      // operations may falsely assume a premature end of text otherwise.
      let e = cmp::min(
          next_utf8(text, next_utf8(text, match_end)), text.len());
      self.captures_nfa(slots, &text[..e], match_start)
  }

As an example, run the following sample code:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=1fef3de82985c493655ba57ffa6a7ba0

In this example, the DFA runs first to find the overall match, and the NFA then captures the submatches within the range of that match.
The DFA matches the empty string here, so the range passed to the capture step is (s, e) = (0, 0).

regex/src/exec.rs

Lines 563 to 575 in 60d087a

  MatchType::Dfa => {
      if self.ro.nfa.is_anchored_start {
          self.captures_nfa(slots, text, start)
      } else {
          match self.find_dfa_forward(text, start) {
              dfa::Result::Match((s, e)) => {
                  self.captures_nfa_with_match(slots, text, s, e)
              }
              dfa::Result::NoMatch(_) => None,
              dfa::Result::Quit => self.captures_nfa(slots, text, start),
          }
      }
  }

captures_nfa_with_match appears to take the start and end positions of the overall match and capture the submatches within that range. However, the text it passes to the NFA can extend up to two characters past the end of the matched text.
In the example, the slice passed to the NFA therefore covers (0, 0 + 2) = (0, 2), so "aa" becomes the text searched for submatches.
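The arithmetic above can be reproduced in isolation. The following is only a sketch: `next_utf8` here is a hypothetical stand-in for the crate's internal helper (assumed to return the byte offset just past the code point starting at `i`), and `capture_window_end` is a name made up for illustration.

```rust
use std::cmp;

/// Hypothetical stand-in for the internal `next_utf8` helper (an assumption,
/// not copied from the crate): returns the byte offset just past the UTF-8
/// code point that starts at `i`, or `i + 1` when `i` is out of bounds.
fn next_utf8(text: &[u8], i: usize) -> usize {
    let b = match text.get(i) {
        None => return i + 1,
        Some(&b) => b,
    };
    // Width of the code point, derived from its leading byte.
    let width = if b <= 0x7F {
        1
    } else if b <= 0b1101_1111 {
        2
    } else if b <= 0b1110_1111 {
        3
    } else {
        4
    };
    i + width
}

/// Illustrative name: end offset of the slice handed to the capture pass,
/// i.e. two code points past `match_end`, clamped to the text length.
fn capture_window_end(text: &[u8], match_end: usize) -> usize {
    cmp::min(next_utf8(text, next_utf8(text, match_end)), text.len())
}
```

For the input "aaz" and a DFA match ending at offset 0, `capture_window_end(b"aaz", 0)` evaluates to 2, so the capture pass only sees the slice "aa".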

$ is evaluated while the submatches are being captured, but in the current code the check only tests whether the position is at the end of the string passed to captures_nfa. The position after reading "aa" (at.pos() == 2) is compared with the length of the truncated slice "aa" (self.len() == 2), and the two are equal. This makes /aa$/ match the "aa" inside "aaz". In the example, EndText should only match when at.pos() == 3.

EndText => at.pos() == self.len(),

Also, #334 was reported against the same part of the code. That issue is already closed, but if the regular expression is changed slightly, like this:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fc92a5a10c24a89c300c974ae9296373
the problem still reproduces.

This seems to have the same cause as this issue.

To solve this problem, I think it is necessary to use the length of the original text when processing EndLine and EndText.
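A toy model of the difference, assuming nothing about how the actual fix was implemented: `matches_aa_dollar` is a hypothetical function that hard-codes the pattern `aa$` at offset 0, and `end_len` is whichever length the `$` check is compared against.

```rust
/// Toy check for the pattern `aa$` anchored at offset 0 of `window`.
/// `end_len` is the length that `$` compares the position against:
/// the truncated slice length (current behavior) or the original text
/// length (the proposal). This function is illustrative only.
fn matches_aa_dollar(window: &[u8], end_len: usize) -> bool {
    let pos = 2; // position after consuming "aa"
    window.starts_with(b"aa") && pos == end_len
}

/// Runs the check on the slice `&text[..window_end]`, choosing which
/// length `$` is judged against.
fn check(text: &[u8], window_end: usize, use_original_len: bool) -> bool {
    let window = &text[..window_end];
    let end_len = if use_original_len { text.len() } else { window.len() };
    matches_aa_dollar(window, end_len)
}
```

With the text "aaz" and a window ending at offset 2, `check(b"aaz", 2, false)` is true (the false match reported here), while `check(b"aaz", 2, true)` is false because position 2 is not the end of the three-byte original text.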

added a commit that references this issue on Feb 20, 2019
25eece1
added 4 commits that reference this issue on Mar 10, 2019
011b2be
b800cc9
4e932e6
7fa0b1c
added a commit that references this issue on Mar 30, 2019
72066ec
BurntSushi commented on Mar 30, 2019

Member

@Pipopa Thanks so much for your investigation into this issue and subsequent fix. I've merged it in #567. :-)

hikotq commented on Mar 30, 2019

Contributor

@BurntSushi Thank you for reviewing and merging!

added a commit that references this issue on Apr 16, 2019
added a commit that references this issue on Mar 12, 2024
3563d73
          Incorrect match behavior on $ · Issue #557 · rust-lang/regex