Skip to content

incorrect case for word boundaries #579

Closed
@BurntSushi

Description

@BurntSushi
Member

Here's a reproduction:

use regex::bytes::Regex;

fn main() {
    let hay = "I have 12, he has 2!";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}

Actual output:

"I "
"12"

Expected output:

"I "
"12"
", "
"he"
" 2"

Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=55914c890dfb6a68fc72b9c6fd986298

The same bug is present even if we use ASCII word boundaries: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eef23f309c9f608eb683aac982648301

Here's a smaller reproduction:

use regex::bytes::Regex;

fn main() {
    let hay = "az,,b";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}

Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c7507e4d095141004909f9deb1c6cdd7

Originally reported against ripgrep: BurntSushi/ripgrep#1275

Activity

pcpthm

pcpthm commented on Jun 10, 2019

@pcpthm

I figured out how this bug occurs.
When a match is found on a forward DFA, reverse DFA matching is done on &text[start..]

regex/src/exec.rs

Lines 704 to 716 in 9921922

// Now run the DFA in reverse to find the start of the match.
match dfa::Fsm::reverse(
&self.ro.dfa_reverse,
self.cache,
false,
&text[start..],
end - start,
) {
Match(s) => Match((start + s, end)),
NoMatch(i) => NoMatch(i),
Quit => Quit,
}
}

However, a word boundary check cannot be done only with the substring. For example, when text = "a," and start = 1 then &text[start..] = "," but its index 0 (EOF for reverse matching) should be treated as a word boundary.

sergeevabc

sergeevabc commented on Feb 27, 2020

@sergeevabc

@pcpthm, so how should we proceed?

pcpthm

pcpthm commented on Feb 27, 2020

@pcpthm

@sergeevabc I was not completely sure how to fix it and this is why I just commented on this issue and didn't write a fix.
An idea is to modify the Byte struct

regex/src/dfa.rs

Line 1725 in a0f541b

impl Byte {
so that there are two special eof Bytes where one returns true to .is_ascii_word() and other one returns false. Then, here
prev_si = match self.next_state(qcur, qnext, prev_si, Byte::eof()) {
, we have to use the appropriate type of eof depending on whether "text[-1]" is a word-byte or not (of course a slice cannot index at negative so we have to pass it as an additional argument). We have to account for the two kinds of EOF bytes for DFA table layouts e.g.

regex/src/dfa.rs

Lines 1522 to 1525 in a0f541b

fn num_byte_classes(&self) -> usize {
// We add 1 to account for the special EOF byte.
(self.prog.byte_classes[255] as usize + 1) + 1
}
and it might lead to slight performance degradation in some case. I'm not sure it can be improved.

BurntSushi

BurntSushi commented on Mar 29, 2020

@BurntSushi
MemberAuthor

This will be fixed as part of #656.

malaire

malaire commented on Jan 4, 2021

@malaire

I found a problem with \b - is this same issue as discussed here?

This returns None even though there are two possible matches at locations 3 and 4.

fn main() {
    let re = regex::Regex::new(r"\b.").unwrap();
    println!("{:?}", re.find_at("foo bar", 3));
}

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3892e7b72507f13b32a1110e94aee066

added a commit that references this issue on Jul 5, 2023
8513751
added a commit that references this issue on Mar 12, 2024
aa64e6d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @sergeevabc@BurntSushi@malaire@pcpthm

        Issue actions

          incorrect case for word boundaries · Issue #579 · rust-lang/regex