RegexSet misbehave with unicode #353

Closed
@constituent

Tested with regex 0.2.1

extern crate regex;
use regex::RegexSet;
fn main() {
    println!("{:?}", RegexSet::new(&["a", "b"]).unwrap().is_match("b"));
    println!("{:?}", RegexSet::new(&["b", "a"]).unwrap().is_match("b"));
    println!("{:?}", RegexSet::new(&["a", "β"]).unwrap().is_match("β"));
    println!("{:?}", RegexSet::new(&["β", "a"]).unwrap().is_match("β"));
}

gives

true
true
false
true

The third should also be true. The only difference between the cases is b versus β, yet the results differ.
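For context, one difference between the two inputs that may be relevant is their encoded length: b is a single byte in UTF-8, while β (U+03B2) is two. The sketch below uses only the standard library; `utf8_bytes` is a hypothetical helper introduced here for illustration and is not part of the regex crate.

```rust
// Hypothetical helper: return the UTF-8 bytes of a string slice.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // 'b' is ASCII, so it encodes to a single byte.
    assert_eq!(utf8_bytes("b"), vec![0x62]);
    // 'β' (U+03B2) encodes to the two-byte sequence 0xCE 0xB2.
    assert_eq!(utf8_bytes("β"), vec![0xce, 0xb2]);
    println!("b = {:x?}, β = {:x?}", utf8_bytes("b"), utf8_bytes("β"));
}
```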

BurntSushi commented on Apr 4, 2017

@BurntSushi
Member

Interestingly, these work fine:

println!("{:?}", Regex::new("a|β").unwrap().is_match("β"));
println!("{:?}", Regex::new("β|a").unwrap().is_match("β"));

BurntSushi commented on May 20, 2017

@BurntSushi
Member

Found the problem. There appears to be a bug in the compiler that's producing incorrect bytecode specifically for RegexSet:

0000 Split(1, 3) (start)
0001 Bytes(a, a)
0002 Match(0)
0003 Bytes(\xb2, \xb2) (goto: 5)
0004 Bytes(\xce, \xce) (goto: 3)
0005 Match(1)

The correct program should be:

0000 Split(1, 4) (start)
0001 Bytes(a, a)
0002 Match(0)
0003 Bytes(\xb2, \xb2) (goto: 5)
0004 Bytes(\xce, \xce) (goto: 3)
0005 Match(1)

My guess is that the extra Match instructions are somehow throwing things off, because a|β produces the correct program.
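To see why the Split target matters, the two listings can be run through a toy interpreter. Both place the \xce check at 0004 and the \xb2 check at 0003, so the β branch has to be entered at 0004. The sketch below is a simplified stand-in, not the regex crate's actual engine: `Inst`, `run`, `program`, and `matches` are hypothetical names, the executor is a plain recursive walk, and matching is assumed to be anchored at the start of the input.

```rust
// Toy instruction set mirroring the listings above (hypothetical types).
#[derive(Clone, Copy)]
enum Inst {
    Split(usize, usize),
    Bytes { lo: u8, hi: u8, goto: usize },
    Match(usize),
}

// Walk the program from `pc`, recording every pattern index that matches.
fn run(prog: &[Inst], pc: usize, input: &[u8], matched: &mut Vec<usize>) {
    match prog[pc] {
        Inst::Split(a, b) => {
            run(prog, a, input, matched);
            run(prog, b, input, matched);
        }
        Inst::Bytes { lo, hi, goto } => {
            // Consume one byte if it falls in the range, then jump to `goto`.
            if let Some((&first, rest)) = input.split_first() {
                if lo <= first && first <= hi {
                    run(prog, goto, rest, matched);
                }
            }
        }
        Inst::Match(i) => matched.push(i),
    }
}

// Build the six-instruction program; only the second Split target varies.
fn program(split_target: usize) -> Vec<Inst> {
    vec![
        Inst::Split(1, split_target),
        Inst::Bytes { lo: b'a', hi: b'a', goto: 2 },
        Inst::Match(0),
        Inst::Bytes { lo: 0xb2, hi: 0xb2, goto: 5 },
        Inst::Bytes { lo: 0xce, hi: 0xce, goto: 3 },
        Inst::Match(1),
    ]
}

fn matches(prog: &[Inst], haystack: &str) -> Vec<usize> {
    let mut matched = Vec::new();
    run(prog, 0, haystack.as_bytes(), &mut matched);
    matched
}

fn main() {
    // Split(1, 3): the β branch starts at the \xb2 check, so the lead byte \xce fails.
    println!("{:?}", matches(&program(3), "β")); // prints []
    // Split(1, 4): \xce is checked first, then \xb2, then Match(1).
    println!("{:?}", matches(&program(4), "β")); // prints [1]
}
```

Under these assumptions, the first program still matches "a" (pattern 0), which is consistent with only the β case failing in the original report.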

added a commit that references this issue on May 20, 2017
cd8f6eb
added a commit that references this issue on May 20, 2017

Auto merge of #369 - rust-lang:ag-fix-353, r=BurntSushi

8a1b2bb

constituent commented on May 21, 2017

@constituent
Author

I've tested with my original issue and it also works fine now. Thanks!
