Description
Crate version: 1.11.0
Example code: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c4b4cfe18c2e6413444e53315de33b27 (used for snippets below and extra checks)
The behavior of the crate when trying to use the ASCII character class syntax [[:foo:]] with invalid character classes is somewhat confusing. A friend was trying to use [[:XID_Start:]] to check whether _ (underscore/low line) was included in the XID_Start character class (it's not), and was confused when it returned true.
let expr = regex::Regex::new(r"[[:XID_Start:]]").unwrap();
dbg!(expr.is_match("_")); // true
The correct syntax, \p{XID_Start}, works as expected:
let correct = regex::Regex::new(r"\p{XID_Start}").unwrap();
dbg!(correct.is_match("a")); // true
dbg!(correct.is_match("1")); // false
dbg!(correct.is_match("_")); // false
It seems that when the name is not a valid ASCII character class (see the regex docs, § ASCII character classes), the crate falls back to matching any character that appears within the brackets:
dbg!(expr.is_match(":")); // true
dbg!(expr.is_match("X")); // true
dbg!(expr.is_match("x")); // false
dbg!(expr.is_match("a")); // true
dbg!(expr.is_match("b")); // false
dbg!(expr.is_match("[")); // false
dbg!(expr.is_match("]")); // false
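One plausible reading of these results, offered as a guess rather than a statement about the crate's internals: when the name is not a recognized ASCII class, the inner [:XID_Start:] is read as an ordinary bracket set containing ':' plus each character of the name. The sketch below models that hypothesis using only the standard library. The function `fallback_matches` is hypothetical and is not part of regex; the list of class names is the one given in the crate's syntax documentation.

```rust
/// ASCII class names listed in the regex crate's syntax documentation.
const ASCII_CLASSES: [&str; 14] = [
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "word", "xdigit",
];

/// Hypothetical model (not the crate's code) of how `[[:NAME:]]` appears to
/// treat a single character `c` when NAME is not a known ASCII class.
/// Returns `None` for real ASCII classes, which this model does not cover.
fn fallback_matches(name: &str, c: char) -> Option<bool> {
    if ASCII_CLASSES.iter().any(|&known| known == name) {
        return None;
    }
    // Assumption: the inner `[:NAME:]` is read as a plain bracket set,
    // i.e. the literal ':' plus every character that occurs in NAME.
    Some(c == ':' || name.contains(c))
}

fn main() {
    // The same probe characters used in the snippets above.
    for c in ['_', ':', 'X', 'x', 'a', 'b', '[', ']'] {
        println!("{:?} -> {:?}", c, fallback_matches("XID_Start", c));
    }
}
```

Under this model, '_' matches only because the name XID_Start happens to contain an underscore, which would explain the original confusion.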
I'm not entirely sure what regex is actually interpreting this sequence as, but, assuming this is intentional behavior, I think it is worth documenting in the aforementioned section on ASCII character classes in the docs, as the behavior is not immediately intuitive.
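In the meantime, a caller who wants stricter behavior could scan a pattern for [:name:] groups whose name is not a documented ASCII class before compiling it. The following is a minimal sketch under that assumption. `unknown_ascii_classes` is a hypothetical helper, not a regex API, and it is a naive textual scan: it ignores escapes and does not check that the group actually sits inside a bracket expression.

```rust
/// ASCII class names listed in the regex crate's syntax documentation.
const ASCII_CLASSES: [&str; 14] = [
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "word", "xdigit",
];

/// Hypothetical pre-check (not part of regex): returns every name written
/// as `[:name:]` or `[:^name:]` that is not a known ASCII class.
fn unknown_ascii_classes(pattern: &str) -> Vec<&str> {
    let mut unknown = Vec::new();
    let mut rest = pattern;
    while let Some(start) = rest.find("[:") {
        let after = &rest[start + 2..];
        let end = match after.find(":]") {
            Some(end) => end,
            None => break, // no closing `:]`, nothing more to inspect
        };
        let raw = &after[..end];
        // A leading '^' negates the class, so strip it before the lookup.
        let name = raw.strip_prefix('^').unwrap_or(raw);
        if !ASCII_CLASSES.iter().any(|&known| known == name) {
            unknown.push(name);
        }
        rest = &after[end + 2..];
    }
    unknown
}

fn main() {
    for pattern in ["[[:XID_Start:]]", "[[:alpha:][:^digit:]]", r"\p{XID_Start}"] {
        println!("{} -> {:?}", pattern, unknown_ascii_classes(pattern));
    }
}
```

A non-empty result could be turned into an error or a warning by the caller, which approximates strict rejection of unrecognized classes without changing the crate itself.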
Activity
BurntSushi commented on Oct 27, 2024
Yes, the behavior is unfortunate but intentional, for compatibility with how other regex engines work. In retrospect, I would rather have been a bit more strict here and produced errors for unrecognized classes.
I agree that adding a note to the docs about this would be a good idea.