Skip to content

Regex::must(...) to replace Regex::new(...).unwrap()? #450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
BurntSushi opened this issue Feb 22, 2018 · 19 comments
Closed

Regex::must(...) to replace Regex::new(...).unwrap()? #450

BurntSushi opened this issue Feb 22, 2018 · 19 comments

Comments

@BurntSushi
Copy link
Member

BurntSushi commented Feb 22, 2018

This issue is intended to brainstorm some ideas on how to report regex parse errors to Rust programmers.

One of my long running projects has been a rewrite of the regex-syntax crate. There are several goals I'm achieving via this rewrite, and one of them in particular is better error messages for regex parse errors.

Here's an example:

errors

Here's another example:

errors2

The above error messages correspond to the fmt::Display implementation on the error returned by the parser. What I'd like to have happen is for these error messages to appear to programmers when they create malformed regexes in their source code. Specifically, today, the code pattern for building a regex that is known at compile time is:

let re = Regex::new(...).unwrap();

The issue here is that if the regex contains a parse error, then the unwrap implementation will show the Debug message for the error instead of the Display implementation. That's definitely the right behavior in most cases, but in this case, I'd really like to show a nicer error message. My question is: how can I achieve that?

I can think of two ways:

  1. Make the Debug implementation defer to the Display implementation. This causes unwrap to show the right error message, but now we don't have a "normal" Debug implementation, which feels kind of wrong to me.
  2. Create a new constructor on regex called must (name subject to bikeshedding) that is basically the same as Regex::new(...).unwrap(), except it will emit the Display impl for the error instead of the Debug impl. The downside here is that users of the regex crate need to use must, and this particular style of handling panics isn't really idiomatic in the Rust ecosystem, where we instead prefer an explicit unwrap. The key difference here is that regexes in the source code with parse errors are inherently programmer errors, and programmers should get nice regex error messages, because the Debug impl isn't going to be particularly helpful.

Are there other ways of approaching this problem? What do other people think? My inclination right now is hovering around (2), particularly since Regex::new(...).unwrap() is not only extraordinarily common, but is also correct in the vast majority of cases (with it being incorrect only when the regex isn't known until runtime). That is, normalizing that code pattern with an explicit constructor feels like a good thing to me on its own, but it's not clear how the balances with established idioms.

cc @rust-lang/libs @ethanpailes @robinst @killercup @steveklabnik @bluss

@sfackler
Copy link
Member

It seems like this may be a case where new panics and there's an alternative function (maybe just FromStr?) if you actually need to check for a parse error. I don't think I've ever written anything other than Regex::new().unwrap().

@BurntSushi
Copy link
Member Author

BurntSushi commented Feb 22, 2018

@sfackler That's an interesting thought. I guess the problem there is that it's a pretty significant (albeit trivial to fix) breaking change. My plan is to introduce the regex-syntax rewrite without doing a semver bump on regex.

@sfackler
Copy link
Member

Oh yeah, I didn't see that this was talking about regex-syntax breakage rather than regex.

@robinst
Copy link
Contributor

robinst commented Feb 23, 2018

Having a panicking constructor is a good idea I think. Some thoughts on that:

  1. RegexBuilder would need to get the same method. The current method name there is build. So maybe it would be nice if the names for the panicking new and build shared a common prefix/suffix? Some ideas to show what I mean: new_ok, build_ok. new_expect, build_expect. new_or_fail, build_or_fail etc
  2. I'm not a fan of the name must. Is there an existing naming convention in the standard library for something that can fail?
  3. Would most cases where you want a panicking new use a pattern that is actually a &'static str (not just a &str). Would it be a good idea to require the panicking constructor to require the argument to be a &'static str? And in case the pattern is dynamically created, you'd probably want to handle the error in a special way and not panic. I'm not sure how this would be applicable to RegexBuilder though.

@clarfonthey
Copy link

Longer-term I think the best solution is having some form of compile-time checking as the default and making a runtime-specific version more obvious. I wouldn't mind if initially, the compile-time version just checked the syntax and did a new().unwrap().

@clarfonthey
Copy link

To clarify: I think that having a regex! macro which guaranteed that the regex would be valid at compile time, where panicking ever is a bug, should be the primary method of making a regex. It doesn't matter how this is formed (and it should be similar to the current method, not the former macro), but shifting everyone to this method is ideal, as it can display the error in a human-friendly way. The runtime-level version would probably be better named parse to make it clear that it's dynamically generating the regex. Usually, you won't unwrap that, but if you do, it means your code is wrong and you'd be debugging via the usual means.

@BurntSushi
Copy link
Member Author

BurntSushi commented Feb 23, 2018

Compile time checking is definitively out of scope for this issue.

The &'static str idea is nice. It would be backwards compatible if we wanted to loosen that too.

I like must because it is short. We could add it as a method to RegexBuilder too. I would be sad if we ended up with long unwieldy names like "something_or_fail`, especially for such a common operation.

@BurntSushi
Copy link
Member Author

The standard library tends to go in the opposite direction I think, where fallible methods are prefixed with try_. But I don't think there are any instances of constructors.

@killercup
Copy link
Member

killercup commented Feb 23, 2018

Great ideas, I especially like the &'static str special :) I also wanted to point out that there is an invalid_regex lint in clippy.

A new thought: Show the fancy parse error when RUST_BACKTRACE=1 (or a special other env var or feature is set) instead or when a special method is used.

@ethanpailes
Copy link
Contributor

I'll just add my voice to the chorus in favor of option number 2.

My bikeshedding 2 cents on the name is that length matters the most because I expect that people will end up writing this a lot, but must isn't quite as obvious as it could be. I kind of like the new_ok/build_ok option. I don't think it is much more obvious than must but @robinst is right that there should be an option for using this with RegexBuilder and this way the two approaches are related.

@BurntSushi
Copy link
Member Author

BurntSushi commented Mar 7, 2018

OK, so it looks like I can actually sidestep this issue for the most part. One thing I forgot about was that the error type in the regex crate doesn't actually use the error type from regex-syntax to avoid creating a public dependency. Instead, it just converts the syntax error to a string. Since the debug representation at this point is just going to show that string anyway, we might as well format it a little more nicely. Example:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    hello\p{Greeek}wat
         ^^^^^^^^^^
error: Unicode property not found
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', libcore/result.rs:945:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Apologies for the distraction folks!

(I still do kind of like the aesthetics of Regex::must, but I suppose we can stew on that.)

BurntSushi added a commit to BurntSushi/regex that referenced this issue Mar 7, 2018
This commit adds an explicit Debug impl for regex's main Error type.
The purpose of this impl is to format parse errors in normal panic
messages more nicely. This is slightly idiosyncratic, but the default
Debug impl prints the full string anyway, we might as well format it
nicely.

See also: rust-lang#450
BurntSushi added a commit to BurntSushi/regex that referenced this issue Mar 7, 2018
This commit adds an explicit Debug impl for regex's main Error type.
The purpose of this impl is to format parse errors in normal panic
messages more nicely. This is slightly idiosyncratic, but the default
Debug impl prints the full string anyway, we might as well format it
nicely.

See also: rust-lang#450
BurntSushi added a commit that referenced this issue Mar 8, 2018
This commit adds an explicit Debug impl for regex's main Error type.
The purpose of this impl is to format parse errors in normal panic
messages more nicely. This is slightly idiosyncratic, but the default
Debug impl prints the full string anyway, we might as well format it
nicely.

See also: #450
@BurntSushi
Copy link
Member Author

@NickHackman wrote:

This issue is a continuation of #450, as it seems like this would be a nice feature to add. To slightly diverge from #450, instead of replacing Regex::new just an alternative constructor that would return a Regex rather than a Result. Making this not a breaking change, but just a cleaner syntax for a common pattern.

This seems beneficial because in a large number of cases regular expressions are constants that are just unwrapped anyway, it should just panic instead for these cases.

Example

Before:

let REGEX = Regex::new(r"regular-expression").unwrap();

After:

let REGEX = Regex::must_compile(r"regular-expression");

I'm not particularly taken by the Regex::must proposed in #450, Regex::must_compile is more explicit, but long. we can discuss this further.


Original link: #688 (comment)

@BurntSushi
Copy link
Member Author

BurntSushi commented Jun 3, 2020

This issue is a continuation of #450, as it seems like this would be a nice feature to add. To slightly diverge from #450, instead of replacing Regex::new just an alternative constructor that would return a Regex rather than a Result. Making this not a breaking change, but just a cleaner syntax for a common pattern.

That's exactly what this issue proposes though. It proposes adding a new Regex::must constructor. It didn't propose removing anything or adding any breaking changes.

The main reason why I closed this issue was because I found another way to solve the problem at hand, which was producing nicer error messages. With that said, it does seem like there was positive reception to the idea.

I'm not particularly taken by the Regex::must proposed in #450, Regex::must_compile is more explicit, but long. we can discuss this further.

I think a name like must_compile partially defeats the point. It still have a clearer semantics, but it's just as long as writing Regex::new(r"...").unwrap().

@RReverser
Copy link
Contributor

I think, if going the must_compile path, it's better to provide a proc-macro (again) rather than a new panicking method, so that any errors could be detected at compile-time.

It would expand to the same Regex::new(...).unwrap(), but would solve both the problem of cleaner syntax, as well as ensuring that regexes in the source code are indeed valid, even before they're constructed at runtime.

@BurntSushi
Copy link
Member Author

As I said above, compile time checks are definitively out of scope for this issue. It isn't going to happen any time soon. (Years, if at all.)

@RReverser
Copy link
Contributor

@BurntSushi Why? There are already 3rd-party crates that implement it, so my suggestion is merely either documenting them or incorporating similar approach. It doesn't have much complexity, since all it does is calls Regex::new(...).unwrap() inside proc macro as well, resulting in a compilation error.

@BurntSushi
Copy link
Member Author

Can you show me such an example? I don't see how calling Regex::new(...).unwrap() at compile time could possibly work. At minimum, you'd have to do that at compile time and call Regex::new().unwrap() at runtime. On top of that, such a thing would require an additional crate. Finally, checking whether a regex is syntactically valid or not is already supported by Clippy.

@RReverser
Copy link
Contributor

At minimum, you'd have to do that at compile time and call Regex::new().unwrap() at runtime.

Yeah that's whey they usually do. I believe Clippy you mentioned is a good example of that actually. I'll try to find a standalone one I've seen somewhere, but yeah, that's true that it will require an additional crate (that can be reexported though).

@BurntSushi
Copy link
Member Author

Okay, well at least I now understand what you meant. But yeah, I think that's more trouble than it's worth, particularly with the existence of the Clippy lint. Adding a Regex::must is quite a bit simpler. I'd personally rather not start down the proc-macro path until I'm ready to dive into real compile time regexes. (Although, I'm still hoping to get bailed out by const fn.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants