Skip to content

Conversation

@SnoopJ
Copy link
Contributor

@SnoopJ SnoopJ commented Mar 22, 2023

Description

This changeset handles hexadecimal escape sequences in ISUPPORT parameter values, implementing #2195

Checklist

  • I have read CONTRIBUTING.md
  • I can and do license this contribution under the EFLv2
  • No issues are reported by make qa (runs make quality and make test)
  • I have tested the functionality of the things this change touches

@SnoopJ
Copy link
Contributor Author

SnoopJ commented Mar 22, 2023

The specification says also that:

If the server has not advertised a CHARSET parameter, it MUST not use
such sequences with a value outside those permitted by the above ABNF
grammar, with the exception of "\x20";

which I believe is not currently checked by Sopel. Not sure if we want to worry about that, but if so, it should probably be part of this changeset.

@dgw dgw added this to the 8.0.0 milestone Mar 22, 2023
@dgw
Copy link
Member

dgw commented Mar 22, 2023

Refreshing to have such a small patch to review after you know what 😅 (and thank you @SnoopJ for helping with that one).

The specification says also that:

If the server has not advertised a CHARSET parameter, it MUST not use
such sequences with a value outside those permitted by the above ABNF
grammar, with the exception of "\x20";

which I believe is not currently checked by Sopel. Not sure if we want to worry about that, but if so, it should probably be part of this changeset.

I wouldn't worry about servers violating the MUST; we can "be liberal in what you accept" and decode the escapes anyway, assuming UTF-8 in the absence of a specified CHARSET.

You could add a test case where something that looks like an escape but isn't valid (what you called "pathological counter-examples" on IRC) gets handled. More importantly, the following behavior should be validated in tests as well:

Characters in multibyte encodings such as UTF-8 should be
sent as a series of \x sequences.

If we're gonna do this, it had better not mojibake the encoded values 😉

@SnoopJ
Copy link
Contributor Author

SnoopJ commented Mar 22, 2023

Characters in multibyte encodings such as UTF-8 should be
sent as a series of \x sequences.

This 'should' confuses me a bit. I think what it's saying is that if the server advertises a CHARSET, the meaning of a sequence of \xHH escapes is different than what this changeset is currently doing.

To give a concrete example, let's say we receive the bytes NETWORK=b'\\xc5\\x81ibera' from the server, and that CHARSET=UTF-8. The way I understand this, the correct decoding of the parameter value involves first replacing \\xc5\\x81 with those octets, then decoding the parameter with UTF-8. This changeset will decode as ("NETWORK", "Å\x81ibera") instead of the correct decoding of ("NETWORK", "Łibera"), if I'm not mistaken.

Since this part of Sopel has already decoded received bytes to a string, doing this as described is tricky. I'll mull it over and maybe do a little sopelunking to see how to accomodate this part of the specification.

@SnoopJ
Copy link
Contributor Author

SnoopJ commented Mar 22, 2023

After discussion with @Exirel, it's come to light that the Modern IRC specification and the IETF specification have slightly different opinions about escape sequences, with Modern IRC being more strict, limiting escapes to only \x20, \x5C, \x3D.

I've updated this PR to follow the Modern IRC perspective, which is not only simpler to implement, but avoids tangling with the complication of the antiquated/removed CHARSET field, and the ambiguity of bytes sent in a multi-byte encoding.

@SnoopJ SnoopJ requested a review from a team March 22, 2023 23:36
@Exirel
Copy link
Contributor

Exirel commented Mar 22, 2023

Note from IRC: we have a different understanding of the spec, and we agree that it is a bit confusing. Following these two things led me to believe that we don't need to un-escape everything and only a subset of characters:

As a result, we discussed it with @SnoopJ and we came to the conclusion that we are going to un-escape only the few bytes that need to.

Edit: @SnoopJ beat me to it!

@SnoopJ SnoopJ force-pushed the feature/gh2195_support-hexadecimal-escape branch from 754471d to fa45f08 Compare March 23, 2023 00:34
Copy link
Member

@dgw dgw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everyone else already did the good nitpicks, so this was all that was left for me to find. 🤣

RPL_ISUPPORT may include escape sequences for octets confusible with the
message grammar.  This changeset adds support for this protocol feature
by adding logic to unescape these sequences.

Co-authored-by: dgw <[email protected]>
Co-authored-by: Florian Strzelecki <[email protected]>
Co-authored-by: Rusty Bower <[email protected]>
@SnoopJ SnoopJ force-pushed the feature/gh2195_support-hexadecimal-escape branch from 10854c2 to 77132fc Compare March 23, 2023 03:36
@SnoopJ
Copy link
Contributor Author

SnoopJ commented Mar 23, 2023

Squashed and rebased with a summary in the commit message. Another one8.0 bites the dust?

Copy link
Member

@dgw dgw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: :shipit: :shipit: 🚀

@Exirel Exirel linked an issue Mar 23, 2023 that may be closed by this pull request
@dgw dgw merged commit f100238 into sopel-irc:master Mar 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

isupport: hexadecimal escape sequences

5 participants