fix invalid byte sequence in UTF-8 exception when unencoding URLs containing non UTF-8 characters #459
Did you see #224? Sounds (a bit) like the same issue?
Oops, no, I didn't see that one. It does look a bit like the same issue, although #224 is about the hostname part. My PR here does not fix #224 because there are other potential encoding problems in the code handling the hostname part. My PR helps with the path and query parts, which in my experience are more likely to contain non UTF-8 chars.
dentarg
left a comment
Looks good to me, I will merge if you can please the 🐶 and rebase :)
Ok, I think I have pleased the 🐶 (although I'm not happy with the result), rebased the branch on main, and squashed the multiple fixes into a single commit that is ready to merge.
I'm not sure what's going on with GitHub Actions, but I can't merge this as status hasn't been reported for a number of jobs (same problem in #469, but there it makes a bit of sense). Do you mind opening this as a new PR?
Maybe wait with that... I saw this now
Indeed, according to actions/runner-images#5583 the brownout ends in 4 hours, so retrying the run after that might pass. The image will have to be updated though, because at the end of August macOS 10.15 will be completely gone ^^ Edit 10h later: I just tried but don't have the rights to restart the build, so I'll let you do it.
Thanks for being thorough and checking 150k extra URL parses. 😁
Thanks @sporkmonger & @dentarg! Happy to do it again if you need some validation for other changes.
Since `PublicSuffix` v4.0.3, it is possible to parse the URL `http://+%D5d.some_site.net`. It is also possible to normalize the URL since `Addressable` v2.8.1, which includes this fix: sporkmonger/addressable#459. Hence, this is now a valid URL, which means that it should be moved to the `valid_urls` array in the specs. `PublicSuffix` 4.x is supported by `Addressable` since v2.7.0 (see https://github.com/sporkmonger/addressable/blob/main/CHANGELOG.md#addressable-270).
Hi 👋
First of all, I've been using `addressable` for some time in my product to deal with complex URL transformations and it's been super helpful. Thanks 🙇
Recently I started getting an `invalid byte sequence in UTF-8` exception in `gsub` when parsing and normalizing some weird URLs containing non UTF-8 compatible characters (ISO-8859-1). So I looked at the code and found that it is supposed to change the encoding back to ASCII-8BIT during this phase to avoid any encoding issue (good idea):
https://github.com/sporkmonger/addressable/blob/addressable-2.8.0/lib/addressable/uri.rb#L576-L580
BUT a couple of lines later, in the `unencode` method, it actually forces the encoding back to UTF-8 right BEFORE the `gsub`:
https://github.com/sporkmonger/addressable/blob/addressable-2.8.0/lib/addressable/uri.rb#L472-L480
(this comes from this change: e4f2bd6, following this issue: #154)
So this change back to UTF-8 before the `gsub` breaks this workflow again. The spec I added in this PR gives this failure using the original code:
My fix simply removes some of the `force_encoding` calls and slightly changes the one inside the `gsub` (to avoid breaking the other issue fixed before). The test suite passes entirely, and I have also checked this version on the 150k+ URLs present in my product (parse + normalize) without any error. I have already been using this version in production for about a week.
Let me know if you have any doubts or questions.
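For readers who want to see the failure mode without installing anything, here is a minimal sketch in plain Ruby. It is not Addressable's code: `unencode_binary` is a hypothetical helper and `/caf%E9` is a made-up ISO-8859-1 path. It only illustrates why running `gsub` on a string tagged as UTF-8 that holds non UTF-8 bytes raises, while doing the same work on a binary (ASCII-8BIT) string does not.

```ruby
# Hypothetical helper (NOT Addressable's implementation): decode %XX escapes
# while the string is tagged as binary, so Regexp matching never has to
# validate the bytes as UTF-8.
def unencode_binary(str)
  str.b.gsub(/%[0-9a-f]{2}/i) { |seq| seq[1..2].hex.chr }
end

# A string tagged as UTF-8 that contains a lone ISO-8859-1 byte (0xE9, "é").
raw = "/caf\xE9".dup.force_encoding(Encoding::UTF_8)
raw.valid_encoding? # false: 0xE9 alone is not a valid UTF-8 sequence

# Regexp-based gsub on such a string raises ArgumentError.
error_message = nil
begin
  raw.gsub(/%[0-9a-f]{2}/i) { |seq| seq }
rescue ArgumentError => e
  error_message = e.message # "invalid byte sequence in UTF-8"
end

# Decoding the percent-encoded form in binary mode works, and the result
# stays binary instead of being forced back to UTF-8.
decoded = unencode_binary("/caf%E9")
decoded.encoding # Encoding::ASCII_8BIT
```

The design point is the same one the description makes: keep the string binary for the whole decode step, and only re-tag the encoding where the bytes are known to be valid.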
Suggested line for the changelog: