Currently, the SQLite driver uses `utf8mb4` as the database charset. It saves the table and column charsets and collations to the information schema, but it doesn't verify them or otherwise take them into account.
The general idea is to check which charset and collation were specified and verify that they are compatible with Unicode (SQLite). If we find a charset or collation that would lead to incorrect application behavior, we should throw an error.
For collations, we should also try to match them to SQLite-supported collations and apply them correctly. The current driver simply adds `COLLATE NOCASE` to all textual columns, because case-insensitive comparison is the default MySQL behavior, but we need to reflect what was defined in the table/column definition.
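As a rough illustration of matching declared collations to SQLite ones, here is a minimal Python sketch. It is not the driver's code: `sqlite_collation_for` and `column_definition` are hypothetical helpers, and the mapping (`_bin` to `BINARY`, `_ci` to `NOCASE`, everything else rejected) is an assumption made for the example. Note that SQLite's built-in `NOCASE` folds only ASCII letters, so it only approximates MySQL's `_ci` collations.

```python
import sqlite3


def sqlite_collation_for(mysql_collation: str) -> str:
    """Hypothetical mapping from a MySQL collation name to a built-in SQLite collation."""
    name = mysql_collation.lower()
    if name.endswith("_bin"):
        return "BINARY"  # byte-wise comparison
    if name.endswith("_ci"):
        return "NOCASE"  # case-insensitive for ASCII letters only
    # When in doubt, fail loudly instead of guessing.
    raise ValueError(f"No SQLite equivalent known for collation {mysql_collation!r}")


def column_definition(column: str, mysql_collation: str) -> str:
    """Build a TEXT column definition that carries the mapped collation."""
    return f'"{column}" TEXT COLLATE {sqlite_collation_for(mysql_collation)}'


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t ("
    + column_definition("a", "utf8mb4_bin")
    + ", "
    + column_definition("b", "utf8mb4_unicode_ci")
    + ")"
)
conn.execute("INSERT INTO t VALUES ('Abc', 'Abc')")

# The column's declared collation decides how the comparison behaves.
bin_match = conn.execute("SELECT COUNT(*) FROM t WHERE a = 'abc'").fetchone()[0]
ci_match = conn.execute("SELECT COUNT(*) FROM t WHERE b = 'abc'").fetchone()[0]
```

With this mapping, the `utf8mb4_bin` column no longer matches `'abc'` against `'Abc'`, while the `utf8mb4_unicode_ci` column still does.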
More particularly, the suggestion is:
- Throwing an exception if we see a collation that refers to a different character set or to different comparison rules.
- Adding an option like `enforce_utf8_charset`, defaulting to `false`, for all the other cases. When the option is `true` and we see a query with a mismatched but "salvageable" encoding or collation definition, we'd log a warning and use a sensible default encoding. When the option is `false`, we'd throw a fatal error explaining what happened and that there's an option you can use. It could also be called `strict_mode` or so.
What's "salvageable" is quite arbitrary and it's easier to say what isn't.
For example, an incompatible set of collation rules would lead to very different application behavior, and I'd just throw an error right away. The MySQL collation documentation page explains the different parts of the collation suffix.
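To make the suffix parts concrete, here is a small Python sketch that splits a MySQL collation name into its charset, its rule set, and its trailing suffixes. The suffix meanings follow MySQL's naming conventions; `parse_collation` itself is a hypothetical helper, and it assumes that the charset is always the first underscore-separated token of the name.

```python
# Suffix meanings per MySQL's collation naming conventions.
SUFFIX_MEANINGS = {
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ks": "kana-sensitive",
    "bin": "binary",
}


def parse_collation(name: str) -> dict:
    """Split a collation name such as 'utf8mb4_0900_ai_ci' into its parts."""
    parts = name.lower().split("_")
    charset = parts[0]  # assumption: charset names contain no underscore
    suffixes = []
    # Peel known suffixes off the end, keeping their original order.
    while len(parts) > 1 and parts[-1] in SUFFIX_MEANINGS:
        suffixes.insert(0, parts.pop())
    return {
        "charset": charset,
        "rules": "_".join(parts[1:]),  # e.g. 'unicode', '0900', 'general'
        "suffixes": suffixes,
    }


parsed = parse_collation("utf8mb4_0900_ai_ci")
# parsed == {"charset": "utf8mb4", "rules": "0900", "suffixes": ["ai", "ci"]}
```

A check like the one described above could then compare `parsed["charset"]` and `parsed["suffixes"]` against what the driver can actually honor, and throw when they differ.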
When a table declares `latin1`, `utf16`, or anything that isn't `utf8`, we'd likely break the app by quietly using UTF-8.
On the flip side, I think we're good to rewrite `utf8`, `utf8mb3`, and other similar variations as `utf8mb4` (when the option is set). The Unicode character sets page lists deprecated character sets and recommends using `utf8mb4` instead. That would still change the application behavior, but I can't imagine a plugin that relies on collating only the first 3 bytes of every UTF-8 character and not the fourth byte. Similarly, `utf8mb4_general_ci` is deprecated in favor of `utf8mb4_unicode_ci`, and I think we could treat them as the same collation.
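The charset handling described so far could look roughly like the following Python sketch. Everything here is illustrative: `resolve_charset`, `UnsupportedCharsetError`, and the exact set of "salvageable" charsets are assumptions, and only the `enforce_utf8_charset` option name comes from the proposal.

```python
import logging

logger = logging.getLogger("sqlite_driver_sketch")

TARGET_CHARSET = "utf8mb4"
# Assumption: only these UTF-8 aliases count as "salvageable".
SALVAGEABLE_CHARSETS = {"utf8", "utf8mb3"}


class UnsupportedCharsetError(ValueError):
    """Raised when a declared charset cannot be safely treated as utf8mb4."""


def resolve_charset(declared: str, enforce_utf8_charset: bool = False) -> str:
    """Return the charset to use, or raise if the declared one is unsafe."""
    name = declared.strip().lower()
    if name == TARGET_CHARSET:
        return TARGET_CHARSET
    if name in SALVAGEABLE_CHARSETS:
        if enforce_utf8_charset:
            # Option set: warn and fall back to the default encoding.
            logger.warning("Charset %r rewritten as %r", declared, TARGET_CHARSET)
            return TARGET_CHARSET
        # Option not set: fail, and point the user at the option.
        raise UnsupportedCharsetError(
            f"Charset {declared!r} is not {TARGET_CHARSET}. "
            "Set enforce_utf8_charset to rewrite it automatically."
        )
    # latin1, utf16, and anything else: never rewrite silently.
    raise UnsupportedCharsetError(
        f"Charset {declared!r} is incompatible with UTF-8 storage."
    )
```

For example, `resolve_charset("utf8mb3", enforce_utf8_charset=True)` logs a warning and returns `"utf8mb4"`, while `resolve_charset("latin1", enforce_utf8_charset=True)` raises regardless of the option.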
The Unicode character sets page also discusses other variations, such as general_, language-specific character sets, etc. It's important that we're aware of this general problem space and, when in doubt, default to throwing an error instead of continuing silently.
The specific suggestions above come from @adamziel. See also Automattic/sqlite-database-integration#21 (comment).