Charset and collation support #192

JanJakes · 2025-06-02T13:19:55Z

Currently, the SQLite driver uses utf8mb4 as the database charset, and it saves the table and column charsets and collations to the information schema, but it doesn't verify or consider them in any way.

The general idea is to check which charset and collation were specified and verify their compatibility with Unicode (SQLite). If we find a charset or collation that would lead to incorrect application behavior, we should throw an error.

For collations, we should also try to match them to SQLite-supported collations and apply them correctly. The current driver simply adds COLLATE NOCASE to all textual columns, because it's the default MySQL behavior, but we need to reflect what was defined in the table/column definition.
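As a rough illustration of what "matching" could mean, here is a minimal Python sketch using SQLite's built-in BINARY and NOCASE collations. The helper `sqlite_collation_for` and its suffix rules are hypothetical and not part of the driver; in particular, mapping `_cs` to BINARY is only an approximation, and SQLite's NOCASE folds ASCII letters only.

```python
import sqlite3


def sqlite_collation_for(mysql_collation: str) -> str:
    """Hypothetical mapping from a MySQL collation name to a built-in SQLite collation."""
    name = mysql_collation.lower()
    if name == "binary" or name.endswith("_bin") or name.endswith("_cs"):
        # Assumption: treat case-sensitive collations as byte-wise comparison.
        return "BINARY"
    if name.endswith("_ci"):
        # Note: SQLite's NOCASE only folds the 26 ASCII letters.
        return "NOCASE"
    # When in doubt, fail loudly instead of guessing.
    raise ValueError(f"Unsupported collation: {mysql_collation}")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t ("
    f" ci TEXT COLLATE {sqlite_collation_for('utf8mb4_general_ci')},"
    f" bin TEXT COLLATE {sqlite_collation_for('utf8mb4_bin')})"
)
conn.execute("INSERT INTO t VALUES ('Hello', 'Hello')")

# The column's declared collation decides how the comparison behaves.
ci_match = conn.execute("SELECT COUNT(*) FROM t WHERE ci = 'hello'").fetchone()[0]
bin_match = conn.execute("SELECT COUNT(*) FROM t WHERE bin = 'hello'").fetchone()[0]
print(ci_match, bin_match)  # 1 0
```

The point of the sketch is that the collation is chosen per column from its definition, rather than adding COLLATE NOCASE everywhere.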

More particularly, @adamziel suggested:

- Throwing an exception if we see a collation that refers to a different character set or to different comparison rules.
- Adding an option like enforce_utf8_charset, defaulting to false, for all the other cases. When the option is true and we see a query with a mismatched but "salvageable" encoding or collation definition, we'd log a warning and use a sensible default encoding. When the option is false, we'd throw a fatal error explaining what happened and that there's an option you can use. It could also be called strict_mode or so.

What's "salvageable" is quite arbitrary and it's easier to say what isn't.

For example, an incompatible set of collation rules would lead to very different application behavior, and I'd just throw an error right away. The MySQL collation documentation page explains the different parts of the collation suffix:

Suffix | Meaning
-- | --
_ai | Accent-insensitive
_as | Accent-sensitive
_ci | Case-insensitive
_cs | Case-sensitive
_ks | Kana-sensitive
_bin | Binary
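
To make the suffix table concrete, here is a small Python sketch that reads the trailing suffixes off a collation name. The function name `collation_traits` is made up for illustration, and the parsing rule (walk the underscore-separated parts from the right until an unknown part appears) is an assumption about how such names could be inspected, not the driver's behavior.

```python
# Suffix meanings, taken from the table above.
SUFFIX_MEANINGS = {
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ks": "kana-sensitive",
    "bin": "binary",
}


def collation_traits(collation: str) -> list[str]:
    """Return the traits encoded in the trailing suffixes of a collation name."""
    traits = []
    # Walk the name parts from the right and stop at the first non-suffix part.
    for part in reversed(collation.lower().split("_")):
        if part not in SUFFIX_MEANINGS:
            break
        traits.append(SUFFIX_MEANINGS[part])
    return list(reversed(traits))


print(collation_traits("utf8mb4_0900_ai_ci"))  # ['accent-insensitive', 'case-insensitive']
print(collation_traits("utf8mb4_bin"))         # ['binary']
```

A check like this could feed the decision of whether a declared collation's comparison rules are compatible with what SQLite can provide.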

When a table declares latin1, utf16, or anything that isn't utf8, we'd likely break the app by quietly using UTF-8.

On the flip side, I think we're good to rewrite utf8, utf8mb3, and other similar variations as utf8mb4 (when the option is set). The MySQL Unicode character sets page lists deprecated character sets and recommends using utf8mb4 instead. That would still change the application behavior, but I can't imagine a plugin that relies on collating up to 3 bytes from every UTF-8 character and not the fourth byte. Similarly, utf8mb4_general_ci is deprecated in favor of utf8mb4_unicode_ci and I think we could treat them as the same collation.
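
A minimal Python sketch of the proposed behavior follows. Everything here is hypothetical: the function `resolve_charset`, the exception class, and the exact set of "salvageable" aliases are assumptions for illustration; only the option name enforce_utf8_charset comes from the suggestion above.

```python
import logging

logger = logging.getLogger("charset_check")

# Assumption: aliases treated as "salvageable", following the discussion above.
SALVAGEABLE_CHARSETS = {"utf8", "utf8mb3"}


class UnsupportedCharsetError(Exception):
    """Raised when a declared charset cannot be safely mapped to utf8mb4."""


def resolve_charset(declared: str, enforce_utf8_charset: bool = False) -> str:
    """Return the charset to use, or raise if the declared one is incompatible."""
    charset = declared.lower()
    if charset == "utf8mb4":
        return charset
    if charset in SALVAGEABLE_CHARSETS:
        if enforce_utf8_charset:
            # Salvageable and the option is on: warn and fall back to utf8mb4.
            logger.warning("Charset %r rewritten to 'utf8mb4'.", declared)
            return "utf8mb4"
        raise UnsupportedCharsetError(
            f"Charset {declared!r} is not utf8mb4. "
            "Set enforce_utf8_charset to rewrite it to utf8mb4."
        )
    # latin1, utf16, etc.: never rewritten silently, regardless of the option.
    raise UnsupportedCharsetError(f"Charset {declared!r} is not supported.")


print(resolve_charset("utf8mb3", enforce_utf8_charset=True))  # utf8mb4
```

With the option left at its default of false, `resolve_charset("utf8")` raises with a message pointing at the option, and `resolve_charset("latin1", True)` raises in either mode.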

The Unicode character sets page also discusses other variations, such as the general_ collations, language-specific collations, etc. It's important that we're aware of this general problem space and, when in doubt, default to throwing an error instead of continuing silently.

See also Automattic/sqlite-database-integration#21 (comment).


JanJakes marked this as a duplicate of Automattic/sqlite-database-integration#25 Jun 2, 2025

JanJakes mentioned this issue Jun 2, 2025

Charset and collation support Automattic/sqlite-database-integration#25

Closed
