Charset and collation support #192

JanJakes opened this issue Jun 2, 2025 · 0 comments

JanJakes commented Jun 2, 2025

Currently, the SQLite driver uses utf8mb4 as the database charset, and it saves the table and column charsets and collations to the information schema, but it doesn't verify or consider them in any way.

The general idea is to check which charset and collation were specified and verify their compatibility with Unicode (SQLite). If we find a charset or collation that would lead to incorrect application behavior, we should throw an error.

For collations, we should also try to match them to SQLite-supported collations and apply them correctly. The current driver simply adds COLLATE NOCASE to all textual columns, because it's the default MySQL behavior, but we need to reflect what was defined in the table/column definition.
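
To illustrate the matching step, here is a minimal Python sketch (the driver itself is not written in Python). The mapping table and the function names `sqlite_collation_for` and `count_matches` are hypothetical and not part of the driver. The sketch assumes only SQLite's built-in collations (BINARY, NOCASE, RTRIM) are available, and it rejects anything it cannot map instead of silently falling back to NOCASE.

```python
import sqlite3

# Hypothetical mapping from a MySQL collation suffix to a built-in SQLite
# collation. Anything not listed here is treated as unsupported.
SUFFIX_TO_SQLITE = {"_bin": "BINARY", "_ci": "NOCASE"}


def sqlite_collation_for(mysql_collation):
    """Return a built-in SQLite collation name, or raise if none matches."""
    for suffix, sqlite_name in SUFFIX_TO_SQLITE.items():
        if mysql_collation.lower().endswith(suffix):
            return sqlite_name
    raise ValueError(f"No SQLite equivalent for collation {mysql_collation!r}")


def count_matches(mysql_collation):
    """Create a column with the mapped collation and count rows equal to 'ALICE'."""
    conn = sqlite3.connect(":memory:")
    collate = sqlite_collation_for(mysql_collation)
    conn.execute(f"CREATE TABLE t (name TEXT COLLATE {collate})")
    conn.executemany("INSERT INTO t VALUES (?)", [("Alice",), ("alice",)])
    (n,) = conn.execute("SELECT COUNT(*) FROM t WHERE name = 'ALICE'").fetchone()
    conn.close()
    return n


print(count_matches("utf8mb4_general_ci"))  # NOCASE: both rows match
print(count_matches("utf8mb4_bin"))         # BINARY: no row matches
```

One caveat: SQLite's NOCASE folds only ASCII letters, so even a `_ci` collation mapped this way would not behave exactly like MySQL for non-ASCII text.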


More particularly, @adamziel suggested:

  • Throwing an exception if we see a collation referring to a different character set or comparison rules
  • Adding an option like enforce_utf8_charset that defaults to false and covers all the other cases. When the option is true and we see a query with a mismatched but "salvageable" encoding or collation definition, we'd log a warning and use a sensible default encoding. When the option is false, we'd throw a fatal error explaining what happened and that there's an option you can use. It could also be called strict_mode or so.
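
The control flow of the two suggestions could look roughly like the following Python sketch. Only the option name `enforce_utf8_charset` comes from the suggestion above; `resolve_mismatch`, `UnsupportedCharsetError`, the `salvageable` flag, and the logger name are hypothetical, and how "salvageable" gets decided is left to the caller.

```python
import logging

logger = logging.getLogger("sqlite_driver_sketch")  # hypothetical logger name


class UnsupportedCharsetError(Exception):
    """Raised when a declared charset or collation cannot be used safely."""


def resolve_mismatch(declared, salvageable, enforce_utf8_charset=False,
                     default="utf8mb4"):
    """Decide what to do with a charset that does not match the driver's own.

    Returns the charset to use, or raises UnsupportedCharsetError.
    """
    if not salvageable:
        # Different character set or comparison rules: always an error.
        raise UnsupportedCharsetError(
            f"{declared!r} is incompatible with the driver's {default!r} charset."
        )
    if not enforce_utf8_charset:
        # Salvageable, but the user has not opted in to rewriting.
        raise UnsupportedCharsetError(
            f"{declared!r} is not supported as declared. Set the "
            f"enforce_utf8_charset option to use {default!r} instead."
        )
    logger.warning("Replacing charset %r with %r", declared, default)
    return default
```

With the option off, both cases fail loudly and differ only in the error message, which matches the idea of pointing the user at the option when a rewrite is possible.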

What's "salvageable" is quite arbitrary and it's easier to say what isn't.

For example, an incompatible set of collation rules would lead to very different application behavior, and I'd just throw an error right away. The MySQL collation doc page explains the different parts of the collation suffix:

Suffix | Meaning
-- | --
_ai | Accent-insensitive
_as | Accent-sensitive
_ci | Case-insensitive
_cs | Case-sensitive
_ks | Kana-sensitive
_bin | Binary
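
As a rough illustration of how the suffixes in this table could be read off a collation name, the Python sketch below scans the name from the right and stops at the first part that is not a known suffix. The function name `collation_attributes` is hypothetical. Scanning from the right is a deliberate choice: in a name such as utf8mb4_cs_0900_ai_ci, the `cs` in the middle is a language code, not the case-sensitive suffix.

```python
# Suffix meanings, taken from the table above.
SUFFIX_MEANINGS = {
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ks": "kana-sensitive",
    "bin": "binary",
}


def collation_attributes(collation):
    """Return the meanings of the trailing suffixes of a collation name."""
    parts = collation.lower().split("_")
    found = []
    # Skip parts[0] (the charset) and walk backwards over trailing suffixes.
    for part in reversed(parts[1:]):
        if part not in SUFFIX_MEANINGS:
            break
        found.append(SUFFIX_MEANINGS[part])
    found.reverse()
    return found


print(collation_attributes("utf8mb4_0900_as_cs"))
print(collation_attributes("utf8mb4_general_ci"))
```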

When a table declares latin1, utf16, or anything that isn't utf8, we'd likely break the app by quietly using utf-8.

On the flip side, I think we're good to rewrite utf8, utf8mb3, and other similar variations as utf8mb4 (when the option is set). The Unicode character sets page lists deprecated character sets and recommends using utf8mb4 instead. That would still change the application behavior, but I can't imagine a plugin that relies on collating up to 3 bytes from every UTF-8 character and not the fourth byte. Similarly, utf8mb4_general_ci is deprecated in favor of utf8mb4_unicode_ci and I think we could treat them as the same collation.
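
A minimal Python sketch of that split, assuming only utf8 and utf8mb3 count as rewritable aliases (the "other similar variations" are not enumerated here). The names `classify_charset` and `upgrade_collation` and the three category labels are hypothetical.

```python
# Assumption: only these two names are treated as rewritable to utf8mb4.
UTF8_ALIASES = {"utf8", "utf8mb3"}


def classify_charset(charset):
    """Sort a declared charset into one of three hypothetical categories."""
    name = charset.lower()
    if name == "utf8mb4":
        return "compatible"
    if name in UTF8_ALIASES:
        return "salvageable"
    # latin1, utf16, and anything else that isn't UTF-8.
    return "incompatible"


def upgrade_collation(collation):
    """Rewrite a utf8_/utf8mb3_ collation prefix to utf8mb4_, else return as-is."""
    prefix, sep, rest = collation.lower().partition("_")
    if sep and prefix in UTF8_ALIASES:
        return "utf8mb4_" + rest
    return collation


print(classify_charset("utf8mb3"))           # salvageable
print(classify_charset("latin1"))            # incompatible
print(upgrade_collation("utf8_general_ci"))  # utf8mb4_general_ci
```

Defaulting unknown names to "incompatible" keeps the sketch on the side of throwing an error when in doubt.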

The Unicode character sets page also discusses other variations, such as general_, language-specific character sets, etc. It's important that we're aware of this general problem space and, when in doubt, default to throwing an error instead of continuing silently.

See also Automattic/sqlite-database-integration#21 (comment).
