UTF-8, UTF-16 and UTF-32 support #344

Open
@awvwgk

Description

String support in stdlib is currently limited to ASCII. @wclodius2 brought up the issue of supporting UTF-8, UTF-16 and UTF-32 as well:

FWIW, for a "string type" to supplant the intrinsic character, I would make the internal representation an integer array so that it is straightforward to extend it to represent UCS/Unicode. The integer type could be INT8 if a UTF-8 representation is desired, INT16 for a UTF-16 representation, or INT32 for UTF-32. I would expect the UTF-32 representation to be the most straightforward to implement and best for East Asian ideographs, UTF-8 to be the most efficient for most European and Semitic languages, and UTF-16 the most efficient for most of the rest of the world.

Originally posted by @wclodius2 in #334 (comment)
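To make the integer-array idea in the quote above concrete, here is a minimal Python sketch (not stdlib code, and not Fortran) that turns a string into a list of integer code units for each encoding form. The names `UNIT_BYTES`, `code_units` and `storage_bytes` are hypothetical and exist only for this illustration; the unit widths of 1, 2 and 4 bytes correspond to the INT8, INT16 and INT32 kinds mentioned in the quote.

```python
# Code-unit width in bytes for each encoding form (INT8 / INT16 / INT32).
UNIT_BYTES = {"utf-8": 1, "utf-16": 2, "utf-32": 4}


def code_units(text, form):
    """Return the code units of `text` as a list of non-negative integers."""
    width = UNIT_BYTES[form]
    # Use the little-endian codecs so that no byte-order mark is emitted.
    codec = form if width == 1 else form + "-le"
    raw = text.encode(codec)
    return [
        int.from_bytes(raw[i:i + width], "little")
        for i in range(0, len(raw), width)
    ]


def storage_bytes(text, form):
    """Bytes needed to store `text` as an array of code units."""
    return len(code_units(text, form)) * UNIT_BYTES[form]


if __name__ == "__main__":
    for sample in ("abc", "é", "漢字", "😀"):
        sizes = {form: storage_bytes(sample, form) for form in UNIT_BYTES}
        print(repr(sample), sizes)
```

For the sample strings, "abc" needs 3, 6 and 12 bytes in UTF-8, UTF-16 and UTF-32 respectively, while "漢字" needs 6, 4 and 8. Only UTF-32 has exactly one code unit per code point; a character outside the Basic Multilingual Plane such as "😀" takes two UTF-16 code units (a surrogate pair) and four UTF-8 code units, which is part of what makes the variable-width forms harder to index and slice.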

Implementing to_title will require more than ASCII. Allowing more than just ASCII will require access to the Unicode Character Database (https://unicode.org/ucd/). This database will also be required for to_upper, to_lower, and reverse if more than ASCII is involved. It consists of several tens of megabytes of files (http://www.unicode.org/Public/UCD/latest/). Including it in the Standard Library will be controversial, but requiring users to download and install it on their own will also be controversial. FWIW I have a couple of modules to process the more important files in the database.

Originally posted by @wclodius2 in #335 (comment)
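The dependence on the Unicode Character Database described in the quote above can be illustrated with a short Python sketch. Python is used here only as a stand-in because its `str` methods and its `unicodedata` module are backed by tables derived from the UCD; this says nothing about how stdlib would implement the routines. `ascii_to_upper` and `naive_reverse` are hypothetical helpers written for this example.

```python
import unicodedata


def ascii_to_upper(text):
    """Upper-case only the ASCII letters a-z; leave everything else as is."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch for ch in text
    )


def naive_reverse(text):
    """Reverse code point by code point, ignoring combining marks."""
    return text[::-1]


if __name__ == "__main__":
    word = "straße"
    print(ascii_to_upper(word))  # ASCII-only rule leaves the sharp s alone
    print(word.upper())          # UCD-backed mapping expands it to "SS"

    # U+01C6 has distinct upper-case (U+01C4) and title-case (U+01C5) forms,
    # so to_title cannot be expressed as "to_upper on the first letter".
    dz = "\u01c6"
    print([hex(ord(c)) for c in (dz.upper(), dz.title())])

    # "e" followed by a combining acute accent: a naive reversal moves the
    # accent in front of the base letter. Detecting combining marks needs
    # the canonical combining class, another UCD property.
    accented = "e\u0301"
    print([hex(ord(c)) for c in naive_reverse(accented)])
    print(unicodedata.combining("\u0301"))

    # Version of the Unicode database bundled with this Python build.
    print(unicodedata.unidata_version)
```

The sketch touches three kinds of data that a non-ASCII implementation would have to carry or load: simple and special case mappings, the separate title-case mapping, and the combining class needed to keep accents attached to their base character when reversing.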

Labels: topic: utilities (containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...)
