Description
String support in stdlib is currently limited to ASCII. @wclodius2 brought up the issue of supporting UTF-8, UTF-16, and UTF-32 as well:
> FWIW for a "string type" to supplant the intrinsic `character` I would make the internal representation an integer array so that it is straightforward to extend it to represent UCS/Unicode. The integer type could be either `INT8` if a UTF-8 representation is desired, `INT16` for a UTF-16 representation, or `INT32` for UTF-32. I would expect the UTF-32 representation would be the most straightforward to implement and best for East Asian ideographs, UTF-8 would be the most efficient for most European and Semitic languages, UTF-16 the most efficient for most of the rest of the world.
Originally posted by @wclodius2 in #334 (comment)
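
To make the trade-off in the quote concrete, here is a minimal sketch that stores text as an `int32` array of code points and counts how many code units each encoding form would need. The type `ucs4_string` and the functions `utf8_units` and `utf16_units` are hypothetical names chosen for illustration; they are not part of stdlib, and the sketch does no validation (for example, it does not reject surrogate code points).

```fortran
program code_unit_demo
    use, intrinsic :: iso_fortran_env, only: int32
    implicit none

    ! Hypothetical type (not part of stdlib): one int32 element per code point.
    type :: ucs4_string
        integer(int32), allocatable :: cp(:)
    end type ucs4_string

    type(ucs4_string) :: s
    integer :: n8, n16, n32

    ! Four code points: U+0061, U+00E9, U+65E5, U+1F600
    s%cp = [integer(int32) :: int(z'61', int32), int(z'E9', int32), &
            int(z'65E5', int32), int(z'1F600', int32)]

    n8  = sum(utf8_units(s%cp))    ! 8-bit code units
    n16 = sum(utf16_units(s%cp))   ! 16-bit code units
    n32 = size(s%cp)               ! 32-bit code units, one per code point

    print '(a,i0,a,i0,a)', 'UTF-8:  ', n8,  ' code units, ', n8,      ' bytes'
    print '(a,i0,a,i0,a)', 'UTF-16: ', n16, ' code units, ', 2 * n16, ' bytes'
    print '(a,i0,a,i0,a)', 'UTF-32: ', n32, ' code units, ', 4 * n32, ' bytes'
    ! Prints:
    !   UTF-8:  10 code units, 10 bytes
    !   UTF-16: 5 code units, 10 bytes
    !   UTF-32: 4 code units, 16 bytes

contains

    ! Number of 8-bit code units UTF-8 needs for one code point.
    elemental integer function utf8_units(cp) result(n)
        integer(int32), intent(in) :: cp
        if (cp < int(z'80', int32)) then
            n = 1
        else if (cp < int(z'800', int32)) then
            n = 2
        else if (cp < int(z'10000', int32)) then
            n = 3
        else
            n = 4
        end if
    end function utf8_units

    ! Number of 16-bit code units UTF-16 needs for one code point:
    ! code points above U+FFFF take a surrogate pair.
    elemental integer function utf16_units(cp) result(n)
        integer(int32), intent(in) :: cp
        if (cp < int(z'10000', int32)) then
            n = 1
        else
            n = 2
        end if
    end function utf16_units

end program code_unit_demo
```

One detail to keep in mind with the `INT8` variant: Fortran integers are signed, so UTF-8 bytes at or above 0x80 would appear as negative values in an `integer(int8)` array.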
> Implementing `to_title` will require more than ASCII. Allowing more than just ASCII will require access to the Unicode character database, https://unicode.org/ucd/. This database will also be required for `to_upper`, `to_lower`, and `reverse` if more than ASCII is involved. This database consists of several tens of megabytes of files, http://www.unicode.org/Public/UCD/latest/, and including it in the Standard Library will be controversial, but requiring users to download and install it on their own will also be controversial. FWIW I have a couple of modules to process the more important files in the database.
Originally posted by @wclodius2 in #335 (comment)
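
As a rough illustration of what processing the database involves, the sketch below extracts the simple uppercase mapping from a single record of `UnicodeData.txt`, the main file of the UCD. Records are semicolon-separated; the code point is the first field and the simple uppercase mapping is the thirteenth (field 12 in the UCD's zero-based numbering). The helpers `get_field` and `hex_to_int` are hypothetical names, not taken from stdlib or from the modules mentioned in the quote, and the record is hard-coded rather than read from the file.

```fortran
program ucd_field_demo
    use, intrinsic :: iso_fortran_env, only: int32
    implicit none

    ! The UnicodeData.txt record for U+0061 LATIN SMALL LETTER A.
    character(len=*), parameter :: line = &
        '0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041'
    integer(int32) :: cp, upper

    cp    = hex_to_int(get_field(line, 1))    ! code point
    upper = hex_to_int(get_field(line, 13))   ! simple uppercase mapping

    print '(a,z4.4,a,z4.4)', 'U+', cp, ' -> U+', upper
    ! Prints: U+0061 -> U+0041

contains

    ! Return the n-th (1-based) semicolon-separated field of a record,
    ! or an empty string if the record has fewer than n fields.
    pure function get_field(record, n) result(field)
        character(len=*), intent(in) :: record
        integer, intent(in) :: n
        character(len=:), allocatable :: field
        integer :: i, k, start

        k = 1
        start = 1
        do i = 1, len(record)
            if (record(i:i) == ';') then
                if (k == n) then
                    field = record(start:i-1)
                    return
                end if
                k = k + 1
                start = i + 1
            end if
        end do
        if (k == n) then
            field = record(start:)
        else
            field = ''
        end if
    end function get_field

    ! Convert an uppercase hexadecimal string to an integer.
    ! Returns -1 for an empty field (no mapping) or an invalid digit.
    pure function hex_to_int(hex) result(val)
        character(len=*), intent(in) :: hex
        integer(int32) :: val
        character(len=*), parameter :: digits = '0123456789ABCDEF'
        integer :: i, d

        val = -1
        if (len(hex) == 0) return
        val = 0
        do i = 1, len(hex)
            d = index(digits, hex(i:i)) - 1
            if (d < 0) then
                val = -1
                return
            end if
            val = 16 * val + d
        end do
    end function hex_to_int

end program ucd_field_demo
```

This only covers simple one-to-one mappings. Case conversions that change the length of a string are listed in a separate UCD file, `SpecialCasing.txt`, which a full implementation would presumably also need to handle.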