Unicode grapheme support #7043

Closed

Description

@huonw
Member

Currently Rust doesn't support Unicode properly; for example, there is no way to iterate over a string by grapheme (there is .iter() for code points and .bytes_iter() for bytes).

Possibly useful: http://useless-factor.blogspot.de/2007/08/unicode-implementers-guide-part-4.html

Activity

mleise commented on Jun 10, 2013

To force people new to Unicode to understand what they iterate over, the names could be chosen so that none looks more common than the others. Grapheme clusters are slow to decode, but they are what you typically want when you need to limit text to the first n characters. Even when storing text in a fixed-size database field, one should be aware that truncating by code points may strip the accent off the last letter. Maybe a scheme like this would ensure the right decision is made in user code:

bytes_iter();
cp_iter();
graph_iter();
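The accent-stripping hazard can be shown with a minimal sketch. It assumes current Rust, where the code point iterator is spelled `chars()` rather than the names discussed in this thread, and `truncate_code_points` is a hypothetical helper written only for this illustration.

```rust
/// Keep the first `n` code points of `s` (the units that `chars()` yields).
/// Hypothetical helper for illustration only.
fn truncate_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // "aỹe" written with a decomposed ỹ: 'y' followed by U+0303 COMBINING TILDE.
    let text = "ay\u{0303}e";

    // Cutting after two code points keeps the 'y' but silently drops its tilde.
    let cut = truncate_code_points(text, 2);
    assert_eq!(cut, "ay");

    // Cutting after three code points happens to keep the whole cluster.
    assert_eq!(truncate_code_points(text, 3), "ay\u{0303}");

    println!("{} -> {}", text, cut);
}
```

A grapheme-aware truncation would count "y" plus its tilde as one unit, so it could not split the two.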

Kimundi commented on Jun 10, 2013
Member

Yeah, I'm of the same opinion.
If iterators exist for all three, none of them should be the shorter default; people need to think about which one they need.

The docs for each of those three functions should also contain a short example along these lines:

/// Returns an iterator over the graphemes of a string.
///
/// Which string iterator do I need?
/// - "aỹe".iter_graph() => iterates "a", "ỹ", "e"
/// - "aỹe".iter_cp()    => iterates 'a', 'y', '\u0303', 'e'
/// - "aỹe".iter_bytes() => iterates 0x61, 0x79, 0xcc, 0x83, 0x65
fn iter_graph() ...
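The three views in the doc comment above can be reproduced with a rough sketch in current Rust. It assumes the modern method names `bytes()` and `chars()`. Since the standard library has no grapheme iterator, `simple_clusters` is a hypothetical stand-in that only attaches code points from the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding code point. It is not the UAX #29 algorithm.

```rust
/// True for code points in the Combining Diacritical Marks block.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Group each code point with the combining marks that follow it.
/// Simplified illustration only; real grapheme segmentation follows UAX #29.
fn simple_clusters(s: &str) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining_mark(c) {
            // Attach the mark to the previous cluster, if there is one.
            if let Some(last) = clusters.last_mut() {
                last.push(c);
                continue;
            }
        }
        // Otherwise the code point starts a new cluster.
        clusters.push(c.to_string());
    }
    clusters
}

fn main() {
    // "aỹe" with ỹ spelled as 'y' + U+0303, as in the doc comment.
    let s = "ay\u{0303}e";

    let bytes: Vec<u8> = s.bytes().collect();
    let code_points: Vec<char> = s.chars().collect();
    let clusters = simple_clusters(s);

    assert_eq!(bytes, [0x61, 0x79, 0xcc, 0x83, 0x65]);
    assert_eq!(code_points, ['a', 'y', '\u{0303}', 'e']);
    assert_eq!(clusters, ["a", "y\u{0303}", "e"]);
}
```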

msullivan commented on Jul 29, 2013
Contributor

Nominating for backwards compatibility milestone, I suppose? String handling is a pretty fundamental part of the libraries.

bluss commented on Aug 2, 2013
Member

Normalization forms will matter too. Kimundi's 4-codepoint string "aỹe" is 3 codepoints in NFC normalization.

We have a 1-to-1 encoding of UTF-8 now, so at least bytewise equality is the same as charwise equality. Should string equality hold across Unicode normalizations too?
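The question can be made concrete with a small sketch in current Rust, assuming nothing beyond the standard library. `same_bytes` and `same_code_points` are hypothetical helpers for this illustration. The two spellings of ỹ render the same but compare unequal both bytewise and charwise, because `str` equality does not normalize.

```rust
/// Bytewise comparison of the UTF-8 encodings.
fn same_bytes(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

/// Charwise comparison, one code point at a time.
fn same_code_points(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

fn main() {
    let composed = "a\u{1EF9}e"; // ỹ as the single code point U+1EF9 (NFC form)
    let decomposed = "ay\u{0303}e"; // ỹ as 'y' + U+0303 COMBINING TILDE

    assert_eq!(composed.chars().count(), 3);
    assert_eq!(decomposed.chars().count(), 4);

    // Bytewise and charwise equality agree with each other...
    assert_eq!(
        same_bytes(composed, decomposed),
        same_code_points(composed, decomposed)
    );
    // ...but neither treats the two normalization forms as equal.
    assert!(composed != decomposed);
}
```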

Kimundi commented on Aug 2, 2013
Member

Rust strings are defined as UTF-8, but NFC normalization is an additional property on top of that.

I think we should provide functions for explicitly normalizing a str, and maybe add Iterator adapters that normalize lazily, but I don't think it is something that should happen automatically (it would generally mean more allocations).

However, if we ever get user definable unsized types, nothing would speak against having a nfc_str, where the invariant 'nfc normalized utf8' holds.

bluss commented on Aug 18, 2013
Member

Also needed: a replacement for .word_iter() that takes Unicode properties into account.

http://www.unicode.org/reports/tr29
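
As a small illustration of why Unicode properties matter for word splitting, here is a sketch in current Rust. It assumes today's standard library methods `split_whitespace` (which uses the Unicode White_Space property) and `split_ascii_whitespace`, not the `.word_iter()` of 2013. It covers whitespace only and does not implement the TR29 word boundary rules.

```rust
/// Split on any code point with the Unicode White_Space property.
fn words_unicode(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Split on ASCII whitespace only.
fn words_ascii(s: &str) -> Vec<&str> {
    s.split_ascii_whitespace().collect()
}

fn main() {
    // Two words separated by U+3000 IDEOGRAPHIC SPACE.
    let text = "caf\u{e9}\u{3000}na\u{ef}ve";

    // The Unicode-aware splitter sees two words.
    assert_eq!(words_unicode(text), ["caf\u{e9}", "na\u{ef}ve"]);

    // The ASCII-only splitter sees a single unbroken run.
    assert_eq!(words_ascii(text), ["caf\u{e9}\u{3000}na\u{ef}ve"]);
}
```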

catamorphism commented on Sep 5, 2013
Contributor

Not backwards-incompatible; accepted for feature-complete

emberian commented on Feb 17, 2014
Member

Visiting for triage. This is still as important as ever. To my knowledge, no progress has been made.

pzol commented on Feb 26, 2014
Contributor

To be consistent with current naming, my suggestions would be:

bytes()
chars() // codepoints
graphemes()

I'd like to work on the graphemes() iterator.

Kimundi commented on Feb 26, 2014
Member

@pzol: Yeah, something like that would work. For the name, seeing how "grapheme cluster" is the correct name, an alternative to graphemes could also be clusters.

self-assigned this
on Feb 26, 2014

pnkfelix commented on Mar 20, 2014
Member

We can add support for graphemes backwards compatibly. Therefore not a backwards-compatibility issue. Not tagging as a 1.0 blocker.

Assigning P-low, not 1.0.

Meyermagic commented on Mar 24, 2014
Contributor

I'm working on a patch for grapheme cluster iteration here: https://github.com/Meyermagic/rust/compare/graphemecluster

Still need to clean it up, optimize, write tests, etc. There are probably some code style issues, too.

kwantam commented on Jul 11, 2014
Contributor

Folks: I added a Graphemes iterator to the UnicodeStrSlice trait: #15619

Comments appreciated.

added a commit that references this issue on Jul 15, 2014

auto merge of #15619 : kwantam/rust/master, r=huonw

2692ae1
added a commit that references this issue on Apr 8, 2021

Auto merge of rust-lang#7043 - camsteffen:dead-utils, r=flip1995

Metadata

Labels: A-Unicode (Area: Unicode), C-enhancement (Category: An issue proposing an enhancement or a PR with one.), P-low (Low priority)

Participants: @pzol, @pnkfelix, @msullivan, @catamorphism, @Meyermagic