Closed
Description
Currently Rust doesn't support Unicode properly, e.g. there is no way to iterate over a string by grapheme (there is .iter() for codepoints, and .bytes_iter() for bytes).
Possibly useful: http://useless-factor.blogspot.de/2007/08/unicode-implementers-guide-part-4.html
Activity
mleise commented on Jun 10, 2013
To force people new to Unicode to understand what they iterate over, the names could be chosen so that none of them looks more common than the others. Grapheme clusters are slow to decode, but they are what you typically want when you need to limit text to the first n characters. Even when storing text in a fixed-size database field, one should be aware that truncating by code points may strip the accent off the last letter. Maybe a scheme like this would ensure the right decision is made in user code:
Kimundi commented on Jun 10, 2013
Yeah, I'm of the same opinion.
If there are iterators for all three, none of them should be the shorter default; people need to think about which one they need.
The docs for each of those three functions should also contain a short example along those lines:
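A minimal sketch of what such a doc example might show, assuming present-day Rust method names (`bytes()`, `chars()`) rather than the 2013 names discussed in this thread. The helper `approx_clusters` is hypothetical: it only treats combining marks in U+0300..=U+036F as extending the previous character, which is a rough stand-in for real grapheme cluster segmentation (UAX #29), not an implementation of it.

```rust
/// Hypothetical helper: counts code points that are NOT combining marks
/// in U+0300..=U+036F. This only approximates grapheme clusters for
/// simple Latin text; the full UAX #29 rules are much more involved.
fn approx_clusters(s: &str) -> usize {
    s.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .count()
}

fn main() {
    // 'a', 'y', COMBINING TILDE (U+0303), 'e'
    let s = "ay\u{0303}e";

    // The same string measured in three different units.
    println!("bytes:           {}", s.bytes().count());
    println!("code points:     {}", s.chars().count());
    println!("approx clusters: {}", approx_clusters(s));
}
```

The three counts differ for the same string, which is the reason none of the iterators should look like the obvious default.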
msullivan commented on Jul 29, 2013
Nominating for backwards compatibility milestone, I suppose? String handling is a pretty fundamental part of the libraries.
bluss commented on Aug 2, 2013
Normalization forms will matter too. Kimundi's 4-codepoint string "aỹe" is 3 codepoints in NFC normalization. We have a 1-to-1 encoding of UTF-8 now, so at least the bytewise equality is the same as the charwise equality. Should string equality hold across Unicode normalizations too?
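To make the question concrete, here is a small sketch using only the standard library of present-day Rust. The two literals are hand-written decomposed and precomposed spellings of the same text (an assumption made for illustration; no normalization routine is called), and `code_points` is a hypothetical helper name.

```rust
/// Hypothetical helper: number of Unicode code points (`char`s) in `s`.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Decomposed: 'a', 'y', COMBINING TILDE (U+0303), 'e'
    let decomposed = "ay\u{0303}e";
    // Precomposed (the NFC form): 'a', LATIN SMALL LETTER Y WITH TILDE (U+1EF9), 'e'
    let composed = "a\u{1EF9}e";

    println!("decomposed code points: {}", code_points(decomposed));
    println!("composed code points:   {}", code_points(composed));

    // `==` on `str` compares the UTF-8 bytes, so the two spellings
    // are not equal even though they render identically.
    println!("equal as str: {}", decomposed == composed);
}
```

Both spellings happen to occupy the same number of bytes, yet they compare unequal, so equality across normalization forms would need an explicit normalization step.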
Kimundi commented on Aug 2, 2013
Rust strings are defined as UTF-8, but NFC normalization is an additional property on top of that.
I think we should provide functions for explicitly normalizing a str, and maybe add Iterator adapters that normalize lazily, but I don't think it is something that should happen automatically (it would generally mean more allocations).
However, if we ever get user-definable unsized types, nothing would speak against having an nfc_str, where the invariant 'NFC-normalized UTF-8' holds.
bluss commented on Aug 18, 2013
.word_iter() that takes Unicode properties into account. http://www.unicode.org/reports/tr29
catamorphism commented on Sep 5, 2013
Not backwards-incompatible; accepted for feature-complete
_iter suffixes on specialized Iterator constructors #9440
emberian commented on Feb 17, 2014
Visiting for triage. This is still as important as ever. To my knowledge, no progress has been made.
pzol commented on Feb 26, 2014
In order to be consistent with current naming, my suggestions would be:
I'd like to work on the graphemes() iterator.
Kimundi commented on Feb 26, 2014
@pzol: Yeah, something like that would work. For the name, seeing how "grapheme cluster" is the correct name, an alternative to graphemes could also be clusters.
pnkfelix commented on Mar 20, 2014
We can add support for graphemes backwards compatibly. Therefore not a backwards-compatibility issue. Not tagging as a 1.0 blocker.
Assigning P-low, not 1.0.
Meyermagic commented on Mar 24, 2014
I'm working on a patch for grapheme cluster iteration here: https://github.com/Meyermagic/rust/compare/graphemecluster
Still need to clean it up, optimize, write tests, etc. There are probably some code style issues, too.
kwantam commented on Jul 11, 2014
Folks: I added a Graphemes iterator to the UnicodeStrSlice trait: #15619
Comments appreciated.
auto merge of #15619 : kwantam/rust/master, r=huonw