Can someone who is familiar with the Unicode standard explain the exact mechanism for telling whether a Unicode code point \UAAAABBBB is a supporting grapheme or a standalone grapheme?

For example:

  • U+0045 U+0301 together are rendered as É, with a string length of 1 when counted in graphemes.
  • U+0301 alone is rendered as ́, also with a string length of 1 when counted in graphemes.

How does a program know when to ignore the accent U+0301 in string length (and other functions) and process it together with U+0045 as a single graphical unit, and when not to?

Is there some kind of encoding going on? Is every code point hard-coded with a property of being standalone or supporting, with every supporting code point simply merged into the most recent standalone one before it? Or is there something more dynamic going on?

What is the exact underlying mechanism behind this behaviour?

1 Answer

The character property Grapheme_Cluster_Break is responsible for this. Every code point has exactly one value for this property, and the rules for which pairs of adjacent values may or may not be split determine the grapheme cluster boundaries in any given string. In general, characters with the property values Extend, SpacingMark, and ZWJ combine with their preceding character, but the full set of rules is more complicated than that. You can find the complete specification in section 3 of UAX #29.

A machine-readable version of all property value assignments is available in this data file, and you can also use this tool to get a list of all characters within a certain category, for example by entering [:Grapheme_Cluster_Break=Extend:].
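To give an idea of how a program might consume such a file, here is a minimal Python sketch. It assumes the usual Unicode data file layout of `code point or range ; value # comment`. The embedded `SAMPLE` is a hand-picked, abbreviated excerpt with shortened comments, not the real file, and the helper names `parse_ranges` and `gcb` are made up for this illustration.

```python
import bisect

# Abbreviated sample in the layout "<code point or range> ; <value> # comment".
# The real data file has many more lines; comments here are shortened.
SAMPLE = """\
000A          ; LF      # line feed
000D          ; CR      # carriage return
0300..036F    ; Extend  # combining diacritical marks
1100..115F    ; L       # Hangul leading jamo
1160..11A7    ; V       # Hangul vowel jamo
200D          ; ZWJ     # zero width joiner
"""

def parse_ranges(text):
    """Return a sorted list of (first, last, value) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        codes, value = (part.strip() for part in line.split(";"))
        first, _, last = codes.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), value))
    ranges.sort()
    return ranges

def gcb(ch, ranges):
    """Look up the property value of one character; unlisted ones are 'Other'."""
    cp = ord(ch)
    # Find the last range whose start is <= cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return "Other"

RANGES = parse_ranges(SAMPLE)
print(gcb("\u0301", RANGES))  # Extend
print(gcb("E", RANGES))       # Other
```

So the "hard-coded" part of the mechanism is just a lookup table from code point to property value; everything else comes from the rules applied to those values.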

Edit: Here are a few examples:

  • U+0301 COMBINING ACUTE ACCENT has the value Extend. According to rule GB9, a character with this value forms a single grapheme cluster with whatever character precedes it (e.g. the letter x), unless that character is a control character or line break: ‘x’ + ‘ ́’ = ‘x́’, so x́ is counted as one single unit. If nothing precedes it, as when U+0301 stands alone, it has nothing to attach to and counts as a cluster by itself. Pretty much all characters that are described as combining marks have this value, and you can add as many combining marks to a cluster as you like, since each one attaches to the cluster before it: x̧̞̥̖̉̄͑̕͘.
  • The Hangul script is written with syllable blocks consisting of two or three individual letters each called jamo. U+1100 HANGUL CHOSEONG KIYEOK has the value L (which stands for ‘leading jamo’) and U+1161 HANGUL JUNGSEONG A has the value V (which stands for ‘vowel jamo’). Rule GB6 states that a leading jamo followed by a vowel jamo should form a unit, so the sequence U+1100 U+1161 will be one single grapheme cluster: ‘ᄀ’ + ‘ᅡ’ = ‘가’.
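  • U+270A RAISED FIST has the value E_Base (emoji modifier base) and U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 has the value E_Modifier (emoji modifier). Rule GB10 states that an emoji modifier base followed by an emoji modifier should be treated as one grapheme cluster: ‘✊’ + ‘🏽’ = ‘✊🏽’. (These two values and rule GB10 belong to Unicode 9 and 10; from Unicode 11 onward, emoji modifiers have the value Extend and are covered by GB9.)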

And so on. Those are just some of the rules that exist, and I chose relatively straightforward examples to get the point across. As I said, the full list of rules can be read in UAX #29.
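To make the mechanism concrete, here is a rough Python sketch of how the rules mentioned above could be applied. It is a toy under stated assumptions, not a conforming implementation: `toy_gcb` is a hand-written stand-in for the real property table and covers only a few ranges, the function names are invented for this example, and only a simplified subset of the rules (GB3 to GB9) is handled. Emoji, Control, SpacingMark, Prepend, and the precomposed Hangul values LV and LVT are left out.

```python
def toy_gcb(ch):
    """Toy property lookup covering only a handful of ranges (assumption)."""
    cp = ord(ch)
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if 0x0300 <= cp <= 0x036F:
        return "Extend"   # combining diacritical marks
    if cp == 0x200D:
        return "ZWJ"
    if 0x1100 <= cp <= 0x115F:
        return "L"        # Hangul leading jamo
    if 0x1160 <= cp <= 0x11A7:
        return "V"        # Hangul vowel jamo
    if 0x11A8 <= cp <= 0x11FF:
        return "T"        # Hangul trailing jamo
    return "Other"

def is_boundary(prev, curr):
    """Decide whether a cluster boundary falls between two property values."""
    if prev == "CR" and curr == "LF":
        return False                      # GB3: never split CR LF
    if prev in ("CR", "LF") or curr in ("CR", "LF"):
        return True                       # GB4/GB5 (Control omitted here)
    if prev == "L" and curr in ("L", "V"):
        return False                      # GB6 (LV and LVT omitted)
    if prev == "V" and curr in ("V", "T"):
        return False                      # GB7 (LV omitted)
    if prev == "T" and curr == "T":
        return False                      # GB8 (LVT omitted)
    if curr in ("Extend", "ZWJ"):
        return False                      # GB9: attach to what precedes
    return True                           # GB999: otherwise break

def graphemes(text):
    """Split text into clusters by checking each adjacent pair of characters."""
    clusters = []
    for ch in text:
        if clusters and not is_boundary(toy_gcb(clusters[-1][-1]), toy_gcb(ch)):
            clusters[-1] += ch
        else:
            clusters.append(ch)           # start of text always begins a cluster
    return clusters

print(len(graphemes("E\u0301")))       # 1
print(len(graphemes("\u0301")))        # 1
print(len(graphemes("\u1100\u1161")))  # 1
```

Note that the lone U+0301 still yields one cluster: there is always a boundary at the start of the text, so an Extend character with nothing before it simply starts a cluster of its own. For real work, a library that implements the full UAX #29 rules is the safer choice.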

Could you maybe create a few examples, so it is easier to visualize what you just said? – AlanSTACK
I have added some examples to my answer. – RandomGuy32
