Grapheme Clustering
The Emoji Problem
Quick question: How many characters is this emoji?
👨‍👩‍👧‍👦
Most programming languages say:
# Python
emoji = "👨‍👩‍👧‍👦"
len(emoji)  # 7 ❌
// JavaScript
let emoji = "👨‍👩‍👧‍👦";
emoji.length; // 11 ❌
// Java
String emoji = "👨‍👩‍👧‍👦";
emoji.length(); // 11 ❌
But it's ONE emoji! The "family" emoji is a single visual unit.
SFX gets it right:
Story:
    Family is "👨‍👩‍👧‍👦"
    Length is Family.Length # 1 ✓
What Are Grapheme Clusters?
A grapheme cluster is what humans perceive as a single character:
- a - Simple character (1 grapheme)
- é - Can be one code point OR e + combining accent (still 1 grapheme)
- 👨‍👩‍👧‍👦 - Multiple Unicode code points (but 1 grapheme)
- 🇺🇸 - Two code points (but 1 flag = 1 grapheme)
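As a rough illustration of the "é" case above, the following Python sketch compares the two encodings of the same visible character. It uses only the standard `unicodedata` module; the variable names are chosen here for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point (U+00E9)
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

print(len(precomposed))             # 1 code point
print(len(decomposed))              # 2 code points
print(precomposed == decomposed)    # False: the code point sequences differ

# NFC normalization composes the pair into the single code point.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)      # True
```

Both strings render as one grapheme, which is why counting code points gives different answers for text that looks identical.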
Technical Details (Optional)
Unicode has different representations:
- Code points - Individual Unicode values (U+0041, U+1F600, etc.)
- Code units - UTF-8 bytes, UTF-16 words
- Grapheme clusters - What humans see as characters ✓
Most languages count code points or code units. SFX counts grapheme clusters.
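To make the three representations concrete, here is a small Python sketch that measures the family emoji each way. The string is built from escapes so the code points are explicit; the variable names are illustrative only.

```python
# MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)                          # Python strings index by code point
utf16_units = len(family.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit
utf8_bytes = len(family.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)  # 7 11 25
```

These are the same numbers that Python, JavaScript/Java, and Go/Rust report for this string, while a grapheme-based count reports 1.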
Why This Matters
1. String Length
Story:
    # Simple ASCII
    Name is "Alice"
    Print Name.Length # 5 ✓
    # Emoji
    Emoji is "😀"
    Print Emoji.Length # 1 ✓
    # Complex emoji
    Family is "👨‍👩‍👧‍👦"
    Print Family.Length # 1 ✓
    # Flag emoji
    Flag is "🇺🇸"
    Print Flag.Length # 1 ✓
    # Combining diacritics
    Accented is "é" # Can be one code point or e + combining acute accent (U+0301)
    Print Accented.Length # 1 ✓ (regardless of representation)
Compare to other languages:
# Python counts code points
len("😀")  # 1 (lucky - simple emoji)
len("👨‍👩‍👧‍👦")  # 7 ❌ (complex emoji)
len("🇺🇸")  # 2 ❌ (flag is 2 code points)
2. String Slicing
Story:
    Text is "Hello 👋 World 🌍"
    # Get first 7 graphemes
    Slice is Text.Slice(1, 7)
    Print Slice # "Hello 👋" ✓ (emoji counts as 1)
Other languages:
# Python
text = "Hello 👋 World 🌍"
slice = text[0:7]  # Slices by code point: fine here, but can cut a multi-code-point emoji (👨‍👩‍👧‍👦, 🇺🇸) in half! 💥
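The failure mode is easiest to see with a ZWJ sequence. The Python sketch below (illustrative names, standard library only) slices through the middle of the family emoji by code point index.

```python
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
text = "Hi " + family + "!"

head = text[0:4]   # three ASCII characters plus the first code point of the emoji
rest = text[4:]

print(head == "Hi \U0001F468")   # True: only the MAN code point survived
print(rest[0] == "\u200D")       # True: the remainder starts with a dangling ZWJ
```

The slice is still valid Unicode, so nothing raises an error; the text is simply no longer what the user typed.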
3. Text Truncation
Concept: TextTruncator
    To Truncate with Text and MaxLength:
        If Text.Length <= MaxLength:
            Return Text
        Else:
            Truncated is Text.Slice(1, MaxLength - 3)
            Return Truncated + "..."
Story:
    Create TextTruncator Called T
    Short is "Hello"
    Print T.Truncate with Short and 10 # "Hello"
    Long is "Hello 👋 World 🌍 Everyone 🎉"
    Print T.Truncate with Long and 15 # "Hello 👋 Worl..." ✓ (12 graphemes + "...")
    # Emoji not split!
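For readers who want to reproduce the length, slicing, and truncation behaviour above in a language that counts code points, here is a minimal Python sketch. It is a deliberately simplified approximation of Unicode grapheme segmentation (UAX #29), not SFX's actual implementation: it only handles combining marks, variation selectors, skin tone modifiers, ZWJ sequences, and regional indicator pairs. The names `grapheme_clusters`, `slice_graphemes`, and `truncate` are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"

def _is_regional_indicator(ch):
    return "\U0001F1E6" <= ch <= "\U0001F1FF"

def _extends_previous(ch):
    """True if ch normally attaches to the code point before it."""
    return (
        ch == ZWJ
        or unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, variation selectors
        or "\U0001F3FB" <= ch <= "\U0001F3FF"               # skin tone modifiers
    )

def grapheme_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters = []
    for ch in text:
        if clusters:
            last = clusters[-1]
            # Length of the cluster if it consists only of regional indicators, else 0.
            ri_run = len(last) if all(map(_is_regional_indicator, last)) else 0
            if (
                _extends_previous(ch)
                or last.endswith(ZWJ)                          # ZWJ glues the next emoji on
                or (_is_regional_indicator(ch) and ri_run == 1)  # second half of a flag
            ):
                clusters[-1] = last + ch
                continue
        clusters.append(ch)
    return clusters

def slice_graphemes(text, start, end):
    """1-based, inclusive slice, mirroring Text.Slice(start, end) as used above."""
    return "".join(grapheme_clusters(text)[start - 1:end])

def truncate(text, max_length):
    """Truncate to max_length graphemes, reserving three for the ellipsis."""
    clusters = grapheme_clusters(text)
    if len(clusters) <= max_length:
        return text
    return "".join(clusters[:max_length - 3]) + "..."

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
long_text = "Hello \U0001F44B World \U0001F30D Everyone \U0001F389"

print(len(grapheme_clusters(family)))   # 1
print(slice_graphemes(long_text, 1, 7))
print(truncate(long_text, 15))
```

Production code would normally rely on a library that implements the full segmentation rules; this sketch only covers the patterns discussed in this chapter.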
4. Character Validation
Story:
    # Validate username length
    Username is "Alice🎮"
    MinLength is 3
    MaxLength is 20
    If Username.Length >= MinLength and Username.Length <= MaxLength:
        Print "Valid username" # This prints ✓
    # "Alice🎮" is 6 graphemes (A-l-i-c-e-🎮)
5. Display Width
Concept: TextRenderer
    To PadRight with Text and Width:
        # Pad with spaces to reach width
        CurrentLength is Text.Length
        If CurrentLength >= Width:
            Return Text
        Else:
            Padding is Width - CurrentLength
            Spaces is ""
            Repeat Padding times:
                Spaces is Spaces + " "
            Return Text + Spaces
Story:
    Create TextRenderer Called Renderer
    # Every row is padded to 15 graphemes, even with emoji
    Print Renderer.PadRight with "Name" and 15 + "| Status"
    Print Renderer.PadRight with "Alice" and 15 + "| Active"
    Print Renderer.PadRight with "Bob 🎮" and 15 + "| Offline"
    # Output (terminals that draw emoji two columns wide may shift the last row by one column):
    # Name           | Status
    # Alice          | Active
    # Bob 🎮         | Offline
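Grapheme count and on-screen width are related but not identical, since many terminals draw emoji and CJK characters two columns wide. The Python sketch below estimates column width with the standard `unicodedata.east_asian_width` function and pads by columns instead. `display_width` and `pad_right` are hypothetical helpers, and the estimate is rough: it over-counts ZWJ sequences, for example.

```python
import unicodedata

def display_width(text):
    """Approximate terminal columns for text.

    Wide/fullwidth code points count as 2; combining marks, variation
    selectors and format characters such as ZWJ count as 0; the rest as 1.
    """
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

def pad_right(text, width):
    """Pad with spaces until the estimated column width reaches width."""
    return text + " " * max(0, width - display_width(text))

for name, status in [("Name", "Status"), ("Alice", "Active"), ("Bob \U0001F3AE", "Offline")]:
    print(pad_right(name, 15) + "| " + status)
```

Whether the columns actually line up still depends on the font and terminal, so treat this as an estimate.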
Common Emoji Patterns
Simple Emoji
Story:
    # Simple emoji (one code point, or one plus a variation selector)
    Smile is "😀" # 1 grapheme ✓
    Heart is "❤️" # 1 grapheme ✓
    Star is "⭐" # 1 grapheme ✓
    Total is Smile.Length + Heart.Length + Star.Length
    Print Total # 3 ✓
Skin Tone Modifiers
Story:
    # Emoji + skin tone = 1 grapheme
    WaveLight is "👋🏻" # 1 grapheme ✓
    WaveDark is "👋🏿" # 1 grapheme ✓
    Print WaveLight.Length # 1 ✓
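Under the hood, a skin-toned emoji is a base emoji followed by a modifier code point in the range U+1F3FB to U+1F3FF. This Python sketch (illustrative names, standard library only) lists the code points and recovers the base emoji.

```python
import unicodedata

wave_light = "\U0001F44B\U0001F3FB"   # waving hand + light skin tone modifier

for ch in wave_light:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

def strip_skin_tone(text):
    """Drop emoji modifier code points (U+1F3FB..U+1F3FF)."""
    return "".join(ch for ch in text if not "\U0001F3FB" <= ch <= "\U0001F3FF")

base = strip_skin_tone(wave_light)
print(len(wave_light), len(base))   # 2 1
```

A code point count sees two characters here, while a grapheme count sees one.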
Combined Emoji (ZWJ Sequences)
Story:
    # Zero-Width Joiner combines emoji
    Family is "👨‍👩‍👧‍👦" # 1 grapheme ✓
    Couple is "👨‍❤️‍👨" # 1 grapheme ✓
    FemaleFirefighter is "👩‍🚒" # 1 grapheme ✓
    Total is Family.Length + Couple.Length + FemaleFirefighter.Length
    Print Total # 3 ✓
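A ZWJ sequence is just ordinary emoji glued together with U+200D (ZERO WIDTH JOINER). The Python sketch below builds the family emoji from its parts and takes it apart again; the variable names are illustrative.

```python
ZWJ = "\u200D"

# MAN, WOMAN, GIRL, BOY joined by ZWJ
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
members = family.split(ZWJ)

print(len(family))    # 7 code points: 4 emoji + 3 joiners
print(len(members))   # 4 component emoji
```

Fonts that lack a glyph for the combined sequence typically fall back to drawing the component emoji side by side.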
Flag Emoji
Story:
    # Two regional indicator symbols = 1 flag
    US is "🇺🇸" # 1 grapheme ✓
    UK is "🇬🇧" # 1 grapheme ✓
    Japan is "🇯🇵" # 1 grapheme ✓
    Flags is US + UK + Japan
    Print Flags.Length # 3 ✓
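Each regional indicator symbol corresponds to a letter A to Z, starting at U+1F1E6, so a flag can be derived from a two-letter country code. The following Python sketch shows the mapping; `flag` is a hypothetical helper and does not check that the code is a real country.

```python
REGIONAL_INDICATOR_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code):
    """Map a two-letter code such as "US" to its pair of regional indicators."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )

us = flag("US")
print(us, len(us))   # the flag renders as one grapheme but has 2 code points
```

Note that the United Kingdom flag uses the code "GB", so `flag("GB")` produces the emoji assigned to `UK` above.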
Real-World Examples
Tweet Length Counter
Concept: TweetValidator
    MaxLength is 280
    To IsValid with Text:
        Return Text.Length <= This.MaxLength
    To GetRemainingChars with Text:
        Return This.MaxLength - Text.Length
Story:
    Create TweetValidator Called Validator
    Tweet is "Hello world! 🌍🎉😀"
    If Validator.IsValid with Tweet:
        Remaining is Validator.GetRemainingChars with Tweet
        Print "Valid! " + Remaining + " characters remaining"
    # Counts emoji correctly!
Username Validation
Concept: UsernameValidator
    MinLength is 3
    MaxLength is 20
    To Validate with Username:
        Length is Username.Length
        If Length < This.MinLength:
            Return "Username too short (min " + This.MinLength + " characters)"
        Else If Length > This.MaxLength:
            Return "Username too long (max " + This.MaxLength + " characters)"
        Else:
            Return "Valid"
Story:
    Create UsernameValidator Called Validator
    # All valid - emoji count as 1 character each
    Print Validator.Validate with "Alice" # Valid
    Print Validator.Validate with "Bob🎮" # Valid
    Print Validator.Validate with "游戏玩家" # Valid (Chinese characters)
    Print Validator.Validate with "مستخدم" # Valid (Arabic)
Text Editor
Concept: TextEditor
    Content
    CursorPosition
    To MoveCursorRight:
        If This.CursorPosition < This.Content.Length:
            Set This.CursorPosition to This.CursorPosition + 1
    To MoveCursorLeft:
        If This.CursorPosition > 0:
            Set This.CursorPosition to This.CursorPosition - 1
    To DeleteCharacter:
        # Delete character at cursor
        Before is This.Content.Slice(1, This.CursorPosition - 1)
        After is This.Content.Slice(This.CursorPosition + 1, This.Content.Length)
        Set This.Content to Before + After
        # Deletes entire grapheme cluster, not just one byte!
Story:
    Create TextEditor Called Editor
    Set Editor.Content to "Hello 👨‍👩‍👧‍👦 World"
    Set Editor.CursorPosition to 7
    # Delete the family emoji (grapheme 7)
    Editor.DeleteCharacter
    Print Editor.Content # "Hello  World" ✓ (the spaces on both sides remain)
    # Entire emoji deleted, not corrupted!
Performance
Grapheme clustering is slightly slower than byte counting, because the string has to be scanned for cluster boundaries:
Story:
    # For ASCII-only strings, fast
    Text is "Hello World"
    Length is Text.Length # Fast
    # For Unicode strings, slightly slower
    Text is "Hello 👨‍👩‍👧‍👦 World 🌍"
    Length is Text.Length # Slightly slower (but still fast!)
But correctness > speed. And with JIT compilation, the difference is minimal.
Best Practices
1. Use .Length for Character Count
# Good - grapheme count
Text is "Hello 😀"
Count is Text.Length # 7 (H-e-l-l-o-space-😀)
2. Use .ByteSize for Storage Size
# If you need byte count for storage/network
Text is "Hello 😀"
Bytes is Text.ByteSize # 10 (UTF-8 bytes: 6 for "Hello " + 4 for the emoji)
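When a storage or network limit is expressed in bytes, the byte count has to be checked separately from the grapheme count. The Python sketch below shows one way to cut a string to a UTF-8 byte budget without leaving a partial code point behind. `truncate_utf8` is a hypothetical helper; it protects code points only, so it can still cut a multi-code-point grapheme cluster.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text that fits in max_bytes of UTF-8."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops the incomplete code point left at the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

text = "Hello \U0001F600"
byte_size = len(text.encode("utf-8"))

print(byte_size)                      # 10: 6 ASCII bytes + 4 for the emoji
print(repr(truncate_utf8(text, 8)))   # 'Hello ' (the emoji does not fit)
```
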
3. Slice by Graphemes
# SFX slices by graphemes
Text is "Hello 👋 World"
First7 is Text.Slice(1, 7) # "Hello 👋" (emoji not split)
4. Validate Input by Grapheme Count
# Validate display length, not byte length
Username is "User🎮"
If Username.Length > 20:
    Print "Username too long"
Comparison with Other Languages
| Language | "👨‍👩‍👧‍👦".Length | Method |
|---|---|---|
| SFX | 1 ✓ | Grapheme clusters |
| Python | 7 | Code points |
| JavaScript | 11 | UTF-16 code units |
| Java | 11 | UTF-16 code units |
| Go | 25 | UTF-8 bytes |
| Rust | 25 | UTF-8 bytes (default) |
| Swift | 1 ✓ | Grapheme clusters |
SFX and Swift get it right by default!
Summary
SFX uses grapheme clustering for strings:
- ✓ Human-centric - Counts what humans see as characters
- ✓ Emoji-friendly - 👨‍👩‍👧‍👦 is 1 character, not 7
- ✓ International - Works with all languages (Chinese, Arabic, emoji, etc.)
- ✓ Safe slicing - Won't split multi-byte characters
- ✓ Correct validation - Username/tweet length validation works correctly
When dealing with text in the 21st century, grapheme clustering is essential.
Next: Basic Syntax - Learn SFX syntax fundamentals