Grapheme Clustering

The Emoji Problem

Quick question: How many characters is this emoji?

👨‍👩‍👧‍👦

Most programming languages say:

# Python
emoji = "👨‍👩‍👧‍👦"
len(emoji)  # 7 ❌

// JavaScript
let emoji = "👨‍👩‍👧‍👦";
emoji.length;  // 11 ❌

// Java
String emoji = "👨‍👩‍👧‍👦";
emoji.length();  // 11 ❌

But it’s ONE emoji! The “family” emoji is a single visual unit.

SFX gets it right:

Story:
    Family is "👨‍👩‍👧‍👦"
    Length is Family.Length  # 1 ✓

What Are Grapheme Clusters?

A grapheme cluster is what humans perceive as a single character:

  • a - Simple character (1 grapheme)
  • é - Can be one character OR e + combining accent (still 1 grapheme)
  • 👨‍👩‍👧‍👦 - Multiple Unicode code points (but 1 grapheme)
  • 🇺🇸 - Two code points (but 1 flag = 1 grapheme)
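The é case can be checked outside SFX as well. The following is a minimal Python sketch (standard library only, not SFX code) showing that the two representations differ at the code point level and that NFC normalization maps one onto the other. The variable names are purely illustrative.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

print(len(composed))           # 1 code point
print(len(decomposed))         # 2 code points
print(composed == decomposed)  # False: the code point sequences differ

# NFC normalization composes the pair into the precomposed form.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)         # True
```

Both strings render as the same single character, which is why a grapheme-based length reports 1 for either form.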

Technical Details (Optional)

Unicode text can be measured at three levels:

  1. Code points - Individual Unicode values (U+0041, U+1F600, etc.)
  2. Code units - UTF-8 bytes or 16-bit UTF-16 units
  3. Grapheme clusters - What humans see as characters ✓

Most languages count code points or code units. SFX counts grapheme clusters.
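To make the three levels concrete, here is a small Python sketch (standard library only) that measures the family emoji in code points, UTF-16 code units, and UTF-8 bytes. The emoji is written with escapes so the individual code points are visible. The variable names are illustrative.

```python
# 👨‍👩‍👧‍👦 = MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)                           # Python counts code points
utf16_units = len(family.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 unit
utf8_bytes = len(family.encode("utf-8"))

print(code_points)  # 7  (4 emoji + 3 zero-width joiners)
print(utf16_units)  # 11 (4 surrogate pairs + 3 joiners)
print(utf8_bytes)   # 25 (4 * 4 bytes + 3 * 3 bytes)

# List the individual code points that make up the single grapheme.
for ch in family:
    print(f"U+{ord(ch):04X}")
```

These are the same 7, 11, and 25 that Python, JavaScript/Java, and Go/Rust report for this string.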

Why This Matters

1. String Length

Story:
    # Simple ASCII
    Name is "Alice"
    Print Name.Length  # 5 ✓

    # Emoji
    Emoji is "🎉"
    Print Emoji.Length  # 1 ✓

    # Complex emoji
    Family is "👨‍👩‍👧‍👦"
    Print Family.Length  # 1 ✓

    # Flag emoji
    Flag is "🇺🇸"
    Print Flag.Length  # 1 ✓

    # Combining diacritics
    Accented is "é"  # Can be one code point (U+00E9) or e + U+0301
    Print Accented.Length  # 1 ✓ (regardless of representation)

Compare to other languages:

# Python counts code points
len("🎉")           # 1 (lucky - simple emoji)
len("👨‍👩‍👧‍👦")  # 7 ❌ (complex emoji)
len("🇺🇸")          # 2 ❌ (flag is 2 code points)

2. String Slicing

Story:
    Text is "Hello 👋 World 🌍"

    # Get first 7 graphemes
    Slice is Text.Slice(1, 7)
    Print Slice  # "Hello 👋" ✓ (emoji counts as 1)

Other languages:

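# Python
text = "Hello 👋 World 🌍"
slice = text[0:7]  # "Hello 👋" here, but only because 👋 is a single code point
# A ZWJ sequence, flag, or skin-tone emoji at the cut would be split apart! 💥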
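The failure mode is easier to see with a multi-code-point emoji. The sketch below (plain Python, standard library only) slices a string containing the family emoji first by code point and then by raw UTF-8 bytes. The string and variable names are made up for illustration.

```python
# "Hi 👨‍👩‍👧‍👦!" with the family emoji spelled out as code points
text = "Hi \U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466!"

# Slicing by code point keeps only the first member of the family.
head = text[0:4]
print(head == "Hi \U0001F468")  # True: the result is "Hi 👨"

# Slicing raw UTF-8 bytes can cut inside a single code point.
raw = text.encode("utf-8")[0:5]  # "Hi " (3 bytes) + 2 of the 4 bytes of U+1F468
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False: the truncated bytes are not valid UTF-8
```

Code-point slicing silently changes the emoji, while byte slicing produces invalid data. Slicing by grapheme avoids both outcomes.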

3. Text Truncation

Concept: TextTruncator
    To Truncate with Text and MaxLength:
        If Text.Length <= MaxLength:
            Return Text
        Else:
            Truncated is Text.Slice(1, MaxLength - 3)
            Return Truncated + "..."

Story:
    Create TextTruncator Called T

    Short is "Hello"
    Print T.Truncate with Short and 10  # "Hello"

    Long is "Hello 👋 World 🌍 Everyone 🎉"
    Print T.Truncate with Long and 15  # "Hello 👋 Worl..." ✓ (12 graphemes + "...")
    # Emoji not split!

4. Character Validation

Story:
    # Validate username length
    Username is "Alice🎮"
    MinLength is 3
    MaxLength is 20

    If Username.Length >= MinLength and Username.Length <= MaxLength:
        Print "Valid username"  # This prints ✓
        # "Alice🎮" is 6 graphemes (A-l-i-c-e-🎮)

5. Display Width

Concept: TextRenderer
    To PadRight with Text and Width:
        # Pad with spaces to reach width
        CurrentLength is Text.Length
        If CurrentLength >= Width:
            Return Text
        Else:
            Padding is Width - CurrentLength
            Spaces is ""
            Repeat Padding times:
                Spaces is Spaces + " "
            Return Text + Spaces

Story:
    Create TextRenderer Called Renderer

    # Every row is padded to 15 graphemes, even with emoji
    Print Renderer.PadRight with "Name" and 15 + "| Status"
    Print Renderer.PadRight with "Alice" and 15 + "| Active"
    Print Renderer.PadRight with "Bob 🎮" and 15 + "| Offline"
    # Output (terminals that draw 🎮 two columns wide shift the last row by one):
    # Name           | Status
    # Alice          | Active
    # Bob 🎮          | Offline
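Grapheme count and on-screen column width are related but not identical: many terminals draw emoji such as 🎮 two columns wide. The following Python sketch estimates column width with the standard library's East Asian Width data. It is a rough approximation, not a full wcwidth implementation, and `display_width` and `pad_right` are hypothetical helper names, not SFX APIs.

```python
import unicodedata

def display_width(text: str) -> int:
    """Rough terminal column estimate for a string."""
    width = 0
    for ch in text:
        # Combining marks, ZWJ and VS16 occupy no column of their own.
        if unicodedata.combining(ch) or ch in ("\u200d", "\ufe0f"):
            continue
        # Wide and fullwidth characters (most emoji, CJK) take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

def pad_right(text: str, columns: int) -> str:
    """Pad with spaces until the estimated width reaches `columns`."""
    return text + " " * max(0, columns - display_width(text))

for name, status in [("Name", "Status"), ("Alice", "Active"), ("Bob \U0001F3AE", "Offline")]:
    print(pad_right(name, 15) + "| " + status)
```

This sketch counts every wide code point separately, so a ZWJ sequence such as the family emoji would be overestimated. A more careful version would measure per grapheme cluster.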

Common Emoji Patterns

Simple Emoji

Story:
    # Simple emoji (one code point, optionally followed by a variation selector)
    Smile is "😀"     # 1 grapheme ✓
    Heart is "❤️"     # 1 grapheme ✓ (U+2764 + U+FE0F)
    Star is "⭐"      # 1 grapheme ✓

    Total is Smile.Length + Heart.Length + Star.Length
    Print Total  # 3 ✓

Skin Tone Modifiers

Story:
    # Emoji + skin tone = 1 grapheme
    WaveLight is "👋🏻"   # 1 grapheme ✓
    WaveDark is "👋🏿"    # 1 grapheme ✓

    Print WaveLight.Length  # 1 ✓

Combined Emoji (ZWJ Sequences)

Story:
    # Zero-Width Joiner combines emoji
    Family is "👨‍👩‍👧‍👦"        # 1 grapheme ✓
    Couple is "👨‍❤️‍👨"         # 1 grapheme ✓
    FemaleFirefighter is "👩‍🚒"  # 1 grapheme ✓

    Total is Family.Length + Couple.Length + FemaleFirefighter.Length
    Print Total  # 3 ✓

Flag Emoji

Story:
    # Two regional indicator symbols = 1 flag
    US is "🇺🇸"      # 1 grapheme ✓
    UK is "🇬🇧"      # 1 grapheme ✓
    Japan is "🇯🇵"   # 1 grapheme ✓

    Flags is US + UK + Japan
    Print Flags.Length  # 3 ✓
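The four patterns in this section (simple emoji, skin tone modifiers, ZWJ sequences, and flags) can be approximated with a handful of joining rules. The Python sketch below is a deliberately simplified illustration of the idea, not SFX's implementation and not a complete implementation of the Unicode segmentation rules (for example, it ignores Hangul syllables and CR LF). The `graphemes` function and its helpers are hypothetical names.

```python
import unicodedata

ZWJ = "\u200d"

def _is_extender(ch):
    """Code points that attach to the preceding cluster."""
    return (
        ch == ZWJ
        or unicodedata.category(ch).startswith("M")  # combining marks, incl. U+FE0F
        or 0x1F3FB <= ord(ch) <= 0x1F3FF             # skin tone modifiers
    )

def _is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def graphemes(text):
    """Split text into approximate grapheme clusters."""
    clusters = []
    for ch in text:
        if clusters:
            current = clusters[-1]
            joins = (
                _is_extender(ch)
                or current[-1] == ZWJ  # simplification: anything after a ZWJ joins
                or (_is_regional_indicator(ch)
                    and len(current) == 1
                    and _is_regional_indicator(current))  # pair flags two by two
            )
            if joins:
                clusters[-1] = current + ch
                continue
        clusters.append(ch)
    return clusters

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
flags = "\U0001F1FA\U0001F1F8\U0001F1EC\U0001F1E7\U0001F1EF\U0001F1F5"  # US, UK, Japan
print(len(graphemes(family)))              # 1
print(len(graphemes(flags)))               # 3
print(len(graphemes("\U0001F44B\U0001F3FB")))  # 1 (wave + light skin tone)
print(len(graphemes("\u2764\ufe0f")))      # 1 (heart + variation selector)
```

Production code would normally rely on a library that tracks the current Unicode segmentation tables rather than hand-written rules like these.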

Real-World Examples

Tweet Length Counter

Concept: TweetValidator
    MaxLength is 280

    To IsValid with Text:
        Return Text.Length <= This.MaxLength

    To GetRemainingChars with Text:
        Return This.MaxLength - Text.Length

Story:
    Create TweetValidator Called Validator

    Tweet is "Hello world! 👋🌍🎉"

    If Validator.IsValid with Tweet:
        Remaining is Validator.GetRemainingChars with Tweet
        Print "Valid! " + Remaining + " characters remaining"
        # Counts emoji correctly!

Username Validation

Concept: UsernameValidator
    MinLength is 3
    MaxLength is 20

    To Validate with Username:
        Length is Username.Length

        If Length < This.MinLength:
            Return "Username too short (min " + This.MinLength + " characters)"
        Else If Length > This.MaxLength:
            Return "Username too long (max " + This.MaxLength + " characters)"
        Else:
            Return "Valid"

Story:
    Create UsernameValidator Called Validator

    # All valid - emoji count as 1 character each
    Print Validator.Validate with "Alice"      # Valid
    Print Validator.Validate with "Bob🎮"      # Valid
    Print Validator.Validate with "游戏玩家"    # Valid (Chinese characters)
    Print Validator.Validate with "مستخدم"    # Valid (Arabic)

Text Editor

Concept: TextEditor
    Content
    CursorPosition

    To MoveCursorRight:
        If This.CursorPosition < This.Content.Length:
            Set This.CursorPosition to This.CursorPosition + 1

    To MoveCursorLeft:
        If This.CursorPosition > 0:
            Set This.CursorPosition to This.CursorPosition - 1

    To DeleteCharacter:
        # Delete character at cursor
        Before is This.Content.Slice(1, This.CursorPosition - 1)
        After is This.Content.Slice(This.CursorPosition + 1, This.Content.Length)
        Set This.Content to Before + After
        # Deletes entire grapheme cluster, not just one byte!

Story:
    Create TextEditor Called Editor
    Set Editor.Content to "Hello 👨‍👩‍👧‍👦 World"
    Set Editor.CursorPosition to 7

    # Delete the family emoji
    Editor.DeleteCharacter
    Print Editor.Content  # "Hello  World" ✓ (the spaces on both sides of the emoji remain)
    # Entire emoji deleted, not corrupted!

Performance

Grapheme clustering is slightly slower than byte counting:

Story:
    # For ASCII-only strings, fast
    Text is "Hello World"
    Length is Text.Length  # Fast

    # For Unicode strings, slightly slower
    Text is "Hello 👨‍👩‍👧‍👦 World 🌍"
    Length is Text.Length  # Slightly slower (but still fast!)

Correctness matters more than raw speed here, and with JIT compilation the difference is minimal.

Best Practices

1. Use .Length for Character Count

# Good - grapheme count
Text is "Hello 👋"
Count is Text.Length  # 7 (H-e-l-l-o-space-👋)

2. Use .ByteSize for Storage Size

# If you need byte count for storage/network
Text is "Hello 👋"
Bytes is Text.ByteSize  # 10 in UTF-8 (6 ASCII bytes + 4 bytes for 👋), not 7

3. Slice by Graphemes

# SFX slices by graphemes
Text is "Hello 👋 World"
First7 is Text.Slice(1, 7)  # "Hello 👋" (emoji not split)

4. Validate Input by Grapheme Count

# Validate display length, not byte length
Username is "User🎮"
If Username.Length > 20:
    Print "Username too long"

Comparison with Other Languages

Language     "👨‍👩‍👧‍👦".Length   Method
SFX          1 ✓                Grapheme clusters
Python       7                  Code points
JavaScript   11                 UTF-16 code units
Java         11                 UTF-16 code units
Go           25                 UTF-8 bytes
Rust         25                 UTF-8 bytes (default)
Swift        1 ✓                Grapheme clusters

SFX and Swift get it right by default!

Summary

SFX uses grapheme clustering for strings:

✓ Human-centric - Counts what humans see as characters
✓ Emoji-friendly - 👨‍👩‍👧‍👦 is 1 character, not 7
✓ International - Works across scripts (Chinese, Arabic, emoji, etc.)
✓ Safe slicing - Won’t split a grapheme cluster
✓ Correct validation - Username and tweet length checks count what the user sees

When dealing with text in the 21st century, grapheme clustering is essential.


Next: Basic Syntax - Learn SFX syntax fundamentals