Chapter 34: Swift Strings: Unicode & Scalars
1. Why does Unicode matter in Swift?
In older languages (C, C++, early Java), a “character” was usually a fixed-size unit: one byte (ASCII and its 8-bit extensions) or one 16-bit code unit (UTF-16). That worked for English, but broke badly for most of the world's languages.
Swift was designed from day one to handle real-world text correctly:
- Hindi (देवनागरी)
- Telugu (తెలుగు)
- Tamil (தமிழ்)
- Arabic (العربية)
- Chinese (中文)
- Japanese (日本語)
- Korean (한국어)
- Emoji 😊 🚀 🇮🇳 🏳️‍🌈
- Emoji with skin tones and joiner sequences 👨🏻‍💻 👩🏾‍🔬
- Combining characters (é = e + ◌́)
Swift wants every human-visible character to behave as a single unit, not to be broken into pieces.
2. Three important concepts in Swift strings
| Concept | Swift type | What it represents | How many code points? | How many bytes (UTF-8)? | Example | .count value |
|---|---|---|---|---|---|---|
| Grapheme cluster | Character | One thing a human sees as “one character” | 1 or more | 1–many | “😊”, “é”, “न”, “🇮🇳” | — |
| Unicode scalar | UnicodeScalar | One Unicode code point (a number) | exactly 1 | 1–4 bytes | “😊” → U+1F60A | — |
| String | String | Sequence of grapheme clusters | 0 or more | variable | “नमस्ते 😊” | 5 or 6 (depends on the toolchain's Unicode version) |
Key rule (very important):
When you loop over a String or use .count, Swift counts grapheme clusters (what humans see), not Unicode scalars or bytes.
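A quick way to see the rule in action is to ask the same string for its length through each view. The sketch below uses a short sample string chosen purely for illustration (it does not appear elsewhere in this chapter); unicodeScalars, utf16 and utf8 are standard String views.

```swift
// Sample string: "e" + combining acute accent, followed by one emoji.
let sample = "e\u{0301}\u{1F60A}"               // renders as é😊

let characterCount = sample.count               // grapheme clusters
let scalarCount = sample.unicodeScalars.count   // Unicode code points
let utf16Count = sample.utf16.count             // UTF-16 code units
let utf8Count = sample.utf8.count               // UTF-8 bytes

print(characterCount)   // 2
print(scalarCount)      // 3
print(utf16Count)       // 4
print(utf8Count)        // 7
```

Only the first number matches what a reader sees on screen, which is why plain .count is the right default for user-facing lengths.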
3. Examples – let’s see the difference
```swift
let simple = "ABC"
print(simple.count)        // 3

let emoji = "😊"
print(emoji.count)         // 1 ← one visible character

let accented = "é"         // can be written two ways:
let e1 = "\u{00E9}"        // é as U+00E9 (precomposed)
let e2 = "e\u{0301}"       // e + combining acute accent
print(e1.count)            // 1
print(e2.count)            // 1 ← Swift sees them as the same character
print(e1 == e2)            // true
```

```swift
let devanagari = "नमस्ते"
// The virama (्) and the vowel sign (े) attach to their consonants,
// so the count is 3 or 4 depending on the toolchain's Unicode version, never 6.
print(devanagari.count)
print(devanagari.unicodeScalars.count)   // 6 code points
```

```swift
let flag = "🇮🇳"            // India flag = regional indicator I + N
print(flag.count)          // 1
```

```swift
let skinTone = "👨🏾"        // man + medium-dark skin tone modifier
print(skinTone.count)      // 1
```
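To see why the flag and skin-tone examples count as one character each, it can help to compare the character count with the scalar count. A minimal sketch, using \u{...} escapes so the scalar make-up is explicit; the third sample (a technologist emoji joined with the zero-width joiner U+200D) and the two-flag string are extra illustrations, not taken from the examples above.

```swift
// Each sample renders as ONE visible character built from several scalars.
let samples = [
    "\u{1F1EE}\u{1F1F3}",                     // 🇮🇳  regional indicators I + N
    "\u{1F468}\u{1F3FE}",                     // 👨🏾  man + skin tone modifier
    "\u{1F468}\u{1F3FB}\u{200D}\u{1F4BB}",    // man + skin tone + ZWJ + laptop
]

for sample in samples {
    print("\(sample): \(sample.count) character, \(sample.unicodeScalars.count) scalars")
}

// Two flags side by side are still two characters, even though
// they consist of four regional-indicator scalars in a row.
let twoFlags = "\u{1F1EE}\u{1F1F3}\u{1F1EF}\u{1F1F5}"   // 🇮🇳🇯🇵
print(twoFlags.count)   // 2
```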
4. Looping over characters – what you actually get
```swift
let text = "नमस्ते 😊🇮🇳"
for char in text {
    print("→ \(char) (count: \(String(char).count))")
}
```
Output:
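```
→ न (count: 1)
→ म (count: 1)
→ स्ते (count: 1)
→ (space) (count: 1)
→ 😊 (count: 1)
→ 🇮🇳 (count: 1)
```

The virama (्) and the vowel sign (े) never appear on lines of their own: they attach to the consonant before them. Toolchains that predate the Unicode rule for Indic conjuncts print स्ते as two lines (स् and ते) instead of one.
Each item you get in the loop is a Character, that is, one human-visible unit.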
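Because iteration works in Character units, operations built on it keep multi-scalar characters intact. As a small illustration (the sample string is an assumption made for this sketch), reversing a string character by character leaves the accent and the flag whole:

```swift
let word = "ae\u{0301}\u{1F1EE}\u{1F1F3}"     // a, é (e + combining accent), 🇮🇳

// Array(word) yields one element per visible character.
let characters = Array(word)
print(characters.count)                       // 3

// reversed() walks Characters, so the accent stays on "e" and the flag stays whole.
let reversedWord = String(word.reversed())
print(reversedWord == "\u{1F1EE}\u{1F1F3}e\u{0301}a")   // true
```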
5. Unicode Scalars – when you need the raw code points
Sometimes you want to look at the individual Unicode code points (numbers).
```swift
let text = "नमस्ते 😊"
for scalar in text.unicodeScalars {
    print("U+\(String(scalar.value, radix: 16, uppercase: true)) → \(scalar)")
}
```
Output example:
```
U+928 → न
U+92E → म
U+938 → स
U+94D → ्
U+924 → त
U+947 → े
U+20 → (space)
U+1F60A → 😊
```
When do you actually need unicodeScalars?
- Low-level text processing
- Working with certain APIs (some fonts, regex engines)
- Debugging weird combining behavior
- Interoperability with C/Objective-C
99% of the time — you don’t need it.
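For the debugging case in particular, a small helper that lists code points makes invisible differences visible. This is only a sketch: scalarLabels is a hypothetical name introduced here, and the zero-padding to four hex digits is a formatting choice, not something the standard library does for you.

```swift
// Hypothetical helper: lists the code points of a string as "U+XXXX" labels.
func scalarLabels(_ text: String) -> [String] {
    text.unicodeScalars.map { scalar -> String in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        // Left-pad to at least four hex digits, the usual Unicode notation.
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }
}

let precomposed = "\u{00E9}"      // é as a single scalar
let decomposed = "e\u{0301}"      // e + combining acute accent

print(scalarLabels(precomposed))      // ["U+00E9"]
print(scalarLabels(decomposed))       // ["U+0065", "U+0301"]
print(precomposed == decomposed)      // true
```

The two strings compare equal and look identical on screen, yet the scalar view shows they are stored differently, which is exactly the kind of detail this view is for.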
6. Real-Life Examples You Will Actually Write
Example 1 – Safe first/last character
```swift
func firstLetterCapitalized(_ text: String) -> String {
    guard let first = text.first else { return text }
    return String(first).uppercased() + text.dropFirst()
}

print(firstLetterCapitalized("नमस्ते"))   // नमस्ते (unchanged: Devanagari has no letter case)
print(firstLetterCapitalized("hello"))   // Hello
print(firstLetterCapitalized("😊👍"))    // 😊👍
```
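The same approach works at the other end of the string. The sketch below is an illustration under assumed names (removingLastCharacter and the sample status string are hypothetical): last and dropLast() operate on whole Characters, so a multi-scalar emoji is removed in one piece.

```swift
// Hypothetical helper: removes the final visible character, if any.
func removingLastCharacter(_ text: String) -> String {
    String(text.dropLast())
}

// "Done " followed by a woman scientist emoji with a skin tone (4 scalars).
let status = "Done \u{1F469}\u{1F3FE}\u{200D}\u{1F52C}"

if let last = status.last {
    print(last.unicodeScalars.count)   // 4 scalars inside one Character
}

let trimmed = removingLastCharacter(status)
print(trimmed.count)                   // 5: only "Done " is left, the emoji is gone as a whole

let stillEmpty = removingLastCharacter("")
print(stillEmpty.isEmpty)              // true: empty input is safe
```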
Example 2 – Emoji detection (simple version)
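```swift
extension Character {
    /// Simple heuristic: true for characters that normally render as emoji.
    var isEmoji: Bool {
        guard let scalar = unicodeScalars.first else { return false }
        // properties.isEmoji alone is also true for digits, "#" and "*",
        // so require emoji presentation or a multi-scalar sequence.
        return scalar.properties.isEmojiPresentation
            || (scalar.properties.isEmoji && unicodeScalars.count > 1)
    }
}

let text = "Hello 😊 world! 🇮🇳"
for char in text {
    if char.isEmoji {
        print("Emoji found: \(char)")
    }
}
```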
Example 3 – Username sanitization (very common)
```swift
func cleanUsername(_ input: String) -> String {
    // Allow-list: keep only letters, numbers, "_" and ".".
    // Emoji such as 😊, spaces and other symbols are neither letters
    // nor numbers, so they are dropped.
    input.filter { char in
        char.isLetter || char.isNumber || char == "_" || char == "."
    }
}

print(cleanUsername("aarav😊_007"))   // aarav_007
```
7. Quick Summary – Key Points to Remember
| Concept | Swift type / how to count | Count for “नमस्ते 😊” | When to care about it |
|---|---|---|---|
| Visible character | Character (grapheme cluster), text.count | 5 or 6 (depends on the toolchain's Unicode version) | Almost always: UI, validation, length |
| Unicode scalar (code point) | UnicodeScalar, text.unicodeScalars.count | 8 | Low-level processing, fonts, regex |
| Byte (UTF-8) | UInt8 code unit, text.utf8.count | 23 | Network, file size, rarely needed |
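The first and last rows often meet in validation code: a user-facing limit is expressed in visible characters, while a storage or network limit is expressed in bytes. A minimal sketch, assuming a hypothetical fitsLimits helper and made-up limits:

```swift
// Hypothetical helper: checks a visible-character limit and a UTF-8 byte budget.
func fitsLimits(_ text: String, maxCharacters: Int, maxBytes: Int) -> Bool {
    text.count <= maxCharacters && text.utf8.count <= maxBytes
}

let bio = "\u{1F60A}\u{1F60A}\u{1F60A}"   // 😊😊😊: 3 characters, 12 UTF-8 bytes

print(fitsLimits(bio, maxCharacters: 3, maxBytes: 16))   // true
print(fitsLimits(bio, maxCharacters: 3, maxBytes: 8))    // false: 12 bytes exceed the budget
```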
8. Small Practice – Try these
- Print each visible character of “नमस्ते 😊” with its count
- Check if a string contains any emoji
- Take a string “café” (with combining accent) and show it’s equal to “café” (precomposed)
