
Unicode in Friendly Terms: ASCII, UTF-8, and Character Encodings Explained


10xTeam November 14, 2025 8 min read

Let’s dive into the world of Unicode. This article will explore what it is, how it functions, and what it signifies when a programming language is described as “Unicode-aware.” A basic familiarity with programming concepts will be helpful.

Back to Basics: How Computers Store Data

At its core, all data on a computer is stored as bits—a series of zeros and ones—whether in memory (RAM) or on a disk. This is straightforward for a number like 26, which can be converted to its base-2 (binary) equivalent for storage.

But what about the letter ‘D’? Or a Chinese character like ‘你’? Or even the thumbs-up emoji ‘👍’?

The solution is a universally agreed-upon mapping between characters and numerical values.

ASCII: The Simple, But Limited, Ancestor

The most popular and simplest of these mappings is ASCII (American Standard Code for Information Interchange). ASCII maps a set of basic Western characters, numbers, and symbols to numeric values between 0 and 127, allowing it to represent 128 unique characters.

To encode a string like “hello” into ASCII:

  1. Look up the numeric value for each character.
  2. Convert each value to its binary representation.
  3. Concatenate the binary strings.

Each character becomes one byte of data (ASCII values need only seven bits, but they are conventionally stored in full eight-bit bytes). This process is called encoding. Decoding is simply the reverse.
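The three steps above can be sketched in a few lines of Python, using the built-in ord to look up each character's numeric value:

```python
# Encode "hello" the ASCII way: look up each character's numeric value,
# render each value as an 8-bit binary string, and concatenate.
values = [ord(c) for c in "hello"]
bits = "".join(f"{v:08b}" for v in values)

print(values)  # [104, 101, 108, 108, 111]
print(bits)    # 40 bits total: 8 per character
```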

ASCII has a wonderfully simple property: the number of characters in a string is equal to the number of bytes it occupies. In languages like C and C++, the string length function often returns the number of bytes, which for ASCII, conveniently matches the character count.

But this simplicity is also its biggest limitation. What about the tens of thousands of characters in Chinese, or other writing systems like Arabic, Cyrillic, and Devanagari?

Enter Unicode: A Universal Language for Text

After many proposals and iterations, the Unicode standard was created. It is a monumental achievement, encompassing over 100,000 unique characters from more than a hundred languages. To manage this vast collection, along with complex features like accents, emoji modifiers, and other linguistic nuances, Unicode’s structure is more intricate than ASCII’s.

To discuss it properly, we need to refine our terminology.

Deconstructing Unicode: Graphemes, Code Points, and Encodings

1. Grapheme

First, we’ll replace the ambiguous word “character” with grapheme. A grapheme is a single, fundamental unit of a human writing system. Think of it as what you would see on a single Scrabble tile, like ‘D’ or ‘你’.

2. Code Point

In Unicode, a grapheme is represented by one or more code points. A code point is a unique numerical value assigned to a character or modifier.

  • The graphemes ‘D’ and ‘你’ are each represented by a single code point.
  • More complex graphemes, like ‘é’, can be represented in two ways:
    • By a single, pre-composed code point: LATIN SMALL LETTER E WITH ACUTE.
    • By combining two code points: the base letter e followed by a COMBINING ACUTE ACCENT modifier.
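In Python, both representations of ‘é’ can be written as escape sequences, and the standard unicodedata module can convert between them; a small sketch:

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (one code point)
combined = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT (two code points)

print(len(precomposed), len(combined))  # 1 2
print(precomposed == combined)          # False: different code point sequences
# NFC normalization composes the pair into the single pre-composed code point
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
```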

3. Encoding

Once we have a list of code points for a string, we must convert them into a binary representation. This is encoding. Unlike ASCII’s single, one-byte-per-character strategy, Unicode offers several encoding schemes, each with its own trade-offs.

UTF-32: The Simple but Wasteful Scheme

UTF-32 takes each code point’s value and encodes it using four bytes (32 bits).

  • Pro: Every code point has the same size. This makes indexing predictable—the first code point is at byte 0, the second at byte 4, and so on.
  • Con: It’s incredibly wasteful. The string “hello world” encoded in UTF-32 takes up four times more space than in ASCII. Both common, low-value code points and rare, high-value ones consume the same four bytes.
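The waste is easy to measure with Python's built-in codecs (the "utf-32-be" variant is used here because it avoids prepending a byte-order mark):

```python
text = "hello world"
print(len(text.encode("ascii")))      # 11 bytes in ASCII: 1 per character
print(len(text.encode("utf-32-be")))  # 44 bytes in UTF-32: 4 per code point
```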

UTF-8: The Smart and Dominant Scheme

To address this wastefulness, UTF-8 was developed. It’s a variable-width encoding scheme that maps each code point to between one and four bytes.

  • Code points with lower numerical values (which are more common) use just one byte.
  • Higher-value code points take two, three, or four bytes.

The genius of UTF-8 lies in its backward compatibility with ASCII. Simple Western graphemes have the same code point values in Unicode as they do in ASCII, and UTF-8 encodes these low-value code points into a single byte—the exact same byte as their ASCII representation. This means legacy ASCII-based systems can often read and process simple UTF-8 text without even knowing it’s not ASCII.

  • Downside: Because code points have unequal byte sizes, indexing directly into the byte stream is difficult. This has a minor performance impact, but it’s a trade-off most of the world has accepted. Today, UTF-8 is the most widely adopted encoding for Unicode.
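Both properties are easy to verify in Python: low code points encode to a single byte, identical to their ASCII bytes, while higher code points take two, three, or four:

```python
# Byte widths grow with the code point value
for ch in ["D", "é", "你", "👍"]:
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3, 4 bytes respectively

# Backward compatibility: pure-ASCII text has identical bytes in both encodings
print("hello".encode("ascii") == "hello".encode("utf-8"))  # True
```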

A Note on Fairness: You might notice that English and other Western languages, with their low code point values, get the most efficient storage in UTF-8. Is this unfair? Yes, it is. This favoritism is a historical artifact from the US and UK’s early dominance in computing. Interestingly, even the source code for a webpage written in Arabic might be predominantly English due to all the HTML markup (<html>, <p>, etc.). It’s a trade-off: a system that doesn’t favor any language, like UTF-32, wastes a significant amount of storage.

The Developer’s Dilemma: Unicode-Aware vs. Unaware

The big takeaway is that Unicode data is far more complex than ASCII data.

  • In ASCII: 1 grapheme = 1 byte.
  • In Unicode: 1 grapheme ≠ 1 code point ≠ 1 byte.

Attempting to read UTF-8 data as if it were ASCII (or any other mismatched encoding) results in garbled, unreadable output, sometimes called mojibake.
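Mojibake is simple to reproduce: encode text as UTF-8, then decode the bytes with a mismatched single-byte encoding such as Latin-1:

```python
data = "café".encode("utf-8")  # b'caf\xc3\xa9'
# The two UTF-8 bytes of 'é' are misread as two separate Latin-1 characters
print(data.decode("latin-1"))  # 'cafÃ©'
```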

This brings us to a critical concept for developers: Unicode-aware libraries. In many languages, older string manipulation functions operate only on bytes.

Unicode-Unaware Functions

Consider this example in Python 3, where a bytes object plays the same role as Python 2’s native byte-based str:

# Python 3: a bytes object is a simple byte array
my_string = '👍'.encode('utf-8') # A single grapheme, stored as raw bytes

# A Unicode-unaware length counts the underlying bytes
len(my_string)
# Returns 4, because the thumbs-up emoji is 4 bytes in UTF-8

# Trying to get the "first character" gives you a meaningless byte
my_string[0]
# Returns 240 (0xF0), the first of four bytes, not the emoji

These functions are Unicode-unaware. They operate on bytes and have no understanding of the graphemes or code points they represent.

Unicode-Aware Functions

A Unicode-aware string type treats the string as a sequence of code points.

# Conceptual example with a Unicode-aware string
my_unicode_string = u'👍'

# A Unicode-aware function sees code points
len(my_unicode_string)
# Returns 1, the number of code points

This is a huge improvement. But what about graphemes made of multiple code points?

Grapheme-Aware Functions

Let’s take the thumbs-up emoji with a skin tone modifier (the modifiers are real code points, U+1F3FB through U+1F3FF). This is one grapheme that a user sees, but it’s composed of two code points.

# THUMBS UP SIGN followed by EMOJI MODIFIER FITZPATRICK TYPE-4
complex_emoji = '\U0001F44D\U0001F3FD' # Two code points, one grapheme

# A Unicode-aware function still sees code points
len(complex_emoji)
# Returns 2, the number of code points

# A Unicode-unaware function sees bytes
len(complex_emoji.encode('utf-8'))
# Returns 8 (4 bytes for each code point)

To correctly identify this as a single user-perceived character, you need a grapheme-aware library.
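In Python that usually means a dedicated package (for example the third-party regex or grapheme libraries). As a rough illustration only, here is a toy counter that treats combining marks and emoji skin tone modifiers as part of the preceding grapheme; it is a simplification, not a real UAX #29 implementation:

```python
import unicodedata

def toy_grapheme_count(s: str) -> int:
    """Rough grapheme count: skip combining marks and emoji skin tone
    modifiers (U+1F3FB..U+1F3FF). Real code should use a UAX #29 library."""
    count = 0
    for ch in s:
        if unicodedata.combining(ch) or 0x1F3FB <= ord(ch) <= 0x1F3FF:
            continue  # fold modifiers into the previous grapheme
        count += 1
    return count

print(toy_grapheme_count("e\u0301"))               # 1: 'e' plus its accent
print(toy_grapheme_count("\U0001F44D\U0001F3FD"))  # 1: emoji plus modifier
```

A real grapheme library handles many more cases (flags, ZWJ sequences, Hangul jamo), which is why the rule of thumb is to reach for one rather than roll your own.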

The Dangers of Naive String Manipulation

This isn’t just academic. Naively manipulating string bytes can corrupt data. Imagine an ellipsis function that truncates text.

  • A Unicode-unaware function might slice the string in the middle of a multi-byte code point, creating invalid data.
  • A Unicode-aware function is better but could still slice between a base character and its combining modifier (like separating the ‘e’ from its accent).
  • A grapheme-aware function is the only way to guarantee you are truncating based on what the user actually perceives as characters.
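The first failure mode is easy to demonstrate in Python: slicing the raw bytes mid-code-point produces invalid UTF-8:

```python
text = "👍 up"
raw = text.encode("utf-8")  # 7 bytes: 4 for the emoji, then ' ', 'u', 'p'

truncated = raw[:2]  # cuts through the middle of the emoji's 4 bytes
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("byte-level truncation produced invalid UTF-8")
```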

Rules of Thumb for Developers

  1. Use Unicode-unaware functions only if you are 100% certain your data is pure ASCII.
  2. Use Unicode-aware functions for most programmatic string manipulation, as they correctly handle code points.
  3. Use grapheme-aware functions/libraries when dealing with user-perceived characters, especially for UI tasks like counting, slicing, or truncating.

As a final stop, consider the official Unicode information for the thumbs-up emoji. It has a name (THUMBS UP SIGN), a code point value (U+1F44D or 128,077 in decimal), and defined encodings in UTF-8, UTF-16, and UTF-32. Every character you see on screen has a similar story.
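Python's standard unicodedata module can confirm each of these details:

```python
import unicodedata

ch = "\U0001F44D"
print(unicodedata.name(ch))   # THUMBS UP SIGN
print(ord(ch), hex(ord(ch)))  # 128077 0x1f44d
print(ch.encode("utf-8"))     # b'\xf0\x9f\x91\x8d' (4 bytes)
print(ch.encode("utf-16-be")) # 4 bytes: a surrogate pair
print(ch.encode("utf-32-be")) # 4 bytes holding the code point value directly
```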

Hopefully, the world of Unicode now makes a lot more sense.


