
Character Encodings Reference

A reference of character encodings.


Key Concepts

  • Character set - maps characters to numbers. e.g. A = 65. Unicode is a character set.
  • Encoding - rules for storing those numbers as bytes. e.g. 65 → 0x41. UTF-8 is an encoding.
  • Code point - a single character entry in a character set. Written as U+XXXX in Unicode.
  • Code unit - the atomic unit of storage for an encoding. UTF-8 = 8-bit units. UTF-16 = 16-bit units.
  • BOM (Byte Order Mark) - U+FEFF placed at the start of a file to signal byte order. Common in UTF-16, where it is needed whenever the byte order is not declared some other way. Unnecessary and often harmful in UTF-8.
  • Surrogate pair - two UTF-16 code units used together to encode a code point outside the BMP (U+10000+).
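
The terms above can be seen side by side in a short Python sketch. The sample string is an arbitrary choice for illustration; it pairs one ASCII character with one character outside the BMP.

```python
# Illustrative only: one string viewed as code points, UTF-8 bytes and UTF-16 code units.
text = "A😀"

# Code points: the numbers the character set assigns.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# UTF-8 uses 8-bit code units, so one code unit is one byte.
utf8_bytes = text.encode("utf-8")

# UTF-16 uses 16-bit code units; "utf-16-be" fixes the byte order and writes no BOM.
utf16_bytes = text.encode("utf-16-be")
utf16_units = [utf16_bytes[i:i + 2].hex().upper() for i in range(0, len(utf16_bytes), 2)]

print(code_points)       # ['U+0041', 'U+1F600']
print(len(utf8_bytes))   # 5  (1 byte for A, 4 bytes for the emoji)
print(utf16_units)       # ['0041', 'D83D', 'DE00']  (the last two are a surrogate pair)
```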

ASCII

128 characters, 0–127, 7 bits, stored as 1 byte.

Range     Contents
0–31      Control characters (\n, \t, \r, null, etc.)
32–126    Printable: A–Z, a–z, 0–9, punctuation, space
127       Delete

Char     Dec   Hex    Binary
A        65    0x41   0100 0001
a        97    0x61   0110 0001
0        48    0x30   0011 0000
space    32    0x20   0010 0000
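
The rows in the table can be reproduced with Python's built-in `ord` and format specifiers. `ascii_row` is a hypothetical helper name used only for this sketch.

```python
def ascii_row(ch: str) -> tuple:
    """Return (char, decimal, hex, binary) for a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not ASCII")
    bits = f"{code:08b}"  # zero-padded to 8 bits, as stored in one byte
    return ch, code, f"0x{code:02X}", f"{bits[:4]} {bits[4:]}"

for ch in "Aa0 ":
    print(ascii_row(ch))
```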

Extended ASCII

8th bit used to add 128 more characters (slots 128-255). Not standardised - each group defined their own:

Variant                  Coverage
ISO 8859-1 (Latin-1)     Western European: é, ñ, ü
Windows-1252 (CP1252)    Same as Latin-1 for 160–255, different at 128–159
Shift-JIS                Japanese (multi-byte)
GB2312                   Simplified Chinese (multi-byte)

Bytes 0–127 are consistent across all variants. Bytes 128–255 are encoding-specific. This is the root of most legacy encoding corruption.
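
A small sketch of that divergence: the same three bytes decoded under two single-byte encodings. The byte values are chosen for illustration only.

```python
# 0x41 is ASCII, 0x80 sits in the 128-159 range where CP1252 and Latin-1 differ,
# and 0xE9 sits in the 160-255 range where they agree.
raw = bytes([0x41, 0x80, 0xE9])

as_cp1252 = raw.decode("cp1252")    # 0x80 is the euro sign in Windows-1252
as_latin1 = raw.decode("latin-1")   # 0x80 is an invisible C1 control character in Latin-1

print(repr(as_cp1252))   # 'A€é'
print(repr(as_latin1))   # 'A\x80é'
```

The first and last characters match in both results; only the middle byte changes meaning.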


Unicode

One universal character set for every writing system.

  • ~150,000 assigned characters in a code space of U+0000–U+10FFFF (1,114,112 code points)
  • Divided into 17 planes of 65,536 code points each
  • Plane 0 = Basic Multilingual Plane (BMP): U+0000–U+FFFF, covers most everyday characters
  • Planes 1–16: emoji, historic scripts, rarely used symbols

Code point   Char   Notes
U+0041       A      = ASCII 65
U+00E9       é      Latin small letter e with acute
U+4E2D       中     Chinese “middle”
U+1F600      😀     Plane 1, needs 4 bytes in UTF-8
U+FEFF       BOM    Byte Order Mark

Unicode defines the code points. UTF-8, UTF-16, and UTF-32 define how they’re stored as bytes.
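
Because each plane holds 65,536 (2^16) code points, the plane number is just the code point shifted right by 16 bits. A minimal sketch, with `plane` as a hypothetical helper name:

```python
def plane(ch: str) -> int:
    """Return the Unicode plane (0-16) of a single character."""
    return ord(ch) >> 16  # same as ord(ch) // 65536

for ch in "Aé中😀":
    print(f"U+{ord(ch):04X} is in plane {plane(ch)}")
```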


UTF-32

Fixed-width: every code point = 4 bytes, always.

Code point   Bytes
U+0041       00 00 00 41
U+1F600      00 01 F6 00

  • nth character always at byte offset n × 4 - simple to index
  • 4× larger than ASCII for English text
  • Rarely used outside of internal runtime representations
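
The fixed-width indexing can be sketched in Python as follows. The sample text and the helper name `code_point_at` are assumptions made for this example.

```python
text = "A😀é"
data = text.encode("utf-32-be")  # explicit byte order, so no BOM is written

def code_point_at(buf: bytes, n: int) -> str:
    """Return the nth code point (counting from 0) by slicing at byte offset n * 4."""
    return buf[n * 4:n * 4 + 4].decode("utf-32-be")

print(len(data))              # 12  (3 code points x 4 bytes)
print(data[0:4].hex(" "))     # 00 00 00 41
print(data[4:8].hex(" "))     # 00 01 f6 00
print(code_point_at(data, 2)) # é
```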

UTF-16

Variable-width: 2 bytes for BMP, 4 bytes (surrogate pair) for everything else.

Range                 Encoding
U+0000–U+FFFF (BMP)   2 bytes directly
U+10000–U+10FFFF      Surrogate pair: high (0xD800–0xDBFF) + low (0xDC00–0xDFFF)

Used internally by: Java (char = 1 UTF-16 code unit), JavaScript/V8, .NET/C#, Windows APIs.
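
The surrogate pair arithmetic is short enough to sketch by hand: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. `to_surrogates` is a hypothetical name for this illustration.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above the BMP into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")   # D83D DE00
```

The result matches what Python's own codec produces for `"😀".encode("utf-16-be")`.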

BOM in UTF-16

BOM bytes   Meaning
FE FF       UTF-16 BE (big-endian)
FF FE       UTF-16 LE (little-endian)

UTF-8 BOM = EF BB BF. Not needed, but some Windows tools write it anyway. Can silently break parsers that don’t expect it.
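
A minimal sketch of BOM sniffing for the three signatures mentioned here, using the constants from Python's `codecs` module. `sniff_bom` and its return labels are assumptions for this example. It deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) would be misreported as UTF-16 LE.

```python
import codecs
from typing import Optional

# Signatures covered by this sketch only; UTF-32 BOMs are not handled.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

print(sniff_bom(b"\xef\xbb\xbfid,name"))  # UTF-8
print(sniff_bom(b"id,name"))              # None
```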


UTF-8

Variable-width: 1–4 bytes. Dominant encoding on the web (>98% of pages).

Code point range     Bytes   Byte pattern
U+0000–U+007F        1       0xxxxxxx
U+0080–U+07FF        2       110xxxxx 10xxxxxx
U+0800–U+FFFF        3       1110xxxx 10xxxxxx 10xxxxxx
U+10000–U+10FFFF     4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

  • ASCII-compatible - code points 0–127 are identical to ASCII bytes. Any ASCII file is valid UTF-8.
  • Self-synchronising - byte prefix tells you the role: 0 = 1-byte char, 110 = 2-byte start, 1110 = 3-byte start, 11110 = 4-byte start, 10 = continuation byte.
  • No BOM needed - no endianness issue.
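
The prefix rule from the self-synchronising bullet can be written as a handful of bit masks. This is a sketch that looks at prefix bits only; `byte_role` is a hypothetical name, and a real validator would also reject bytes such as 0xC0, 0xC1 and 0xF5–0xF7, which match a start prefix but never occur in valid UTF-8.

```python
def byte_role(b: int) -> str:
    """Classify one byte (0-255) by its leading bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "1-byte char"
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"
    if (b & 0b1110_0000) == 0b1100_0000:
        return "2-byte start"
    if (b & 0b1111_0000) == 0b1110_0000:
        return "3-byte start"
    if (b & 0b1111_1000) == 0b1111_0000:
        return "4-byte start"
    return "invalid"

for b in "é😀".encode("utf-8"):
    print(f"0x{b:02X} {byte_role(b)}")
```

Because every byte announces its own role, a decoder dropped into the middle of a stream can skip continuation bytes until it reaches the next start byte.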

Encoding Example: é (U+00E9)

U+00E9 = binary 000 1110 1001 (11 bits). Falls in U+0080–U+07FF → 2-byte pattern 110xxxxx 10xxxxxx.

Fill in bits: 1100 0011 1010 1001 → 0xC3 0xA9
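
The same bit-filling can be sketched for all four lengths. `utf8_encode` is a hypothetical function written for illustration; it skips checks a real encoder needs (for example, it does not reject surrogate code points U+D800–U+DFFF), so in practice `str.encode("utf-8")` is the thing to use.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the UTF-8 byte patterns."""
    if code_point < 0x80:          # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")

print(utf8_encode(0x00E9).hex(" "))   # c3 a9
```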

Why Mixing Latin-1 and UTF-8 Produces HÃ©llo

é in Latin-1 = single byte 0xE9. Read as UTF-8, 0xE9 = 1110 1001 - start of a 3-byte sequence. The bytes that follow don’t match the 10xxxxxx continuation pattern, so the decoder reports an error or outputs a replacement character (U+FFFD). In the other direction, 0xC3 0xA9 (the correct UTF-8 for é) decoded as Latin-1 gives Ã + ©, producing HÃ©llo.
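
Both failure directions can be reproduced in a few lines of Python. The variable names are arbitrary; the final step shows that this particular kind of corruption is reversible as long as no bytes were lost along the way.

```python
original = "Héllo"

# UTF-8 bytes misread as Latin-1: each byte becomes its own character.
utf8_bytes = original.encode("utf-8")        # b'H\xc3\xa9llo'
mojibake = utf8_bytes.decode("latin-1")      # 'HÃ©llo'

# Latin-1 bytes misread as UTF-8: 0xE9 starts a sequence that never completes.
latin1_bytes = original.encode("latin-1")    # b'H\xe9llo'
replaced = latin1_bytes.decode("utf-8", errors="replace")   # 'H\ufffdllo'

# Undo the first mistake by reversing the wrong decode, then decoding correctly.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(mojibake, replaced, repaired)
```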


Common Encoding Bugs

MySQL utf8 vs utf8mb4

MySQL’s utf8 only handles up to 3-byte sequences. Emoji need 4 bytes. Use utf8mb4 always.

-- Set on connection
SET NAMES utf8mb4;
# JDBC connection string (Connector/J expects a Java charset name here; recent versions map UTF-8 to utf8mb4)
jdbc:mysql://host/db?useUnicode=true&characterEncoding=UTF-8
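
One way to check whether a piece of text would be affected is to look for code points above U+FFFF, since those are the ones that need 4 bytes in UTF-8. A minimal sketch in Python, with `needs_utf8mb4` as a hypothetical helper name:

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the BMP and so needs 4 UTF-8 bytes."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("Héllo 中"))   # False: every character fits in 3 bytes or fewer
print(needs_utf8mb4("Hi 😀"))      # True: the emoji needs 4 bytes
```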

Python file/CSV reads

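Python 3 strings are Unicode, but open() without encoding= uses the locale's preferred encoding (UTF-8 on most Linux/macOS systems, often cp1252 on Windows). Pass encoding explicitly. Excel exports are often latin1 or cp1252.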

open('file.csv', encoding='utf-8')        # explicit UTF-8
open('file.csv', encoding='utf-8-sig')    # UTF-8, strips BOM if present
open('file.csv', encoding='latin1')       # Windows/Excel exports
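
When the source of a file is unknown, one common approach is to try strict UTF-8 first and fall back to a legacy encoding only if that fails. The sketch below assumes that order and those two encodings; `read_text_with_fallback` is a hypothetical helper, and it cannot tell cp1252 from other single-byte encodings.

```python
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Return (text, encoding_used), trying each encoding in order."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Trying UTF-8 first works because legacy text with accented characters is rarely valid UTF-8 by accident, whereas latin1 accepts any byte sequence and would never fail.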

Java String.length() with emoji

Returns UTF-16 code units, not characters. "😀".length() = 2.

String s = "😀";
s.length();                        // 2  (UTF-16 code units)
s.codePointCount(0, s.length());   // 1  (actual characters)

URL / percent encoding

Non-ASCII characters must be percent-encoded as their UTF-8 bytes. é (0xC3 0xA9) → %C3%A9. Double-encoding gives %25C3%25A9. Decoding with the wrong charset gives corrupted query params.
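
The three cases above can be reproduced with Python's `urllib.parse`, which percent-encodes using UTF-8 by default. Variable names are arbitrary.

```python
from urllib.parse import quote, unquote

encoded = quote("é")                               # UTF-8 bytes C3 A9, percent-encoded
double_encoded = quote(encoded)                    # the % signs themselves get encoded
wrong = unquote(encoded, encoding="latin-1")       # decoded with the wrong charset
right = unquote(encoded)                           # decoded as UTF-8

print(encoded)          # %C3%A9
print(double_encoded)   # %25C3%25A9
print(wrong)            # Ã©
print(right)            # é
```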

HTTP Content-Type

Content-Type: text/html; charset=utf-8

Without charset, browsers guess. Declare it explicitly.
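
On the receiving side, the declared charset can be read from the header value with the standard library's `email.message.Message`, which understands this header syntax. This is a sketch; `charset_from_content_type` is a hypothetical helper name.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

A `None` result is the case where the consumer is left to guess.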


Encoding Comparison

Encoding                Bytes/char   ASCII compatible   Used by
ASCII                   1 (7-bit)    Is ASCII           Legacy systems
Latin-1 / ISO 8859-1    1            0–127 only         Western European legacy
Windows-1252            1            0–127 only         Windows, Excel
UTF-32                  4 (fixed)    No                 Some internal runtimes
UTF-16                  2 or 4       No                 Java, .NET, JS internals, Windows
UTF-8                   1–4          Yes                Web, Linux, modern default

Number Base Conversion

Relevant when reading byte values and code points.

Decimal   Hex    Binary
0         0x00   0000 0000
10        0x0A   0000 1010
15        0x0F   0000 1111
16        0x10   0001 0000
65        0x41   0100 0001
97        0x61   0110 0001
127       0x7F   0111 1111
128       0x80   1000 0000
195       0xC3   1100 0011
169       0xA9   1010 1001
233       0xE9   1110 1001
255       0xFF   1111 1111

Conversion rules

Decimal → Hex: divide by 16 repeatedly, remainders (in reverse) are the hex digits. Digits above 9: A=10, B=11, C=12, D=13, E=14, F=15.

Hex → Binary: each hex digit maps directly to 4 bits.

Hex   Binary
0     0000
1     0001
2     0010
3     0011
4     0100
5     0101
6     0110
7     0111
8     1000
9     1001
A     1010
B     1011
C     1100
D     1101
E     1110
F     1111

Example: 0xC3 → C = 1100, 3 = 0011 → 1100 0011 = 195 in decimal.

Binary → Decimal: sum of bit × 2^position (right to left, starting at 0). 1100 0011 = 128 + 64 + 2 + 1 = 195.
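
The three rules can be written out step by step in Python. The function names are hypothetical and exist only for this sketch; in everyday code the built-ins `hex()`, `bin()` and `int(s, base)` do the same job.

```python
HEX_DIGITS = "0123456789ABCDEF"

def decimal_to_hex(n: int) -> str:
    """Divide by 16 repeatedly; the remainders, read in reverse, are the hex digits."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 16)
        digits.append(HEX_DIGITS[remainder])
    return "".join(reversed(digits))

def hex_to_binary(h: str) -> str:
    """Map each hex digit directly to 4 bits."""
    return " ".join(f"{int(digit, 16):04b}" for digit in h)

def binary_to_decimal(bits: str) -> int:
    """Sum bit * 2^position, counting positions from the right starting at 0."""
    bits = bits.replace(" ", "")
    return sum(int(bit) * 2 ** pos for pos, bit in enumerate(reversed(bits)))

print(decimal_to_hex(195))              # C3
print(hex_to_binary("C3"))              # 1100 0011
print(binary_to_decimal("1100 0011"))   # 195
```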
