Localization is about more than cosmetic differences: never mind whether the year comes before or after the month in a date, even basic assumptions about how the world works vary significantly from place to place.
When people started building electronic communication systems, it was easy to continue assigning each distinct character a number. Since early systems needed to be simple, each character was assigned a fixed-length binary number.
… but almost everyone outside the United States needs more characters and used the eighth bit to store extended characters beyond basic ASCII. Worse, text can be exchanged between systems that disagree about the encoding without anyone noticing, until the first document arrives that actually uses one of the characters which differ!
Since there are individual languages which need more than 256 characters, there's no possibility of a standard 8-bit encoding emerging.
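As a concrete illustration, here is a small Python sketch (the language and the particular code pages are just illustrative choices, not anything prescribed above) showing how the same bytes silently decode to different characters under different legacy 8-bit encodings:

```python
# "é" is byte 0xE9 in Latin-1, but that byte value means something
# else entirely in other legacy 8-bit code pages.
raw = "café".encode("latin-1")   # b'caf\xe9'

print(raw.decode("latin-1"))     # café  (what the sender intended)
print(raw.decode("cp437"))       # cafΘ  (the same bytes on a DOS code page)
print(raw.decode("koi8_r"))      # cafИ  (the same bytes on a Russian code page)
```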
Starting in the 1980s, engineers from various companies began working on an ambitious project: a universal 16-bit character set which could represent every character used in human writing. It has since expanded beyond 16 bits, but the goal hasn't changed.
The smallest component of written language that has semantic value; refers to the abstract meaning…
Key concept: this is not the same as a byte or number!
(1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) A mark applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information).
First code point | Second code point | Result |
---|---|---|
é LATIN SMALL LETTER E WITH ACUTE (U+00E9) | | é |
e LATIN SMALL LETTER E (U+0065) | ´ COMBINING ACUTE ACCENT (U+0301) | é |
Unicode provides rules to determine when characters are canonically equivalent, i.e. exactly the same (e.g. U+00F1 = U+006E + U+0303 = ñ), and when they are only compatibility equivalent, i.e. functionally the same (e.g. the ligature "ﬀ" (U+FB00) matches "ff" for searching but not for display).
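For example, a small Python sketch using the standard-library unicodedata module (any Unicode-aware language would do) shows both kinds of equivalence:

```python
import unicodedata

composed   = "\u00F1"        # ñ  LATIN SMALL LETTER N WITH TILDE
decomposed = "\u006E\u0303"  # n  + COMBINING TILDE

# Canonically equivalent, but not equal as raw code point sequences...
print(composed == decomposed)                                # False
# ...until both are normalized to the same form (NFC or NFD).
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Compatibility equivalence: the ligature matches "ff" once
# compatibility normalization (NFKC/NFKD) is applied.
ligature = "\uFB00"          # ﬀ  LATIN SMALL LIGATURE FF
print(unicodedata.normalize("NFKC", ligature) == "ff")       # True
```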
This can also apply to numbers – these are all numerically equivalent but are distinct characters with significantly different semantic meaning (see the sketch after this list):
5
٥
۵
߅
५
৫
੫
૫
୫
௫
౫
೫
൫
๕
໕
༥
၅
႕
៥
᠕
᥋
᧕
᪅
᪕
᭕
᮵
᱅
᱕
꘥
꣕
꤅
꧕
꩕
꯵
5
𐒥
𑁫
𝟓
𝟝
𝟧
𝟱
𝟻
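A quick Python sketch (again using the standard-library unicodedata module, and only a handful of the characters above for brevity) confirms that each is a distinct code point carrying the decimal value 5:

```python
import unicodedata

# A small, illustrative subset of the digit-five characters listed above.
fives = ["5", "٥", "۵", "५", "৫", "๕", "５", "𝟓"]

for ch in fives:
    # Each is a different code point from a different script,
    # but all share the decimal digit value 5.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch):40} {unicodedata.digit(ch)}")
```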
Determining how strings compare for the purposes of sorting
By Language | |
---|---|
Swedish: | z < ö |
German: | ö < z |
German: | ß = ss |

By Context | |
---|---|
French: | cote < côte < coté < côté |

By Usage | |
---|---|
German Dictionary: | of < öf |
German Telephone: | öf < of |
Sources: Unicode Technical Standard #10 and Wikipedia: Alphabetical order
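A sketch of locale-aware sorting, assuming the third-party PyICU bindings are installed (the standard-library locale module can do the same job if the relevant OS locales are available):

```python
import icu  # PyICU: Python bindings for ICU, assumed to be installed

words = ["zebra", "öl"]

for locale_id in ("sv_SE", "de_DE"):
    # Build a collator for the locale and sort using its sort keys.
    collator = icu.Collator.createInstance(icu.Locale(locale_id))
    print(locale_id, sorted(words, key=collator.getSortKey))

# Swedish places ö after z:   sv_SE ['zebra', 'öl']
# German places ö before z:   de_DE ['öl', 'zebra']
```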
The Unicode standard describes abstract characters, but we need a way to convert them into bytes for storage and exchange. Early experiments which simply doubled 8-bit ASCII to 16 bits revealed significant problems: storage doubled even for plain ASCII text, the byte order had to be signalled with a marker (the BOM), and the embedded null bytes broke existing software that assumed one byte per character.
UTF-8 was developed to avoid these problems. It's a very clever variable-length encoding scheme: all existing 7-bit ASCII text is already valid UTF-8, common non-Asian characters require only 2 bytes, common CJK characters still need only 3 bytes, and 4 bytes are used only for rare and historical characters. Because it's read one byte at a time, there's no need for a BOM.
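A quick Python sketch illustrating the variable-length behaviour (the characters chosen are just examples):

```python
# UTF-8 is variable length: ASCII stays at 1 byte, most other alphabetic
# scripts take 2, common CJK takes 3, and supplementary-plane characters
# (including emoji) take 4.
for ch in ["A", "é", "中", "🌏"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex()}")
```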
All major operating systems and programming languages support Unicode and UTF-8, although compatibility is still a consideration for the more recent features added in version 6, such as emoji (🌏) or regional indicators (🇺 🇸 = 🇺🇸, 🇫 🇷 = 🇫🇷, etc.).
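For example, in Python (purely illustrative), a flag is really two regional indicator code points that the renderer combines into a single glyph:

```python
# REGIONAL INDICATOR SYMBOL LETTER U followed by LETTER S.
us_flag = "\U0001F1FA\U0001F1F8"

print(us_flag)                   # renders as 🇺🇸 where fonts support it
print(len(us_flag))              # 2: two code points, one visible glyph
print(len(us_flag.encode("utf-8")))  # 8 bytes in UTF-8 (4 per code point)
```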
We previously discussed how accented characters can be formed by combining a base character with the desired diacritic. The concept of multiple characters producing a single visually distinct glyph is relatively unusual in English, where only a few ligatures are at all commonly used - perhaps the best known being the “æ” in encyclopædia - but other languages depend on this behaviour.
A complex text layout system allows the visual display to be significantly altered based on the context. If you need to support complex languages this will affect your font choices and design options!