Localization is about more than cosmetic differences. Forget whether the year comes before or after the month in a date: basic beliefs about how the world works vary significantly. None of the following assumptions holds everywhere:
- Places have only one official name.
- Place names follow the character rules of the language.
- Place names can be written with the usual character set of a country.
- People have exactly one canonical full name.
- People have exactly one full name which they go by.
- People’s names do not change.
- People’s names change, but only at a certain enumerated set of events.
- People have last names, family names, or anything else which is shared by folks recognized as their relatives.
- There are two and only two genders.
- Okay, then there are two and only two biological genders.
- Gender is determined solely by biology.
When people started building electronic communication systems, it was easy to continue assigning each distinct character a number. Since early systems needed to be simple, each character was assigned a fixed-length binary number.
… but almost everyone outside the United States needs more characters and uses 8 bits to store extended characters beyond basic ASCII. Worse, text can often be exchanged under the wrong encoding without anyone noticing, until the first document arrives that uses one of the characters that differ!
Because some individual languages need more than 256 characters, there is no possibility of a standard 8-bit encoding emerging.
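The silent mismatch between 8-bit encodings can be sketched in a few lines. This is only an illustration: the choice of Latin-1 and Windows-1251 as the two encodings, and the sample words, are assumptions made for the example rather than anything prescribed by the text.

```python
# The same bytes, read under two different 8-bit encodings.
data = "caf\u00e9".encode("latin-1")      # b'caf\xe9'

as_latin1 = data.decode("latin-1")        # the sender's intent: "café"
as_cp1251 = data.decode("cp1251")         # a Cyrillic reader sees "cafй"

# Pure ASCII text decodes identically under both, so the mismatch
# stays invisible until a byte above 0x7F appears.
ascii_bytes = b"hello"
ascii_matches = ascii_bytes.decode("latin-1") == ascii_bytes.decode("cp1251")

print(as_latin1, as_cp1251, ascii_matches)
```

Neither decode raises an error, which is exactly why the problem tends to go unnoticed.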
Maximilian Dörrbecker via Wikimedia Commons (CC-BY-SA-3.0)
Starting in the 1980s, engineers from various companies began working on an ambitious project: a universal 16-bit character set which could represent every character used in human writing. At some point it expanded beyond 16 bits, but the goal hasn't changed:
The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.
The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. … In all, the Unicode Standard, Version 6.0 provides codes for 109,449 characters from the world's alphabets, ideograph sets, and symbol collections.
Character: the smallest component of written language that has semantic value; refers to the abstract meaning…
Key concept: this is not the same as a byte or number!
Diacritic: (1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) A mark applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information).
Form | Code points | Rendered |
---|---|---|
Precomposed | LATIN SMALL LETTER E WITH ACUTE (U+00E9) | é |
Combining sequence | LATIN SMALL LETTER E (U+0065) + COMBINING ACUTE ACCENT (U+0301) | é |
Unicode provides rules to determine when characters are exactly the same (canonical equivalence, e.g. U+00F1 = U+006E + U+0303 = ñ) and also when they are functionally the same (compatibility equivalence, e.g. the ligature "ﬀ" (U+FB00) == "ff" for searching but not display).
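Both kinds of equivalence can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only: the helper names `canonically_equal` and `compatibility_equal` are made up for this example, and NFC/NFKC are just one reasonable choice of normalization forms to compare under.

```python
import unicodedata

precomposed = "\u00f1"   # ñ as a single code point
decomposed = "n\u0303"   # n followed by COMBINING TILDE
ligature = "\ufb00"      # LATIN SMALL LIGATURE FF

# A raw comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

def canonically_equal(a, b):
    """Hypothetical helper: compare after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def compatibility_equal(a, b):
    """Hypothetical helper: compare after compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(raw_equal)                               # raw code points differ
print(canonically_equal(precomposed, decomposed))
print(canonically_equal(ligature, "ff"))       # the ligature survives NFC
print(compatibility_equal(ligature, "ff"))     # NFKC folds it to "ff"
```

Compatibility normalization discards formatting distinctions, so it suits search keys better than stored display text.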
This can also apply to numbers – these are all numerically equivalent but have significantly different semantic meaning:
5
٥
۵
߅
५
৫
੫
૫
୫
௫
౫
೫
൫
๕
໕
༥
၅
႕
៥
᠕
᥋
᧕
᪅
᪕
᭕
᮵
᱅
᱕
꘥
꣕
꤅
꧕
꩕
꯵
5
𐒥
𑁫
𝟓
𝟝
𝟧
𝟱
𝟻
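To see that such characters share a numeric value while remaining distinct characters, one option is to ask the Unicode character database directly. This is a small sketch using four of the digits above, written as escapes; the selection is arbitrary.

```python
import unicodedata

# DIGIT FIVE, ARABIC-INDIC, DEVANAGARI and THAI digit five.
fives = ["5", "\u0665", "\u096b", "\u0e55"]

for ch in fives:
    # Each character has its own name but the same decimal value.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicodedata.decimal(ch)}")

values = [unicodedata.decimal(ch) for ch in fives]

# Python's int() also accepts any Unicode decimal digit.
parsed = [int(ch) for ch in fives]
```

Because `int()` and `str.isdigit()` accept these characters too, input that must be ASCII-only needs an explicit check.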
Collation: determining how strings compare for the purposes of sorting.
By Language | |
---|---|
Swedish: | z < ö |
German: | ö < z |
German: | ß = ss |
By Context | |
French: | cote < côte < coté < côté |
By Usage | |
German Dictionary: | of < öf |
German Telephone: | öf < of |
Sources: Unicode Technical Standard #10 and Wikipedia: Alphabetical order
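The language-dependent orderings in the table can be imitated with custom sort keys. The following is a toy sketch, not the Unicode Collation Algorithm: `german_dictionary_key`, `swedish_key` and `SWEDISH_ALPHABET` are hypothetical names invented for this example, and production code would normally rely on a collation library such as ICU instead.

```python
import unicodedata

def german_dictionary_key(word):
    """Toy key: ignore diacritics first, use the original word as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, word)

# Assumed simplified Swedish alphabet: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Toy key: rank letters by alphabet position (raises ValueError otherwise)."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_ALPHABET.index(ch) for ch in normalized]

words = ["zebra", "\u00f6l", "ost"]
german_order = sorted(words, key=german_dictionary_key)
swedish_order = sorted(words, key=swedish_key)

print(german_order)   # ö sorts with o, before z
print(swedish_order)  # ö sorts after z
```

The same three words come out in a different order depending on which key is used, which is why a sort order cannot be chosen without knowing the user's locale.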
SPQRobin via Wikimedia Commons (CC-BY-SA-3.0)
The Unicode standard describes abstract characters, but we need a way to convert them into bytes for storage and exchange. Early experiments that simply widened each 8-bit ASCII character to 16 bits revealed significant problems.
UTF-8 was developed to avoid these problems. It's a very clever variable-length encoding scheme under which all existing 7-bit ASCII is valid, all common non-Asian characters require only 2 bytes, common CJK characters still need only 3 bytes, and 4 bytes are needed only for rare and historical characters (plus newer additions such as emoji). Because it's read one byte at a time, there's no need for a BOM.
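The variable-length behaviour is easy to observe by encoding a few sample characters. This is a minimal sketch; `utf8_length` is a helper name made up for the example, and the sample characters are arbitrary picks from each length class.

```python
def utf8_length(ch):
    """Hypothetical helper: number of bytes one character needs in UTF-8."""
    return len(ch.encode("utf-8"))

samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin letter with accent",
    "\u4e2d": "common CJK ideograph",
    "\U0001F30F": "emoji outside the 16-bit range",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")

# Existing ASCII data is already valid UTF-8, byte for byte.
ascii_compatible = "hello".encode("utf-8") == "hello".encode("ascii")
```

This also illustrates the earlier point that a character is not a byte: the length of a string in characters and its length in bytes only coincide for pure ASCII.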
All major operating systems and programming languages support Unicode and UTF-8, although compatibility is still a consideration for the more recent features added in version 6, such as emoji (🌏) or regional indicators (🇺 🇸 = 🇺🇸, 🇫 🇷 = 🇫🇷, etc.).
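Regional indicators work by pairing: two indicator symbols spelling a two-letter country code are displayed as one flag where the platform supports it. A small sketch of the mapping follows; the function name `flag` is an assumption for this example, and whether a flag actually appears depends on the font and renderer.

```python
# Code point of REGIONAL INDICATOR SYMBOL LETTER A.
REGIONAL_INDICATOR_A = 0x1F1E6

def flag(country_code):
    """Hypothetical helper: map a two-letter code to two regional indicators."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected a two-letter code such as 'US'")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)

us = flag("US")
fr = flag("fr")

# One visible flag is still two code points and eight UTF-8 bytes.
print(us, len(us), len(us.encode("utf-8")))
```

Systems without support show the two letter symbols side by side instead of a flag, which is the compatibility concern mentioned above.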
We previously discussed how accented characters can be formed by combining a base character with the desired diacritic. The concept of multiple characters producing a visually distinct glyph is relatively unusual in English, where only a few ligatures are at all commonly used (perhaps the best known being the “æ” in “encyclopædia”), but other languages depend on this behaviour.
A complex text layout system allows the visual display to be significantly altered based on the context. If you need to support complex languages this will affect your font choices and design options!
ا ل ع ر ب ي ة
العربية
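The two lines above contain the same letters; only the spaces prevent the layout engine from joining them. A short sketch can confirm this at the code point level. The escapes below are assumed to match the letters shown (alef, lam, ain, reh, beh, yeh, teh marbuta).

```python
import unicodedata

spaced = "\u0627 \u0644 \u0639 \u0631 \u0628 \u064a \u0629"
joined = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"

# Removing the spaces yields exactly the same code points: the
# connected shapes are chosen at display time, not stored in the text.
same_code_points = spaced.replace(" ", "") == joined

# Every letter carries the bidirectional class "AL" (Arabic Letter),
# which tells the layout engine to run right to left.
bidi_classes = {unicodedata.bidirectional(ch) for ch in joined}

for ch in joined:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Since the shaping happens in the renderer, a font or text component without complex text layout support will show the isolated forms even though the stored string is correct.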