Background

Terminology

Locale
A collection of preferences defining how a system should behave for a target group. For example, users in the United States, Great Britain and Australia mostly share a language but choose different ways to spell words, display dates and measure.
Localization (l10n)
A collection of preferences defining how the user interface should behave for a locale. This involves a number of surprisingly complex topics, ranging from basic text processing and number formatting to questions about preferred colors and icons, and even legal requirements.
Internationalization (i18n)
Making it easy to localize software: in general this involves identifying locale dependency points and adding an abstraction mechanism to manage locale-specific changes.

Cultural Differences

Localization is about more than cosmetic differences: forget whether the year comes before or after the month in a date; basic beliefs about how the world works vary significantly.

  1. Geography's pretty universal, right?
    Falsehoods Programmers Believe About Geography
    • Places have only one official name
    • Place names follow the character rules of the language
    • Place names can be written with the usual character set of a country
  2. Well, what about someone's name?
    Falsehoods Programmers Believe About Names
    1. People have exactly one canonical full name.
    2. People have exactly one full name which they go by.
    3. People’s names do not change.
    4. People’s names change, but only at a certain enumerated set of events.
    5. People have last names, family names, or anything else which is shared by folks recognized as their relatives.
  3. What about something as simple as a person's gender? That's just biology, right?
    Falsehoods Programmers Believe About Gender
    • There are two and only two genders
    • Okay, then there are two and only two biological genders.
    • Gender is determined solely by biology.

Dealing with Cultural Differences

Writing

An Abbreviated History of Electronic Text

When people started building electronic communication systems, it was easy to continue assigning each distinct character a number. Since early systems needed to be simple, each character was assigned a fixed-length binary number.

The Range of Human Writing

Writing Systems of the World

Maximilian Dörrbecker via Wikimedia Commons (CC-BY-SA-3.0)

Unicode

Starting in the 1980s, engineers from various companies started working on an ambitious project: a universal 16-bit character set which could represent every character used in human writing. At some point it expanded beyond 16 bits, but the goal hasn't changed.

The Unicode® Standard: A Technical Introduction

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. … In all, the Unicode Standard, Version 6.0 provides codes for 109,449 characters from the world's alphabets, ideograph sets, and symbol collections.

Terminology

Character
The smallest component of written language that has semantic value; refers to the abstract meaning…

Key concept: this is not the same as a byte or number!

Diacritic
(1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) A mark applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information).
Grapheme Cluster
Combining Character
Unicode supports a key concept: multiple characters can be combined to produce a single displayed character (generally, anything a user would consider a single character is a “grapheme cluster”).
é
LATIN SMALL LETTER E WITH ACUTE (U+00E9)
e ´ é
LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301)
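Both forms render identically, which can be confirmed in practice. As a brief sketch in Python (using the standard library's `unicodedata` module), normalization converts between the precomposed and decomposed forms:

```python
import unicodedata

# Precomposed form: a single code point, U+00E9
precomposed = "\u00e9"    # é
# Decomposed form: base letter U+0065 plus combining accent U+0301
decomposed = "e\u0301"    # e + combining acute

# As code-point sequences, the strings differ...
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# ...but normalizing both to NFC (composed) or NFD (decomposed)
# makes them compare equal
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Only the underlying code points differ; any code comparing user-visible text should normalize first.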

Terminology

Equivalence

Unicode provides rules to determine when characters are exactly the same (e.g. U+00F1 = U+006E + U+0303 = ñ) and also when they are functionally the same (e.g. the ligature "ﬀ" matches "ff" for searching but not for display).

This can also apply to numbers – these are all numerically equivalent but have significantly different semantic meaning:
5 ٥ ۵ ߅ 𐒥 𑁫 𝟓 𝟝 𝟧 𝟱 𝟻
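These equivalences can be checked directly. A small Python sketch using the standard `unicodedata` module (the digits tested here are a subset of the row above):

```python
import unicodedata

# Canonical equivalence: U+006E + U+0303 composes to U+00F1 (ñ) under NFC
assert unicodedata.normalize("NFC", "n\u0303") == "\u00f1"

# Compatibility equivalence: the ligature ﬀ (U+FB00) survives canonical
# normalization unchanged, but folds to "ff" under NFKC, which is what
# makes it match for searching
assert unicodedata.normalize("NFC", "\ufb00") == "\ufb00"
assert unicodedata.normalize("NFKC", "\ufb00") == "ff"

# Several of the digit variants all carry the numeric value 5
assert all(unicodedata.digit(ch) == 5 for ch in "5٥۵𝟓")
```

Canonical forms (NFC/NFD) preserve meaning exactly; compatibility forms (NFKC/NFKD) discard display distinctions, so they suit search indexes but not stored text.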

Case Mapping
Various alphabets (Latin, Georgian, Armenian, Cyrillic, etc.) have the concept of “case” and Unicode has rules for converting text from one case to another, including the various language-specific complications this can entail: for example, the German ß (SMALL LETTER SHARP S) converts to uppercase as “SS” (see Unicode Standard Annex #21: CASE MAPPINGS).
Case Folding
Case folding is a similar process, generally used for caseless comparisons (e.g. search). In Unicode this is an expanded form of the lowercase mapping which is consistent, but the output is not suitable for display to users.
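Python's built-in `str.upper` and `str.casefold` implement the mappings described above, so the ß example is easy to demonstrate (a sketch, not a complete caseless-matching routine):

```python
# Case mapping is for display: German ß uppercases to "SS"
assert "straße".upper() == "STRASSE"

# Case folding is for caseless matching and is more aggressive than lower():
assert "straße".lower() == "straße"        # lower() leaves ß unchanged
assert "straße".casefold() == "strasse"    # casefold() expands ß to "ss"

def caseless_equal(a: str, b: str) -> bool:
    """Caseless comparison; real code should also normalize (e.g. NFC) first."""
    return a.casefold() == b.casefold()

assert caseless_equal("STRASSE", "straße")
```

Note that "strasse" is never shown to a user; the folded form exists purely so comparisons come out consistent.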

Terminology

Collation

Determining how strings compare for the purposes of sorting

Sample character collation rules
By Language
Swedish: z < ö
German: ö < z
German: ß = ss
By Context
French: cote < côte < coté < côté
By Usage
German Dictionary: of < öf
German Telephone: öf < of

Sources: Unicode Technical Standard #10 and Wikipedia: Alphabetical order
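Naive code-point-order sorting ignores all of these rules. A quick Python illustration (proper locale-aware collation needs `locale.strxfrm` with an installed locale, or a library such as ICU; this snippet only shows the default behaviour):

```python
# Python's default sort compares code points: "ö" is U+00F6, which sorts
# after "z" (U+007A). That happens to match Swedish rules, but it is
# wrong for German, where ö sorts near o.
words = ["öf", "of", "zoo"]
assert sorted(words) == ["of", "zoo", "öf"]
```

The same input must sort differently depending on the user's locale, so collation can never be a property of the strings alone.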

Directionality

Writing Directions of the World

SPQRobin via Wikimedia Commons (CC-BY-SA-3.0)

Unicode Encodings

The Unicode standard describes abstract characters but we need a way to convert them into bytes for storage and exchange. Early experiments which simply doubled 8-bit ASCII to 16-bits revealed significant problems:

  1. Not every processor stores the bytes comprising a 16-bit integer the same way — big-endian (“UNIX”) or little-endian (“NUXI”) — necessitating a special byte-order mark (BOM) at the start of the string simply to decode it
  2. All existing text would need to be converted!
  3. All text becomes more expensive to store and process
  4. It wasn't enough: Chinese alone would require at least 3 bytes!
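Problem 1 is easy to observe with Python's built-in codecs (a brief sketch):

```python
# Plain "utf-16" prepends a byte-order mark (BOM) so a decoder can detect
# endianness: U+FEFF serialized in little- or big-endian order
encoded = "abc".encode("utf-16")
assert encoded[:2] in (b"\xff\xfe", b"\xfe\xff")

# The explicitly ordered variants carry no BOM
assert "abc".encode("utf-16-le") == b"a\x00b\x00c\x00"
assert "abc".encode("utf-16-be") == b"\x00a\x00b\x00c"
```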

UTF-8 was developed to avoid these problems. It's a very clever variable-length encoding scheme: all existing 7-bit ASCII is valid UTF-8, all common non-Asian characters require only 2 bytes, common CJK still needs only 3 bytes, and 4 bytes are needed only for rarer and historical characters. Because it's decoded one byte at a time, there's no need for a BOM.
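The byte counts are easy to verify in Python:

```python
# UTF-8 is variable length: higher code points take more bytes
assert len("A".encode("utf-8")) == 1   # 7-bit ASCII: 1 byte
assert len("é".encode("utf-8")) == 2   # most non-Asian scripts: 2 bytes
assert len("中".encode("utf-8")) == 3   # common CJK: 3 bytes
assert len("𝟓".encode("utf-8")) == 4   # outside the BMP: 4 bytes

# Valid ASCII is already valid UTF-8, byte for byte
assert "hello".encode("ascii") == "hello".encode("utf-8")
```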

State of Unicode

Complex scripts

We previously discussed how accented characters can be formed by combining a base character with the desired diacritic. The concept of multiple characters producing a single visually distinct glyph is relatively unusual in English, where only a few ligatures are at all commonly used - perhaps the best known being the “æ” in encyclopædia - but other languages depend on this behaviour.

A complex text layout system allows the visual display to be significantly altered based on the context. If you need to support complex languages this will affect your font choices and design options!

The name of the Arabic language as individual characters and written normally

ا ل ع ر ب ي ة
العربية
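Crucially, both lines contain exactly the same code points: the joined letter shapes are chosen by the text renderer at display time, not stored in the string. A quick Python check:

```python
# The seven isolated letters joined together form the same string as the
# normally written word; contextual shaping happens during rendering
letters = ["ا", "ل", "ع", "ر", "ب", "ي", "ة"]
word = "".join(letters)
assert word == "العربية"
assert len(word) == 7
```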
