Background

Terminology

Locale
A collection of preferences defining how a system should behave for a target group. For example, users in the United States, Great Britain and Australia mostly share a language but choose different ways to spell words, display dates and measure.
Localization (l10n)
A collection of preferences defining how the user interface should behave for a locale. This involves a number of surprisingly complex topics, ranging from basic text processing and number formatting to questions about preferred colors and icons, and even legal requirements.
Internationalization (i18n)
Making it easy to localize software: in general this involves identifying locale dependency points and adding an abstraction mechanism to manage locale-specific changes.

Cultural Differences

Localization is about more than cosmetic differences: forget whether the year comes before or after the month in a date; basic beliefs about how the world works vary significantly.

  1. Geography's pretty universal, right?
    Falsehoods Programmers Believe About Geography
    • Places have only one official name
    • Place names follow the character rules of the language
    • Place names can be written with the usual character set of a country
  2. Well, what about someone's name?
    Falsehoods Programmers Believe About Names
    1. People have exactly one canonical full name.
    2. People have exactly one full name which they go by.
    3. People’s names do not change.
    4. People’s names change, but only at a certain enumerated set of events.
    5. People have last names, family names, or anything else which is shared by folks recognized as their relatives.
  3. What about something as simple as a person's gender? That's just biology, right?
    Falsehoods Programmers Believe About Gender
    • There are two and only two genders
    • Okay, then there are two and only two biological genders.
    • Gender is determined solely by biology.

Dealing with Cultural Differences

Writing

An Abbreviated History of Electronic Text

When people started building electronic communication systems, it was easy to continue assigning each distinct character a number. Since early systems needed to be simple, each character was assigned a fixed-length binary number.

The Range of Human Writing

Writing Systems of the World

Maximilian Dörrbecker via Wikimedia Commons (CC-BY-SA-3.0)

Unicode

Starting in the 1980s, engineers from various companies started working on an ambitious project: a universal 16-bit character set which could represent every character used in human writing. At some point it expanded beyond 16 bits, but the goal hasn't changed.

The Unicode® Standard: A Technical Introduction

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. … In all, the Unicode Standard, Version 6.0 provides codes for 109,449 characters from the world's alphabets, ideograph sets, and symbol collections.

Terminology

Character
The smallest component of written language that has semantic value; refers to the abstract meaning…

Key concept: this is not the same as a byte or number!

Diacritic
(1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) A mark applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information).
Grapheme Cluster
Combining Character
Unicode supports a key concept: multiple characters can be combined to produce a single displayed character (generally, anything a user would consider a single character is a “grapheme cluster”).
é
LATIN SMALL LETTER E WITH ACUTE (U+00E9)
e ´ é
LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301)
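Both forms render identically, which can be confirmed in practice. As a brief sketch in Python (using the standard library's `unicodedata` module), normalization converts between the precomposed and decomposed forms:

```python
import unicodedata

# Precomposed form: a single code point, U+00E9
precomposed = "\u00e9"    # é
# Decomposed form: base letter U+0065 plus combining accent U+0301
decomposed = "e\u0301"    # e + combining acute

# As code-point sequences, the strings differ...
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# ...but normalizing both to NFC (composed) or NFD (decomposed)
# makes them compare equal
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Only the underlying code points differ; any code comparing user-visible text should normalize first.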

Terminology

Equivalence

Unicode provides rules to determine when characters are exactly the same (e.g. U+00F1 = U+006E + U+0303 = ñ) and also when they are functionally the same (e.g. the ligature "ﬀ" matches "ff" for searching but not for display).

This can also apply to numbers – these are all numerically equivalent but have significantly different semantic meaning:
5 ٥ ۵ ߅ 𐒥 𑁫 𝟓 𝟝 𝟧 𝟱 𝟻
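These equivalences can be checked directly. A small Python sketch using the standard `unicodedata` module (the digits tested here are a subset of the row above):

```python
import unicodedata

# Canonical equivalence: U+006E + U+0303 composes to U+00F1 (ñ) under NFC
assert unicodedata.normalize("NFC", "n\u0303") == "\u00f1"

# Compatibility equivalence: the ligature ﬀ (U+FB00) survives canonical
# normalization unchanged, but folds to "ff" under NFKC, which is what
# makes it match for searching
assert unicodedata.normalize("NFC", "\ufb00") == "\ufb00"
assert unicodedata.normalize("NFKC", "\ufb00") == "ff"

# Several of the digit variants all carry the numeric value 5
assert all(unicodedata.digit(ch) == 5 for ch in "5٥۵𝟓")
```

Canonical forms (NFC/NFD) preserve meaning exactly; compatibility forms (NFKC/NFKD) discard display distinctions, so they suit search indexes but not stored text.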

Case Mapping
Various alphabets (Latin, Georgian, Armenian, Cyrillic, etc.) have the concept of “case” and Unicode has rules for converting text from one case to another, including the various language-specific complications this can entail: for example, the German ß (SMALL LETTER SHARP S) converts to uppercase as “SS” (see Unicode Standard Annex #21: CASE MAPPINGS).
Case Folding
Case folding is a similar process, generally used for caseless comparisons (e.g. search). In Unicode this is an expanded form of the lowercase mapping which is consistent, but the output is not suitable for display to users.
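Python's built-in `str.upper` and `str.casefold` implement the mappings described above, so the ß example is easy to demonstrate (a sketch, not a complete caseless-matching routine):

```python
# Case mapping is for display: German ß uppercases to "SS"
assert "straße".upper() == "STRASSE"

# Case folding is for caseless matching and is more aggressive than lower():
assert "straße".lower() == "straße"        # lower() leaves ß unchanged
assert "straße".casefold() == "strasse"    # casefold() expands ß to "ss"

def caseless_equal(a: str, b: str) -> bool:
    """Caseless comparison; real code should also normalize (e.g. NFC) first."""
    return a.casefold() == b.casefold()

assert caseless_equal("STRASSE", "straße")
```

Note that "strasse" is never shown to a user; the folded form exists purely so comparisons come out consistent.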

Terminology

Collation

Determining how strings compare for the purposes of sorting

Sample character collation rules
By Language
Swedish: z < ö
German: ö < z
German: ß = ss
By Context
French: cote < côte < coté < côté
By Usage
German Dictionary: of < öf
German Telephone: öf < of

Sources: Unicode Technical Standard #10 and Wikipedia: Alphabetical order
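Naive code-point-order sorting ignores all of these rules. A quick Python illustration (proper locale-aware collation needs `locale.strxfrm` with an installed locale, or a library such as ICU; this snippet only shows the default behaviour):

```python
# Python's default sort compares code points: "ö" is U+00F6, which sorts
# after "z" (U+007A). That happens to match Swedish rules, but it is
# wrong for German, where ö sorts near o.
words = ["öf", "of", "zoo"]
assert sorted(words) == ["of", "zoo", "öf"]
```

The same input must sort differently depending on the user's locale, so collation can never be a property of the strings alone.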

Directionality

Writing Directions of the World

SPQRobin via Wikimedia Commons (CC-BY-SA-3.0)

Unicode Encodings

The Unicode standard describes abstract characters but we need a way to convert them into bytes for storage and exchange. Early experiments which simply doubled 8-bit ASCII to 16-bits revealed significant problems:

  1. Not every processor stores the bytes comprising a 16-bit integer the same way — big-endian (“UNIX”) or little-endian (“NUXI”) — necessitating a special byte-order mark (BOM) at the start of the string simply to decode it
  2. All existing text would need to be converted!
  3. All text becomes more expensive to store and process
  4. It wasn't enough: Chinese alone would require at least 3 bytes!
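Problem 1 is easy to observe with Python's built-in codecs (a brief sketch):

```python
# Plain "utf-16" prepends a byte-order mark (BOM) so a decoder can detect
# endianness: U+FEFF serialized in little- or big-endian order
encoded = "abc".encode("utf-16")
assert encoded[:2] in (b"\xff\xfe", b"\xfe\xff")

# The explicitly ordered variants carry no BOM
assert "abc".encode("utf-16-le") == b"a\x00b\x00c\x00"
assert "abc".encode("utf-16-be") == b"\x00a\x00b\x00c"
```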

UTF-8 was developed to avoid these problems. It's a very clever variable-length encoding scheme: all existing 7-bit ASCII is valid UTF-8, all common non-Asian characters require only 2 bytes, common CJK still needs only 3 bytes, and 4 bytes are needed only for rarer and historical characters. Because it's decoded one byte at a time, there's no need for a BOM.
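The byte counts are easy to verify in Python:

```python
# UTF-8 is variable length: higher code points take more bytes
assert len("A".encode("utf-8")) == 1   # 7-bit ASCII: 1 byte
assert len("é".encode("utf-8")) == 2   # most non-Asian scripts: 2 bytes
assert len("中".encode("utf-8")) == 3   # common CJK: 3 bytes
assert len("𝟓".encode("utf-8")) == 4   # outside the BMP: 4 bytes

# Valid ASCII is already valid UTF-8, byte for byte
assert "hello".encode("ascii") == "hello".encode("utf-8")
```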

State of Unicode

Complex scripts

We previously discussed how accented characters can be formed by combining a base character with the desired diacritic. The concept of multiple characters producing a single visually distinct glyph is relatively unusual in English, where only a few ligatures are at all commonly used - perhaps the best known being the “æ” in encyclopædia - but other languages depend on this behaviour.

A complex text layout system allows the visual display to be significantly altered based on the context. If you need to support complex languages this will affect your font choices and design options!

The name of the Arabic language as individual characters and written normally

ا ل ع ر ب ي ة
العربية
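Crucially, both lines contain exactly the same code points: the joined letter shapes are chosen by the text renderer at display time, not stored in the string. A quick Python check:

```python
# The seven isolated letters joined together form the same string as the
# normally written word; contextual shaping happens during rendering
letters = ["ا", "ل", "ع", "ر", "ب", "ي", "ة"]
word = "".join(letters)
assert word == "العربية"
assert len(word) == 7
```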
