Handling Unicode

Unicode in Python 2.x

There's a persistent misconception that you need Python 3 to support Unicode. This is untrue: while the default string literal produces a str instance (i.e. bytes without a specified encoding), simply prefixing any string with u produces a full unicode object.

Python 2 will even helpfully convert them when you mix the two:

>>> type("foo")
<type 'str'>
>>> type(u"bar")
<type 'unicode'>
>>> "foo" + u"bar"
u'foobar'

Background: http://docs.python.org/howto/unicode.html

Unicode in Python 2.x: the bad parts

To preserve compatibility with existing code, Python 2 assumes byte-strings are ASCII. This means any source of input can trigger a UnicodeDecodeError:

>>> s = " ".join(['I', '\xe2\x98\xba', 'Unicode'])
>>> print s # Raw UTF-8 displays correctly in my UTF-8 terminal
I ☺ Unicode
>>> unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)

Everything works if you know the encoding and remember to decode first:

>>> s.decode("utf-8")
u'I \u263a Unicode'

It also “works” if you simply avoid treating bytes as text until you're prepared to deal with them. This was a problem in django-localeurl when requests were received with incorrectly encoded data: the rest of the system handled it, but the redirect logic would choke on a UnicodeDecodeError. This was fixed by doing less.
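The principle can be sketched like this (an illustrative reconstruction, not the actual django-localeurl patch): build the redirect from the raw bytes instead of decoding them first.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def redirect_path(raw_path):
    # raw_path arrives as raw bytes from the request. Percent-encoding the
    # bytes directly avoids the unicode round-trip that raised
    # UnicodeDecodeError on badly encoded input.
    return quote(raw_path, safe=b"/")
```

Because the bytes are never interpreted as text, malformed input passes through harmlessly instead of raising mid-redirect.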

Unicode in Python 3.x

Explicit is better than implicit

Learning from what was problematic in Python 2, the default string type is unicode, with a separate bytes type for raw I/O. You can't mix the two without an explicit conversion:

>>> "☹" + b"bytes"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
>>> "☺" + b" bytes".decode("ascii")
'☺ bytes'
>>> "☺".encode("utf-8") + b" bytes"
b'\xe2\x98\xba bytes'

Python 2 fails later; Python 3 forces you to fix it now

Unicode Best Practices

These rules are the same for Python 2 and 3, so you won't need to change your habits when Django migrates to Python 3.

I/O Conversion

The most basic form involves str and anything str-like: convert everything as soon as you see it:

s = foo.read().decode("utf-8")      # decode bytes to unicode on input
print >>foo, s.encode("utf-8")      # encode back to bytes on output (Python 2 print syntax)

For files, you might want to use the codecs module so you can read from the file normally and receive Unicode characters instead of byte-strings:

>>> import codecs
>>> my_file = codecs.open(foo_file, encoding="utf-8")
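The io module (Python 2.6+) offers the same behaviour with the interface of Python 3's built-in open. A minimal sketch, assuming a throwaway file in the system temp directory:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "unicode_demo.txt")

# Opening with an explicit encoding means you write and read unicode text;
# the encoding and decoding happen inside the file object.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"I \u263a Unicode")

with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()  # already decoded to unicode
```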

Handling invalid data

In general the best policy is to reject invalid text so the user can correct it immediately. However, in a batch context you might need to process what you can. Python allows you to “ignore” or strip anything which fails to decode:

foo.decode("utf-8", "ignore")

It may be preferable to instead replace each invalid sequence with the Unicode replacement character � (U+FFFD), to clearly indicate that data was lost:

foo.decode("utf-8", "replace")
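For example, given input containing a byte that can never occur in valid UTF-8 (the sample string here is just an illustration):

```python
raw = b"I \xff Unicode"          # \xff is invalid in any UTF-8 sequence

ignored = raw.decode("utf-8", "ignore")     # invalid byte silently dropped
replaced = raw.decode("utf-8", "replace")   # invalid byte becomes U+FFFD

# ignored  == u"I  Unicode"
# replaced == u"I \ufffd Unicode"
```

"replace" preserves the evidence that something went wrong; "ignore" hides it.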

Normalization

There are multiple ways to display many characters, producing situations where a program treats two values as different even though they appear identical to the user. We can avoid this problem by normalizing everything to a consistent internal representation for comparison:

>>> from unicodedata import normalize
>>> precomposed = u"\N{LATIN SMALL LETTER E WITH ACUTE}"
>>> decomposed = u"\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}"
>>> precomposed == decomposed
False
>>> normalize("NFC", precomposed) == normalize("NFC", decomposed)
True
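In practice it's convenient to wrap this in a small comparison helper (the name is hypothetical, shown for illustration):

```python
from unicodedata import normalize

def nfc_equal(a, b):
    # Compare two strings after normalizing both to NFC, so that precomposed
    # and decomposed spellings of the same text compare equal.
    return normalize("NFC", a) == normalize("NFC", b)
```

For example, nfc_equal(u"\u00e9", u"e\u0301") is True even though plain == says the strings differ.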

Other stdlib features

Comparing numbers with Unicode

Python's unicodedata module has other features for working with Unicode text. Several of them retrieve information about individual characters, as in this script, which generated the characters used in the numeric equivalence example:

from __future__ import print_function, unicode_literals
from unicodedata import decimal
import sys

if sys.version_info[0] == 3:
    unichr = chr  # unichr was renamed chr in Python 3

for i in range(0, sys.maxunicode):
    u = unichr(i)
    # decimal() returns the character's decimal value, or the supplied
    # default (None) when the character has no numeric meaning:
    if decimal(u, None) == 5:
        print(u.encode("utf-8"))

Unicode Best Practices

re vs. regex

If you need to process text using regular expressions, you'll quickly learn that even in Python 3 the built-in re module has significant drawbacks. The regex module, its intended successor, is being developed as a standalone library for ease of testing, and it has much better Unicode support:

>>> print(re.match(r"(?i)strasse", u"stra\N{LATIN SMALL LETTER SHARP S}e", flags=re.UNICODE))
None
>>> regex.match(ur"(?iV1)strasse", u"stra\N{LATIN SMALL LETTER SHARP S}e")
<_regex.Match object at …>
>>> regex.match(ur"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE")
<_regex.Match object at …>

regex's smart behaviour depends on either setting the Unicode flag or using a Unicode string as the pattern!

Unicode in Django

A bytestring does not carry any information with it about its encoding. For that reason, we have to make an assumption, and Django assumes that all bytestrings are in UTF-8.

In most cases when Django is dealing with strings, it will convert them to Unicode strings before doing anything else. So, as a general rule, if you pass in a bytestring, be prepared to receive a Unicode string back in the result.

Handy Utilities

See the official Django Unicode documentation for more information
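Django's django.utils.encoding module provides helpers such as force_text (force_unicode on older versions) for this conversion. Conceptually they behave something like this simplified sketch (a hypothetical stand-in, not Django's actual code, written in Python 3 spelling):

```python
def to_text(value, encoding="utf-8"):
    # Simplified sketch of Django's conversion rule: bytestrings are assumed
    # to be UTF-8 and decoded; anything else is coerced to text.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)
```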

Unicode in databases

Django attempts to use Unicode for communication with the database. There's only one catch but it's significant: you must ensure that the database is configured to use a Unicode encoding when you create it. Otherwise you'll probably be able to store data but queries may be erratic - remember collation? - and data can be lost or truncated: thanks to UTF-8's variable length, a VARCHAR(20) could require up to 80 bytes to store!

Unicode in databases: MySQL

Per database

CREATE DATABASE my_site CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Server-Wide (recommended)

[mysqld]
character-set-server = utf8
collation-server = utf8_unicode_ci

Note that MySQL's utf8 character set only covers the Basic Multilingual Plane; on MySQL 5.5+ prefer utf8mb4 for full Unicode coverage.

See http://dev.mysql.com/doc/refman/5.1/en/globalization.html

Unicode in databases: PostgreSQL

Per database

CREATE DATABASE my_app WITH ENCODING 'UTF8';

or, from the shell:

createdb -E UTF8 my_app

Server-Wide (recommended)

initdb -E UTF8

See PostgreSQL: Character Set Support

Normalizing your model data

There is no one right approach: you need to validate your inputs at every point but the details vary from project to project.

My fledgling django-i18n-utils package provides an example of one approach: UnicodeNormalizerMixin uses the standard model validation clean_fields() method to normalize all text fields to Unicode NFC:

class Person(UnicodeNormalizerMixin, models.Model):
    name = models.CharField(…)

This works well as long as you scrupulously call clean() before saving a model. Since this is important for other reasons, make it a key point in your testing.
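The core idea can be sketched without Django (an illustrative stand-alone version, not the actual django-i18n-utils code; names here are hypothetical):

```python
from unicodedata import normalize

def normalize_text_fields(obj, field_names):
    # Normalize every named text attribute on obj to NFC, mirroring what a
    # clean_fields() override would do for the model's text fields.
    for name in field_names:
        value = getattr(obj, name)
        if isinstance(value, str):
            setattr(obj, name, normalize("NFC", value))

class Person(object):
    # Stand-in for a model instance (illustration only).
    def __init__(self, name):
        self.name = name

p = Person(u"Zoe\u0308")            # decomposed spelling of "Zoë"
normalize_text_fields(p, ["name"])  # p.name is now the precomposed form
```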

Unicode Security: Homoglyph Attacks

Unicode has a number of distinct characters which are visually similar or identical. This is most concerning for domain names, where an IDN homograph attack allows an attacker to register something like http://www.pаypal.com/ (note the Cyrillic а), but the concern applies any time there's something to gain from spoofing.

Solving this requires some combination of restricting your data and alerting the user.

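One crude way to restrict data is to reject strings that mix scripts, sketched here using character names from unicodedata (an illustration, not a complete defense):

```python
import unicodedata

def scripts_used(text):
    # Bucket alphabetic characters by the first word of their Unicode name
    # (e.g. "LATIN", "CYRILLIC") - a crude proxy for the character's script.
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

# A spoofed domain mixing a Cyrillic "а" into Latin text stands out:
# scripts_used(u"p\u0430ypal") == {"LATIN", "CYRILLIC"}
```

A real implementation would consult actual script properties rather than name prefixes, but the principle is the same: flag or reject identifiers whose characters come from more than one script.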