There's a persistent misconception that you need Python 3 to support Unicode. This is untrue: the default "" may produce a str (i.e. bytes without a specified encoding) instance but simply prefixing any string with u produces a full unicode object.
Python 2 will even helpfully convert them when you mix the two:
>>> type("foo")
Background: http://docs.python.org/howto/unicode.html
To preserve compatibility with existing code, Python 2 assumes byte-strings are ASCII. This means that any input source is a risk of UnicodeDecodeErrors:
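A minimal sketch of how that surfaces, written in Python 3 syntax where the ASCII assumption has to be spelled out explicitly. The sample bytes are an assumption chosen for illustration:

```python
# UTF-8 bytes for "café", as they might arrive from a file or a socket.
raw = b"caf\xc3\xa9"

# Treating the bytes as ASCII fails at the first byte above 0x7F.
try:
    raw.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first undecodable byte

# Naming the real encoding turns five bytes into four characters of text.
text = raw.decode("utf-8")
print(failed_at, len(raw), len(text))
```

The failure only appears once non-ASCII data shows up, which is why code like this can pass tests written with English-only fixtures.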
Everything works if you know the encoding and remember to decode first:
>>> s.decode("utf-8")
It also “works” if you simply avoid treating bytes as text until you're prepared to deal with them. This was a problem in django-localeurl when requests were received with incorrectly encoded data - the rest of the system handled it but the redirect logic would choke on a UnicodeDecodeError. This was fixed by doing less.
Learning from what was problematic in Python 2, Python 3 makes Unicode the default string type, with a separate bytes type for raw I/O. You can't mix the two without an explicit conversion:
Python 2 fails later; Python 3 forces you to fix it now
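As a small illustration of that strictness, the following Python 3 sketch tries to concatenate text and bytes. The sample values are assumptions made up for the example:

```python
text = "caf\u00e9"
raw = text.encode("utf-8")

# Mixing str and bytes is rejected immediately, whatever the contents.
try:
    combined = text + raw
except TypeError:
    combined = None

# An explicit decode states the encoding and makes the operation legal.
combined_ok = text + raw.decode("utf-8")
print(combined, combined_ok)
```

The error is raised even when both values are pure ASCII, so the mistake cannot hide until unusual input arrives.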
These rules are the same for Python 2 and 3, so you won't need to change anything when Django migrates.
The most basic form applies to str and anything str-like: decode everything as soon as you see it and encode it again only on output:
s = foo.read().decode("utf-8")
print >>foo, s.encode("utf-8")
For files, you might want to use the codecs module so you can read from the file normally and receive Unicode characters instead of byte-strings:
>>> import codecs
In general the best policy is to reject invalid text so the user can correct it immediately. However, in a batch context you might need to process what you can. Python allows you to “ignore” or strip anything which fails to decode:
foo.decode("utf-8", "ignore")
It may be preferable to instead replace each invalid character with the Unicode Replacement character � to clearly indicate that data was lost:
foo.decode("utf-8", "replace")
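To compare the three behaviours side by side, here is a sketch in Python 3 syntax. The damaged input and the temporary file name are assumptions made up for the example, and the last part shows that codecs.open accepts the same handler names:

```python
import codecs
import os
import tempfile

# b"\xff" can never occur in valid UTF-8, so this input is damaged.
damaged = b"abc\xffdef"

try:
    damaged.decode("utf-8")  # the default error handler is "strict"
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "rejected"

ignored = damaged.decode("utf-8", "ignore")    # drops the bad byte silently
replaced = damaged.decode("utf-8", "replace")  # substitutes U+FFFD

# The same handler names work when reading a file through codecs.open.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "damaged.txt")  # hypothetical sample file
    with open(path, "wb") as f:
        f.write(damaged)
    with codecs.open(path, "r", encoding="utf-8", errors="replace") as f:
        from_file = f.read()

print(strict_result, repr(ignored), repr(replaced), repr(from_file))
```

With "ignore" nothing in the result reveals that a byte was dropped, which is the argument for preferring "replace" when you cannot reject the input outright.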
There are multiple ways to display many characters, producing situations where a program treats two values as different even though they appear identical to the user. We can avoid this problem by normalizing everything to a consistent internal representation for comparison:
>>> from unicodedata import normalize
Python's unicodedata module has other features for working with Unicode text. Several of them allow you to retrieve information about characters, as in this example which generated the different characters in the numeric equivalence example:
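from __future__ import print_function, unicode_literals
from unicodedata import decimal
import sys

# Python 3 has no unichr(); chr() covers the full Unicode range there.
if sys.version_info[0] == 3:
    unichr = chr

# Print every code point whose decimal digit value is 5.
for i in range(sys.maxunicode + 1):
    u = unichr(i)
    if decimal(u, None) == 5:
        print(u.encode("utf-8"))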
If you need to process text using regular expressions, you'll quickly learn that even with Python 3 there are significant drawbacks with the built-in re module. The regex module is the successor being developed as a standalone library for ease of testing and it has much better Unicode support:
>>> re.match(r"(?i)strasse", u"stra\N{LATIN SMALL LETTER SHARP S}e", flags=re.UNICODE)
None
>>> regex.match(ur"(?iV1)strasse", u"stra\N{LATIN SMALL LETTER SHARP S}e")
<_regex.Match object at …>
>>> regex.match(ur"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE")
<_regex.Match object at …>
regex's smart behaviour depends on either setting the Unicode flag or using a Unicode string as the pattern!
A bytestring does not carry any information with it about its encoding. For that reason, we have to make an assumption, and Django assumes that all bytestrings are in UTF-8.
…
In most cases when Django is dealing with strings, it will convert them to Unicode strings before doing anything else. So, as a general rule, if you pass in a bytestring, be prepared to receive a Unicode string back in the result.
smart_unicode: returns a unicode instance. Set strings_only=True to leave numbers, booleans and None unconverted.
force_unicode: like smart_unicode except that lazy translations will be processed. Set strings_only=True to leave numbers, booleans and None unconverted.
See the official Django Unicode documentation for more information
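The idea behind these helpers can be sketched without Django. The function below, to_text, is a hypothetical stand-in written for this illustration in Python 3 syntax. It is not Django's implementation and it ignores lazy translations entirely:

```python
def to_text(value, encoding="utf-8", strings_only=False):
    """Hypothetical helper (not Django's code): coerce a value to text."""
    # Optionally pass None, booleans and numbers through untouched.
    if strings_only and (value is None or isinstance(value, (bool, int, float))):
        return value
    # Bytestrings carry no encoding, so one has to be assumed.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


print(to_text(b"caf\xc3\xa9"), to_text(42), to_text(42, strings_only=True))
```

The UTF-8 default mirrors the assumption quoted above. Anything that arrives in another encoding has to be decoded by the caller, who is the only one in a position to know it.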
Django attempts to use Unicode for communication with the
database. There's only one catch but it's significant: you
must ensure that the database is configured to use a
Unicode encoding when you create it. Otherwise you'll probably
be able to store data but queries may be erratic - remember
collation? - and data can be lost or truncated: thanks to
UTF-8's variable length, a VARCHAR(20) could require up to 80 bytes to store!
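The arithmetic behind that figure can be checked directly. A small sketch in Python 3 syntax, with sample characters chosen for illustration:

```python
# One representative character for each UTF-8 encoded length (1 to 4 bytes).
samples = {
    "ASCII letter": "a",
    "accented Latin": "\u00e9",
    "CJK ideograph": "\u4e2d",
    "emoji": "\U0001F600",
}

# Bytes needed to store 20 copies of each character as UTF-8.
bytes_for_20_chars = {
    label: len((ch * 20).encode("utf-8")) for label, ch in samples.items()
}
print(bytes_for_20_chars)
```

Twenty characters therefore occupy anywhere from 20 to 80 bytes depending on what the user typed.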
CREATE DATABASE my_site CHARACTER SET utf8 COLLATE utf8_unicode_ci;
[mysqld]
character-set-server = utf8
collation-server = utf8_unicode_ci
See http://dev.mysql.com/doc/refman/5.1/en/globalization.html
CREATE DATABASE my_app WITH ENCODING 'UTF8';
or
createdb -E UTF8 myapp
initdb -E UTF8
There is no one right approach: you need to validate your inputs at every point but the details vary from project to project.
My fledgling django-i18n-utils package provides an example of one approach: UnicodeNormalizerMixin uses the standard model validation clean_fields() method to normalize all text fields to Unicode NFC:
class Person(UnicodeNormalizerMixin, models.Model):
    name = models.CharField(…)
This works well as long as you scrupulously call clean() before saving a model. Since this is important for other reasons, make it a key point in testing.
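To see why normalizing to NFC at validation time matters, here is a minimal sketch outside Django in Python 3 syntax. normalize_fields is a hypothetical helper that imitates the idea on a plain dict. It is not the package's actual code:

```python
from unicodedata import normalize


def normalize_fields(record):
    """Hypothetical helper: NFC-normalize every text value in a dict."""
    return {
        key: normalize("NFC", value) if isinstance(value, str) else value
        for key, value in record.items()
    }


composed = {"name": "Zo\u00eb", "age": 30}     # ë as a single code point
decomposed = {"name": "Zoe\u0308", "age": 30}  # e + COMBINING DIAERESIS

equal_before = composed == decomposed
equal_after = normalize_fields(composed) == normalize_fields(decomposed)
print(equal_before, equal_after)
```

Both names look the same on screen, but only the normalized versions compare equal, which is what lookups and uniqueness checks depend on.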
Unicode has a number of characters which need to be different for some reason but are visually quite similar. This is most concerning in the case of domain names if an IDN homograph attack allows an attacker to register something like http://www.pаypal.com/ but the concern applies any time there's something to gain from spoofing.
Solving this requires some combination of restricting your data or alerting the user:
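One way to alert the user is to flag text that mixes scripts. The sketch below, in Python 3 syntax, is a rough heuristic under a stated assumption: it guesses a letter's script from the first word of its Unicode character name, because the standard unicodedata module does not expose the real Script property. The function names are hypothetical:

```python
import unicodedata


def scripts_used(text):
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def looks_mixed(text):
    """True when the letters appear to come from more than one script."""
    return len(scripts_used(text)) > 1


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(looks_mixed(genuine), looks_mixed(spoofed))
```

A check like this produces false positives for legitimately mixed text, so it is better suited to raising a warning than to rejecting input.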