Encodings, or: Validating UTF8
Contrary to what some people expect, not every file is valid UTF-8. If you mangle your encodings, your files may be rejected or cause problems with various software, such as the CurseForge packager or WoW.

'''Note: The repository hooks validate the encoding of Lua files before accepting them. WoW addons must encode their files in UTF-8. WAR addons must encode their files in either little-endian UTF-16 with a BOM, or plain ASCII.'''

==Notation==
Most numbers are in decimal; byte values displayed as <code>AB CD</code> are hexadecimal.

==Encodings in a nutshell==
At the most fundamental level, computers encode information in chunks called ''bytes'' (or more precisely, octets; we shall assume that bytes are 8 bits). Most traditional encodings use just one byte per character. This means there are 256 ''code points'' that can be mapped to ''characters'' (or digits, punctuation, Kanji, runes, whatever; we'll call them characters).

While there are dozens of encodings still in use, the following are especially important to this discussion:
* '''[http://en.wikipedia.org/wiki/ASCII ASCII]''' is the basis of most western encodings today. Due to its 7-bit heritage, it only maps 128 code points: 0–127.
* The '''[http://en.wikipedia.org/wiki/ISO_8859 ISO 8859]''' family of encodings are the most frequently used western encodings. Its parts (ISO 8859-1 through 8859-16), many of them also known as "Latin-1", "Latin-2", etc., are very widely used. Code points 0–127 agree with ASCII; the rest are used for various characters, such that it is possible, e.g., to type most western European languages in Latin-1.
: And here's the first catch: ''the Latin-N encodings do not cover all 256 code points''. Specifically, 128–159 are ''unassigned''. The encodings formally known as ISO-8859-1 etc. (note the extra dash) assign control characters to this range, but see below.
* '''[http://en.wikipedia.org/wiki/Windows-1252 Windows-1252]''' in turn is a superset of ISO Latin-1 (ISO 8859-1) that maps printable characters into the range 128–159. This means it is ''not'' compatible with ISO-8859-1 (note the dash)! As the name suggests, it is in wide use on western European Windows installations.

But 256 possible characters are quite restrictive. For example, it is not possible to write Chinese or Japanese in their native writing systems (by a wide margin). Along came '''Unicode''' and the '''[http://en.wikipedia.org/wiki/Universal_Character_Set UCS]''' (Universal Character Set, formally ISO/IEC 10646), which intend to provide characters for ''all'' writing systems in use on Earth. This also means that a different mapping to bytes has to be used. Common variants include:
* '''[http://en.wikipedia.org/wiki/UTF-16 UTF-16]''', which uses two bytes (16 bits) per code unit. This limited the original standard to 65536 code points; characters beyond that range are encoded as a pair of 16-bit units (a "surrogate pair").
* '''[http://en.wikipedia.org/wiki/UTF-32 UTF-32]''', which uses four bytes per character and can therefore represent code points beyond the 65k limit directly. (Unicode now has "room" for about 1.1 million code points.)
* The UTF-1/7/8 transformations, of which '''[http://en.wikipedia.org/wiki/UTF-8 UTF-8]''' is by far the most commonly used.

Unicode notably includes all of Latin-1 on the 256 lowest-numbered code points, which also means that it contains all ASCII characters on the same code points as in ASCII itself.

UTF-16/32 unfortunately require special support from software to be handled sanely. For example, the 'A' character would be encoded in (little-endian) UTF-32 as the bytes <code>41 00 00 00</code>, and a lot of unprepared software will choke on the [http://en.wikipedia.org/wiki/Null_character null bytes].
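To make the byte-level differences concrete, here is a minimal sketch that encodes the same characters in several of the encodings discussed so far and prints the bytes in hexadecimal. It assumes Python 3; the helper name <code>hex_bytes</code> is made up for this illustration.

```python
def hex_bytes(text, encoding):
    """Return the bytes of `text` in `encoding` as space-separated hex values."""
    return " ".join("%02X" % b for b in text.encode(encoding))

# The ASCII letter 'A' in one-byte and multi-byte encodings.
for encoding in ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le"):
    print("%-10s %s" % (encoding, hex_bytes("A", encoding)))

# A non-ASCII character (a-umlaut, U+00E4): the single-byte encodings agree
# with each other, but UTF-8 needs two bytes, both in the range 128-255.
for encoding in ("latin-1", "cp1252", "utf-8", "utf-16-le"):
    print("%-10s %s" % (encoding, hex_bytes("\u00e4", encoding)))
```

The first loop shows the null bytes that UTF-16 and UTF-32 introduce even for plain ASCII text; the second shows where Latin-1 and UTF-8 part ways.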
Furthermore, a program cannot ''reliably'' tell an ASCII file from a UTF-16 one, because almost all even-sized ASCII files are valid (though probably nonsensical) UTF-16.

'''UTF-8''' circumvents this problem by guaranteeing the following properties:
* All ASCII code points (i.e. 0–127) map to the corresponding ASCII bytes (i.e. 0–127).
* All other code points map to a ''series of bytes'', all of which have values in the range 128–255.
* No encoding of a character is contained in a (longer) encoding of another character.
* No encoding contains <code>FE</code> or <code>FF</code>.
This means that it is ASCII-compatible in the same sense that Latin-1 is: software that expects input to be "ASCII and maybe some higher bytes" will be able to cope with it.

However, ''not all byte sequences are valid UTF-8''. The details are linked to the last three properties: roughly speaking, <code>00</code>–<code>7F</code> correspond to ASCII, <code>C2</code>–<code>F4</code> are legal at the start of a multibyte sequence, and <code>80</code>–<code>BF</code> constitute the rest of the multibyte sequence. (There are some exceptions: a few leader/continuation combinations are also invalid because they would yield overlong encodings, surrogate code points, or code points beyond the Unicode range.) <code>C0</code> and <code>C1</code> could only start overlong encodings and are invalid. <code>F5</code>–<code>FD</code> are invalid as well: in the original UTF-8 design they started longer sequences (including 5- and 6-byte ones) for code points beyond the current limit of Unicode. ''Other combinations of these bytes are invalid.''

This concludes the general remarks. You should remember the following points:

'''Summary:''' ASCII only maps half the possible byte values. Latin-1, Win-1252 and UTF-8 are all different supersets of ASCII. UTF-8/16 can encode every language in use. Not every sequence of bytes is valid UTF-8. UTF-16 needs special support and is not a superset of ASCII.

==Encoding detection and BOM==
Unicode provides a special means to detect the specific encoding (and endianness) that a file was written in: the [http://en.wikipedia.org/wiki/Byte_Order_Mark Byte-Order Mark] (BOM).
This is just the special code point U+FEFF, which is invisible as a character (it is a zero-width non-breaking space) but serves to distinguish the encodings via its byte representation:
* <code>EF BB BF</code> for UTF-8,
* <code>FF FE</code> for little-endian UTF-16,
* <code>FF FE 00 00</code> for little-endian UTF-32, etc.
Thus, a file can be marked as UTF-8 simply by putting a BOM at the very beginning. When the file is erroneously interpreted as a different encoding, the BOM will usually appear as garbage characters.

Many programs also use a heuristic to detect UTF-8, exploiting the property that not all byte sequences are valid: simply attempt to decode the file as UTF-8; if that fails, it's probably Latin-1.

==How does this affect my addon?==
Because Unicode can express all languages, WoW and WAR support Unicode representations for strings in Lua code.
* WoW expects <code>.lua</code> files to be encoded in UTF-8.
* WAR expects <code>.lua</code> files to be encoded in ASCII or ''little-endian UTF-16''.

Unfortunately there are still many editors around that cannot correctly handle some encodings. Before the advent of UTF, this wasn't so bad: many files would simply look wrong if read with the wrong encoding, but in any half-sane editor the damage was limited.

With UTF-8, it's a different story. Suppose you open a file encoded in UTF-8 in an editor which treats the contents as Latin-1. Now you cut and paste some text, perhaps containing umlauts, into it. What happens? Most likely the editor will happily save the Latin-1 bytes into the file. Remember that not all byte sequences are valid UTF-8? The file is now ''most likely corrupted''. The author has yet to see a UTF-8 (non-ASCII) byte sequence that makes sense in Latin-1 (in any language); conversely, hardly any sensible Latin-1 text with non-ASCII characters is valid UTF-8!
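The BOM signatures, the decode-and-see heuristic and the corruption scenario described above can all be reproduced in a few lines. This is only a sketch, assuming Python 3; the names <code>sniff_bom</code> and <code>looks_like_utf8</code> and the sample Lua lines are invented for the illustration.

```python
import codecs

# BOM signatures from the list above. UTF-32 has to be tested before UTF-16,
# because FF FE is a prefix of FF FE 00 00.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

def looks_like_utf8(data):
    """Heuristic from the text: the bytes are 'probably UTF-8' if they decode."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A line that was correctly saved as UTF-8 (hypothetical file content) ...
original = 'L["Zur\u00fcck"] = true\n'.encode("utf-8")
# ... and a line pasted in by an editor that writes Latin-1 bytes.
pasted = 'L["\u00d6ffnen"] = true\n'.encode("latin-1")
corrupted = original + pasted

print(looks_like_utf8(original))         # True
print(looks_like_utf8(corrupted))        # False
print(repr(original.decode("latin-1")))  # how the UTF-8 line looks when misread as Latin-1
print(sniff_bom(codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")))
```

The Latin-1 byte for the umlaut is followed by a plain ASCII letter rather than a continuation byte, which is exactly the kind of sequence a UTF-8 decoder rejects.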
Similar remarks hold for UTF-16, but there the file looks so damaged when read as a Latin encoding (it will have a null byte at almost every odd byte offset in the file) that few people would attempt to edit it under such conditions. However, if one does insert a single byte at each of two different positions, everything between them will be misaligned and therefore garbled!

To fix encoding corruption, you need to identify the offending bytes (see the next section). Then attempt to guess the encoding they are in (Win-1252 is usually a good starting point), look up what characters they represented, and insert those characters with a UTF-aware editor. Of course, opening the file in the right encoding may take some convincing, because the editor may detect it as damaged.

Other sources of badly encoded files are more obvious; for example, the author helped debug one case where the Lua sources were generated by a PHP script that simply used Latin-1 output.

'''Summary:''' Editing a UTF-8 (UTF-16) encoded file in a non-UTF-aware editor will most likely leave it invalid (garbled, resp.).

==Checking for valid UTF-8/16==
This is not easily possible with tools that are provided with Windows. You can, however, install the [http://www.python.org Python] programming language, open an interactive Python window and use the following commands:
 >>> s = open(r"c:\path\to\file", "rb").read()  # "rb": read the raw bytes unmodified
 >>> u = s.decode("utf8") # or "utf-16-le"
This loads the entire file into RAM and attempts to decode it as UTF-8 (or little-endian UTF-16).
If the file is not valid, you will get a message along these lines:
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.5/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode bytes in position 13-14: invalid data
Unless you have an editor that can jump to a certain byte position, you can slice a bit of the string to get some context:
 >>> s[10:20]
 'kc\x1c\xd6\x00\x82^\xff\xff\xff'
In the case of UTF-16 for WAR, we also mandate a BOM. To check for its presence, you can look at the first Unicode character in the decoded string:
 >>> u[0]
 u'\ufeff'
Any other result is not the BOM, and you need to insert it into the file.

==Closing remarks==
Note that it is generally a bad idea to debug encoding problems over the internet. Pastebins are especially useless: the posting browser, the pastebin software and the viewing browser all have a chance to switch encodings, and they usually do. E-mail is slightly better, but some mail clients (MUAs) are broken too. Similarly, IRC provides few guarantees, though many clients (with the notable exception of the widely used mIRC) now default to UTF-8.

If you must discuss such byte-level issues, the most reliable tool is a hex dump. OS X and Linux users can use the powerful <code>xxd</code> utility. On Windows, you can resort to Python (if you installed it for the last section) and use its own string representation, which escapes the problematic bytes as in <code>'\xab'</code>.
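For readers without <code>xxd</code>, a rough hex dump can also be produced with Python itself. The sketch below assumes Python 3; the function name <code>hex_dump</code> and the 16-bytes-per-row layout are choices made for this illustration, not the output format of any particular tool.

```python
def hex_dump(data, width=16):
    """Return a simple hex dump of `data` (bytes): offset, hex values, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        lines.append("%08x  %-*s  %s" % (offset, width * 3 - 1, hex_part, text_part))
    return "\n".join(lines)

# Example: a BOM followed by "Hi", encoded as little-endian UTF-16.
sample = "\ufeffHi".encode("utf-16-le")
print(hex_dump(sample))
```

To dump a real file, read it in binary mode first, e.g. <code>hex_dump(open(path, "rb").read())</code>, where <code>path</code> stands for your own file.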