vankuik.nl: Internationalization

WARNING WORK IN PROGRESS

"Geïrriteerd"

Are you a West-European software developer who:

Is irritated that your accents frequently are mangled?
Can't be bothered to read huge stacks of paper about it?
Wants to know the absolute bare minimum on this issue?
Has heard of US-ASCII, code page, character encoding, Unicode, ISO-8859-1 and UTF-8 before, but has the feeling it doesn't really add up?

Read on for your first aid in this area... WARNING: high information density, no fluffy bits.

Definitions

Character repertoire: Just a bunch of characters that someone decided should be together in a collection. It might include the ø or the €, or it might not.
Character encoding: Agreement on which byte (or bytes) mean which character. Could be really simple, like: we use one byte for one character (which means an 8-bit encoding, with 256 possible characters). Or: we use two bytes for each character.
Character set: Can mean various things, but most of the time, means the repertoire plus an encoding.
Code page: Kind of an old thing. Comes from the DOS days. Can mean various things, but most of the time, means the repertoire plus an 8-bit encoding where the first 128 characters are plain old ASCII and the rest is something else.
Font: These are the graphical pictures (glyphs) that make up the character. It's entirely possible that the designer of the font didn't include glyphs for certain characters. Hell, some fonts only have glyphs for capital A to Z!
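To make the difference between repertoire and encoding concrete, here is a minimal Python sketch. The choice of characters and encodings is just an assumption for illustration; the codec names are the ones Python ships with.

```python
# One character, several encodings: the bytes differ, the character doesn't.
char = "ë"  # U+00EB, as in "patiënt"

latin1_bytes = char.encode("iso-8859-1")  # 8-bit encoding: one byte
utf8_bytes = char.encode("utf-8")         # variable length: two bytes here
utf16_bytes = char.encode("utf-16-be")    # 16-bit units: two bytes here

print(latin1_bytes.hex(), utf8_bytes.hex(), utf16_bytes.hex())  # eb c3ab 00eb

# A repertoire can simply lack a character: ISO-8859-1 has no Euro sign.
try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
print(euro_in_latin1)  # False
```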

Important standards

The current situation is that we're heading for Unicode, which solves most problems since it can contain any character. First of all, Unicode is a repertoire, which keeps expanding to include all sorts of languages and scripts (even Klingon was proposed, though it was rejected). Secondly, there are a number of encodings defined for Unicode. The most important one is UTF-8. This uses one byte for the ASCII characters and two to four bytes for everything else. That makes it compatible with ASCII: a plain ASCII file is also a valid UTF-8 file.
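A quick way to see UTF-8's variable length is to count bytes per character. This is only a sketch; the sample characters are my own pick.

```python
# UTF-8 spends more bytes as characters get further away from ASCII.
samples = ["A", "ë", "€", "\U0001F600"]  # ASCII, Latin-1, Euro sign, an emoji
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(list(lengths.values()))  # [1, 2, 3, 4]

# ASCII-only text is byte-for-byte identical in ASCII and in UTF-8.
text = "plain old ASCII"
ascii_compatible = text.encode("ascii") == text.encode("utf-8")
print(ascii_compatible)  # True
```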

For Western people, the most important standard is the ISO-8859 family and its Windows equivalents. Each member of this ISO family is an 8-bit encoding where the first 128 characters are plain old ASCII and the rest is something language-specific. We have ISO-8859-1 for Western European languages, ISO-8859-2 for Central and Eastern European ones, et cetera. Later, ISO-8859-15 was introduced as a replacement for ISO-8859-1 because of the Euro sign and some French stuff (such as œ).

Then there are the Windows encodings. There are equivalents: ISO-8859-1 becomes Windows-1252 a.k.a. cp1252 a.k.a. WinLatin1. The Windows encodings contain extra characters like smart/curly quotes, the TM sign, and the en- and em-dash. They have been put in the range that ISO-8859-1 reserves for control characters (0x80 to 0x9F). There are more; for Greek, Windows has Windows-1253 a.k.a. WinGreek, and the list goes on.
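The overlap between the two is easy to demonstrate. The sketch below decodes the same bytes both ways; the byte values are an example I made up.

```python
# Bytes as a Windows program might write them: curly quotes and a Euro sign.
raw = bytes([0x93, 0x48, 0x69, 0x94, 0x20, 0x80])

as_cp1252 = raw.decode("cp1252")      # curly quotes, "Hi", space, Euro sign
as_latin1 = raw.decode("iso-8859-1")  # same bytes read as control characters

print(ascii(as_cp1252))  # '\u201cHi\u201d \u20ac'
print(ascii(as_latin1))  # '\x93Hi\x94 \x80'
```

The ASCII letters survive either way; only the bytes in the 0x80 to 0x9F range change meaning.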

An example: I'm Dutch so I want to write words like patiënt, or geïrriteerd or when it's the end of the year and my boss reluctantly shares some of his riches: tantièmes. That's why my Windows PC uses the WinLatin1 repertoire/encoding (kind of a lie, see later) and my Linux PC uses the Unicode repertoire, with encoding UTF-8.

Which encoding did we use?

Suppose we have a piece of text. Maybe we got it from the local network share, or maybe from a USB stick. Or maybe it was the result of an internet user pushing buttons on our website and submitting it to our script.

Whatever the source, this piece of text can be encoded in various ways. For example, if the creator was Polish, he may have wanted to tell us he really enjoyed his stay in Łódź. If nothing outside the text itself tells us which encoding was used, things go wrong. Maybe our PC assumes ISO-8859-1 by default while the Polish guy used ISO-8859-2. Then our encoding has different characters in the places of his Ł and his ź, and the screen doesn't show what he typed.

Since most encodings use US-ASCII for the first 128 characters, the plain ASCII stuff is OK. But outside of that, we need to know the encoding in which the Polish guy typed his document. This is part of the metadata, the data about the data.
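The Łódź mishap can be reproduced in a few lines of Python. This is a sketch of the scenario described above, nothing more.

```python
# The Polish author saves his text as ISO-8859-2 ...
sent = "Łódź".encode("iso-8859-2")
print(sent.hex())  # a3f364bc

# ... and we wrongly assume ISO-8859-1 when reading it.
garbled = sent.decode("iso-8859-1")
print(garbled)  # £ód¼

# With the right metadata, the same bytes come out fine.
correct = sent.decode("iso-8859-2")
print(correct)  # Łódź
```

Note that the plain ASCII "d" and the ó (which happens to sit at the same position in both encodings) survive; only Ł and ź are mangled.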

How is this solved?

The encoding of a piece of data is really metadata -- data about data. So, where is this metadata stored?

For plain text documents, this metadata isn't stored anywhere and it can't be determined reliably from the contents. When you open the file, you (or your editor) must decide on the encoding. The same goes for filenames: if my Linux PC uses UTF-8 and I copy a file with accents in its name to my USB stick and read it on Windows, I see funny characters.
For the web, there are three possibilities:
- Either the web server is configured to send out a so-called HTTP header (with the encoding) before the actual page is sent.
- It's also possible to add a <meta> tag. This tells us in which encoding the page is.
- There's a default and that's whatever the browser is set to.
With XML, the default is UTF-8, but another encoding can be declared explicitly in the XML declaration at the top of the file:

  <?xml version="1.0" encoding="ISO-8859-1"?>

With mail, we have MIME which has headers defining the encoding.
With Word documents, the document format includes an encoding field.
Even the filenames of the files on your local hard disk have an encoding. That means that when we access a file share, we have to know its encoding and pass it as a parameter (but only when it differs from ours).
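As a rough sketch of how software reads this metadata, the Python below pulls the charset out of an HTTP Content-Type value and lets an XML parser honour the XML declaration. The helper name charset_from_content_type and the fallback of ISO-8859-1 are assumptions of mine, not part of any standard API.

```python
import xml.etree.ElementTree as ET
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    Hypothetical helper; the default is an assumption for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


http_charset = charset_from_content_type("text/html; charset=UTF-8")
no_charset = charset_from_content_type("text/html")
print(http_charset, no_charset)  # utf-8 iso-8859-1

# The XML parser reads the declaration and decodes the bytes accordingly.
xml_bytes = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>pati\xebnt</word>'
word = ET.fromstring(xml_bytes).text
print(word)  # patiënt
```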

TODO

Inside Windows C code
Inside Linux C code
Inside Java software
Talk about clipboards copy/paste
Mention that PHP strings aren't Unicode by default
Do testing with Perl

Even if there is a standard...

Even if there is a standard, it doesn't mean nothing can go wrong. Some examples follow.

In Windows 2000, when you save a text file in Notepad as UTF-8, the file data is prefixed with a special marker, the byte order mark (BOM), so it can be recognized as Unicode. (Notepad's plain "Unicode" option actually means UTF-16, which gets a BOM as well.) Most other systems don't write this marker for UTF-8, and some don't expect it.
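For UTF-8, the prefix in question is the three bytes EF BB BF. A minimal Python sketch of what happens when a reader does or doesn't know about it (the sample word is my own):

```python
import codecs

# Simulate a file saved by Notepad as UTF-8: marker first, then the text.
data = codecs.BOM_UTF8 + "geïrriteerd".encode("utf-8")
print(data[:3].hex())  # efbbbf

# A plain UTF-8 decoder keeps the marker as an invisible first character.
naive = data.decode("utf-8")
print(ascii(naive[0]))  # '\ufeff'

# Python's "utf-8-sig" codec strips the marker when it is present.
clean = data.decode("utf-8-sig")
print(clean)  # geïrriteerd
```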

With e-mail, there are other problems. MIME defines a standard way to declare the encoding of message parts and attachments. But what about the filenames of those attachments? Most mail agents encode the filename the same way as the rest of the message. But not all.
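To give an idea of how a filename with accents can travel through mail, here is a sketch using Python's standard email package, which (as far as I can tell) writes such filenames in the percent-encoded form from RFC 2231. The subject, body and filename are made up, and other mail agents may well do it differently.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Bonus"
msg.set_content("See attachment.")
msg.add_attachment(
    b"data",
    maintype="application",
    subtype="octet-stream",
    filename="tantièmes.txt",  # hypothetical attachment name
)

# On the wire the whole message is plain ASCII, filename included.
wire = msg.as_bytes()
print(wire.isascii())  # True

# A receiver that understands the same convention gets the name back.
parsed = message_from_bytes(wire, policy=policy.default)
names = [part.get_filename() for part in parsed.iter_attachments()]
print(names)  # ['tantièmes.txt']
```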

This can go horribly wrong with webmail, of course. If a Hungarian user opens his webmail with his encoding set to something in the ISO-8859-something range, but receives an e-mail with a Finnish attachment, we have a problem. The browser cannot show both on the same page, because the two encodings assign different characters to the same byte values. A page is served in exactly one encoding; HTTP and HTML have no way to mix several encodings within one page.

Suppose you have an online application which requires the user to fill in a bunch of forms. You conform to the HTML standard and have set an appropriate encoding, for instance ISO-8859-1. However, your Polish user has set his PC to his appropriate encoding and happily pounds away in your textareas, including all sorts of characters you're wholly unprepared to accept! The browser will not warn anyone -- it'll go ahead and submit anyway.
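The sketch below shows why such input is a problem for an ISO-8859-1 page. The sample text is made up. The fallback shown, numeric character references, is what I understand many browsers substitute in this situation, so treat that part as an assumption.

```python
text = "Pozdrowienia z Łodzi"  # hypothetical form input

# Ł is not in the ISO-8859-1 repertoire, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    fits = True
except UnicodeEncodeError:
    fits = False
print(fits)  # False

# One possible fallback: replace what doesn't fit by a numeric reference.
escaped = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(escaped)  # b'Pozdrowienia z &#321;odzi'
```

Either way, the script on the receiving end has to be prepared for it.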

Do a default installation of Tomcat. Save an HTML file as Unicode (UTF-8), put some funky non-ASCII characters in it and add a meta tag declaring the encoding. Watch in horror as Tomcat sends an HTTP header saying your document is in ISO-8859-1. The HTTP header takes precedence over your meta tag, so the browser displays nonsense characters.

Also, sometimes the user has equipment of limited capabilities. For instance, a friend of mine studied at a German university. They hand out telnet accounts which drop you straight into Pine. A student logs in and views an e-mail from Holland. The Dutch sender thought himself smart and used a Linux box, where everything's encoded in UTF-8. Except... the German student has a terminal emulator of limited capability. Pine checks this and, when displaying the e-mail, says something about the message having another encoding and warns that it might be displayed incorrectly.

Bugs in software

If your university uses Blackboard, and you have both Linux and Windows users, try the following experiment:

Linux user drafts up a summary of a meeting in a text file in some random editor, doesn't think of the fact that his system uses UTF-8 as a default encoding. There are a few accents and symbols in the file.
Linux user uploads said textfile into Blackboard
Linux or Windows user clicks on textfile in Blackboard and watches in horror as accents are mangled.

This problem is unfixable in the sense that there is no straightforward automated solution. The plain text format has no metadata section stating which encoding the file uses. Blackboard accepts the file and sees that its name ends with .txt. When a user clicks on the file, Blackboard outputs an HTTP header that declares both the type and the encoding:

  Content-Type: text/plain; charset=ISO-8859-1

Blackboard declares the wrong encoding for this file, but what can it do? Guessing the encoding is difficult and unreliable, so it does what goes right 95% of the time.
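For what it's worth, a common guessing trick is to try strict UTF-8 first and fall back to ISO-8859-1 only when that fails. The sketch below is my own illustration of that heuristic (the function name decode_guess is made up); it is not what Blackboard does, and it can still guess wrong, for example with ISO-8859-2 text.

```python
def decode_guess(data):
    """Guess between UTF-8 and ISO-8859-1; return (text, encoding_used)."""
    try:
        # Accented ISO-8859-1 text is rarely valid UTF-8, so success is a good sign.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 accepts every byte, so this never fails (but may be wrong).
        return data.decode("iso-8859-1"), "iso-8859-1"


print(decode_guess("café".encode("utf-8")))       # ('café', 'utf-8')
print(decode_guess("café".encode("iso-8859-1")))  # ('café', 'iso-8859-1')
```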

About web applications

Let's talk some more about web applications, the main way of building applications currently.

Some Pointers

http://www.cs.tut.fi/~jkorpela/chars.html

http://www.cs.tut.fi/~jkorpela/html/chars.html

http://www.cl.cam.ac.uk/~mgk25/unicode.html

http://bugzilla.mozilla.org/show_bug.cgi?id=18643

http://www.joelonsoftware.com/articles/Unicode.html

http://weblogs.java.net/blog/joconner/archive/2005/03/finding_the_bes.html