Are you a West-European software developer who:
Read on for your first aid in this area... WARNING: high information density, no fluffy bits.
The current situation is that we're heading for Unicode, which solves most of these problems because it can represent any character. First of all, Unicode is a repertoire: a collection of characters that keeps expanding to cover ever more languages (a Klingon proposal was even considered, though rejected). Secondly, there are several encodings defined for Unicode. The most important one is UTF-8. It uses a single byte for the ASCII characters and, when necessary, two to four bytes for everything else. That makes it backwards compatible with ASCII.
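To make the "one byte when possible, more when necessary" point concrete, here is a small Python sketch showing how many bytes UTF-8 spends on a few sample characters:

```python
# UTF-8 is variable-width: ASCII characters take one byte,
# everything else takes two to four bytes.
for ch in ["A", "é", "€", "𝄞"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)

# "A" encodes to the exact same single byte as in plain ASCII,
# which is what makes UTF-8 backwards compatible.
```

A plain-ASCII file is therefore already a valid UTF-8 file, byte for byte.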
For Western Europeans, the most important standard is the ISO-8859 family and its Windows equivalents. These ISO standards are 8-bit encodings where the first 128 characters are plain old ASCII and the rest is language-specific. We have ISO-8859-1 for Western European languages, ISO-8859-2 for Eastern European ones, et cetera. More recently, ISO-8859-15 was introduced as a replacement for ISO-8859-1 because of the Euro sign and some missing French and Finnish characters.
Then there are the Windows encodings, which mirror the ISO ones: ISO-8859-1 becomes Windows-1252, a.k.a. cp1252, a.k.a. WinLatin1. The Windows encodings contain extra characters like smart/curly quotes, the trademark sign, and the en- and em-dash. These have been put in the 0x80-0x9F range that ISO-8859-1 reserves for control characters. There are more; for Greek, Windows has Windows-1253, a.k.a. WinGreek, and the list goes on.
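You can see those extra Windows characters for yourself; a quick Python sketch:

```python
# Windows-1252 places printable characters (curly quotes, the em-dash,
# the euro sign) in the 0x80-0x9F range that ISO-8859-1 reserves for
# control characters.
smart_quote = "\u201c"                # “ LEFT DOUBLE QUOTATION MARK
print(smart_quote.encode("cp1252"))   # a single byte in the 0x80-0x9F range

# The same character simply does not exist in ISO-8859-1:
try:
    smart_quote.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("not in ISO-8859-1:", err)
```

This is exactly why curly quotes from Windows documents so often turn into question marks or boxes elsewhere.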
An example: I'm Dutch so I want to write words like patiënt, or geïrriteerd or when it's the end of the year and my boss reluctantly shares some of his riches: tantièmes. That's why my Windows PC uses the WinLatin1 repertoire/encoding (kind of a lie, see later) and my Linux PC uses the Unicode repertoire, with encoding UTF-8.
Suppose we have a piece of text. Maybe we got it from the local network share, or maybe from a USB stick. Or maybe it was the result of an internet user pushing buttons on our website and submitting it to our script.
Whatever the source, this piece of text can be encoded in various ways. For example, if the creator was Polish, he may want to tell us he really enjoyed his stay in Łódź. If we aren't told, outside of the text itself, which encoding was used, things go wrong. Say our PC assumes ISO-8859-1 by default, while the Polish author used ISO-8859-2. Our encoding then puts different characters in the places of his Ł and his ź, so we don't see what he typed.
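The Polish scenario, sketched in a few lines of Python:

```python
# The Polish author saves "Łódź" in ISO-8859-2; we decode the raw
# bytes with our default ISO-8859-1 and get nonsense.
original = "Łódź"
raw = original.encode("iso-8859-2")   # the bytes on disk: b'\xa3\xf3d\xbc'
garbled = raw.decode("iso-8859-1")    # our wrong assumption
print(garbled)                        # £ód¼ -- same bytes, wrong characters
```

Note that nothing "fails" here: every byte is valid in both encodings, so no error is raised. The text is just silently wrong.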
Since most encodings use US-ASCII for the first 128 characters, the plain ASCII part is fine. But beyond that, we need to know the encoding in which the Polish author typed his document. That is metadata: data about the data. So where is this metadata stored? XML, for example, declares it right in the first line of the document:
<?xml version="1.0" encoding="ISO-8859-1"?>
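An XML parser picks that declaration up by itself when you hand it the raw bytes, so no out-of-band metadata is needed. A minimal Python sketch:

```python
# The encoding declaration travels inside the file, so the parser
# can decode the bytes on its own.
import xml.etree.ElementTree as ET

xml_text = '<?xml version="1.0" encoding="ISO-8859-1"?><note>patiënt</note>'
raw = xml_text.encode("iso-8859-1")   # the bytes as they'd sit on disk

root = ET.fromstring(raw)             # parser reads the declaration itself
print(root.text)                      # patiënt
```

This only works because the declaration itself is pure ASCII, which all the candidate encodings agree on -- a neat bootstrapping trick.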
Even if there is a standard, it doesn't mean nothing can go wrong. Some examples follow.
In Windows 2000, when you save a text file in Notepad as "Unicode" (which silently means UTF-16, little-endian), the file data is prefixed with a special character, the byte order mark (BOM), so it can be recognized as Unicode. Few other systems adhere to this convention.
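What that BOM prefix looks like, sketched in Python:

```python
import codecs

# Notepad marks a "Unicode" (UTF-16 LE) file by prefixing the byte
# order mark, the bytes 0xFF 0xFE:
data = codecs.BOM_UTF16_LE + "patiënt".encode("utf-16-le")
print(data[:2])                # b'\xff\xfe'

# The generic "utf-16" codec uses the BOM to pick the byte order
# and strips it while decoding:
print(data.decode("utf-16"))   # patiënt
```

A tool that doesn't know about the BOM sees those two bytes as junk at the start of the file, which is why the "standard" causes trouble elsewhere.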
With e-mail, there are other problems. MIME defines a standard way of stating the encoding of attachments. But what about the filenames of those attachments? Most mail agents encode the filename in the same encoding as the message, but not all of them.
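When a mail agent does it properly, the filename travels as a MIME "encoded word" that names its own charset. A sketch with Python's standard email module, using a hypothetical Polish filename:

```python
# Decoding a MIME encoded-word filename (RFC 2047 style).
from email.header import decode_header

raw_name = "=?ISO-8859-2?Q?=A3=F3d=BC.txt?="   # hypothetical attachment name
(payload, charset), = decode_header(raw_name)
print(payload.decode(charset))                 # Łódź.txt
```

The trouble the article describes starts when an agent skips this mechanism and just dumps raw bytes into the header.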
This can go horribly wrong with webmail, of course. If a Hungarian user opens his webmail with his encoding set to something in the ISO-8859 range, but receives an e-mail with a Finnish attachment, we have a problem: the browser cannot show both on the same page, because the encodings differ and assign different characters to overlapping byte values. HTTP and HTML allow only one encoding per page.
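A sketch of why no single ISO-8859 page can serve both users, and why Unicode can:

```python
# Hungarian ő (U+0151) lives in ISO-8859-2; Finnish ä is fine in
# ISO-8859-1 -- but one 8-bit page can't freely mix repertoires.
hungarian, finnish = "idő", "hyvää"   # sample words: "time", "good"

try:
    hungarian.encode("iso-8859-1")    # the Western-European page encoding
except UnicodeEncodeError:
    print("ő does not exist in ISO-8859-1")

# One encoding that covers both at once:
print((hungarian + " " + finnish).encode("utf-8"))
```

Serving the whole page as UTF-8 sidesteps the conflict entirely, which is where the article is heading.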
Suppose you have an online application which requires the user to fill in a bunch of forms. You conform to the HTML standard and have set an appropriate encoding, for instance ISO-8859-1. However, your Polish user has set his PC to his appropriate encoding and happily pounds away in your textareas, including all sorts of characters you're wholly unprepared to accept! The browser will not warn anyone -- it'll go ahead and submit anyway.
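What the server ends up seeing in such a mismatch can be sketched like this (assuming, for illustration, that the browser submits UTF-8 bytes while the server decodes them as ISO-8859-1):

```python
# The browser submits the Polish text in one encoding; a server that
# blindly assumes another sees mojibake instead of the original.
submitted = "Łódź".encode("utf-8")        # bytes the browser sends
assumed = submitted.decode("iso-8859-1")  # the naive server's reading
print(assumed)    # garbage like 'Å\x81Ã³dÅº', not what the user typed
```

Again, no error is raised anywhere; the garbage may even make it all the way into the database before anyone notices.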
Do a default installation of Tomcat. Save an HTML file as UTF-8, put some funky Unicode characters in it, and add a meta tag declaring the charset. Watch in horror as Tomcat sends an HTTP header saying your document is in ISO-8859-1 encoding, telling your browser to ignore your meta tag and display nonsense characters.
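The lesson is that the HTTP Content-Type header wins over any meta tag, so the server itself must declare the real encoding. A minimal sketch with Python's built-in http.server (page content hypothetical):

```python
# Serving a UTF-8 page with a matching Content-Type header, so the
# HTTP metadata agrees with the actual bytes.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = "<html><body>patiënt in Łódź</body></html>".encode("utf-8")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Declare the real encoding in the header, not only in a meta tag:
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

# To try it out:
# HTTPServer(("localhost", 8000), Handler).serve_forever()
```

In Tomcat's case the equivalent fix is configuring the charset the container sends, rather than hoping the meta tag will be honoured.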
Also, sometimes the user has equipment of limited capabilities. For instance, a friend of mine studied at a German university. They hand out telnet accounts which drop you straight into Pine. A student logs in and views an e-mail from Holland. The Dutch sender thought himself smart and used a Linux box, where everything is encoded in UTF-8. Except... the German student has a terminal emulator of limited capability. Pine detects this and, upon displaying the e-mail, says something about the message having another encoding, warning that it might be displayed incorrectly.
If your university uses Blackboard, and you have both Linux and Windows users, try the following experiment:
This problem is unfixable in the sense that there is no straightforward automated solution. The plain text format has no metadata section stating which encoding the file was written in. Blackboard accepts the upload, sees that the file name ends in .txt, and when a user clicks on the file, outputs an HTTP header like:

Content-Type: text/plain; charset=ISO-8859-1

For our Linux user's UTF-8 file that charset is wrong, but what can Blackboard do? Trying to guess the encoding is difficult and unreliable, so it does what goes right 95% of the time.
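Why guessing is so unreliable can be shown in two lines of Python: the very same byte is perfectly valid text in several encodings, just with different meanings.

```python
# One byte, two legitimate readings -- inspection of the bytes alone
# cannot tell us which one the author meant.
raw = b"\xa3"
print(raw.decode("iso-8859-1"))   # £  (pound sign, Western European)
print(raw.decode("iso-8859-2"))   # Ł  (L with stroke, Polish)
```

Heuristic encoding detectors exist, but they work on statistics over many bytes and still get short or unusual texts wrong.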
Let's talk some more about web applications, currently the dominant way of building applications.