Writing doesn’t only use different languages; it also uses different alphabets. Along with the language of a piece of writing, modelling used to have to consider its writing system, or script. Fortunately, from a software perspective, Unicode solved this problem over twenty years ago.
In the past, people talked about plain text without specifying its character encoding (how the software stores characters). This tended to mean either that you didn’t know which characters (and therefore alphabets and languages) the software supported, or that you didn’t know about (and couldn’t use) characters not used by US English.
There Ain’t No Such Thing As Plain Text.
Dylan Beattie’s excellent Plain Text presentation thoroughly explores this particular rabbit hole.
Today, plain text means Unicode-encoded text that can (more or less) include any character from any of the world’s writing systems. Or, at least, it should.
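As a minimal sketch of what ‘character encoding’ means in practice, the following Python snippet stores the same text as bytes in two Unicode encodings, and then tries a legacy single-byte encoding. The sample string is an arbitrary choice for illustration, not something from a real system.

```python
# The same characters become different bytes under different encodings.
text = "Ólafur 乱码"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes)
print(utf16_bytes)

# Decoding with the encoding used to encode recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# A legacy single-byte encoding such as Latin-1 has no way to store
# the Chinese characters, so encoding fails outright.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(latin1_ok)
```

The bytes alone don’t tell you which encoding produced them, which is why ‘plain text’ without a stated encoding is ambiguous.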
If your software never has encoding bugs, such as Windows’ ‘Bush hid the facts’ bug, you got lucky and did something right from the start. If you get character encodings wrong, by mixing encodings instead of using the same Unicode encoding everywhere in your software, your software will be plagued with annoying encoding bugs.
Encoding bugs typically result in mangled text, also known as mojibake, a Japanese neologism with colourful translations in various languages (from Wikipedia):
Chinese: 乱码 / 亂碼
Vietnamese: chữ ma / loạn mã (ghost letter / disorder)
Surprisingly, the Wikipedia page lacks an Icelandic entry, although Ólafur Waage suggested nærskiljanlegt (near/close understandable).
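To make mojibake concrete, here is a small Python sketch of one common way it arises: one component writes text as UTF-8 and another reads the same bytes as Latin-1. The choice of Latin-1 as the mismatched encoding is an assumption for illustration; real bugs involve many other combinations.

```python
original = "nærskiljanlegt"

# One component stores the text as UTF-8 bytes...
stored = original.encode("utf-8")

# ...and another component wrongly assumes the bytes are Latin-1.
mangled = stored.decode("latin-1")
print(mangled)  # nÃ¦rskiljanlegt

# In this simple case the damage is reversible, because Latin-1 maps
# every byte to a character: undo the wrong decode, then decode correctly.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired)  # nærskiljanlegt
```

The repair only works here because no bytes were lost along the way. Once mangled text has been re-saved, truncated or mangled a second time, recovering the original can become guesswork.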
When you have an encoding bug, and you can’t fix it quickly or the development team doesn’t understand how, then you have a bigger problem than a single bug. Of course, by now, developers have all read Joel Spolsky’s classic 2003 article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Software components and tools, such as databases and text editors, often still encourage encoding bugs by default. Their default configurations often don’t use Unicode, for backwards compatibility with old software. That generally makes less sense now than it did twenty years ago, and probably doesn’t happen as much.
From a software development perspective, this means checking that every system component that handles text uses the same Unicode encoding by default, or that you have configured it to do so. On a more positive note, this becomes less necessary over time. Personally, I look forward to when we don’t have to talk about Unicode any more. Or PIKE MATCHBOX.
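For example, one way to avoid depending on a default is to name the encoding explicitly wherever text crosses a boundary. This Python sketch, which assumes UTF-8 as the chosen encoding and uses a throwaway temporary file, writes and reads text without relying on the platform’s default encoding.

```python
import tempfile
from pathlib import Path

text = "chữ ma / loạn mã"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"

    # Name the encoding on both the write and the read, so the result
    # does not depend on the operating system's default.
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

    # The bytes on disk are exactly the UTF-8 encoding of the text.
    raw = path.read_bytes()

print(round_trip == text)
```

The same idea applies to database connections, HTTP headers and editor settings, although how you configure each of those depends on the tool.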