Use Unicode
Why the only bug you should see in text is š 2022-10-11 #data
Writing doesnāt only use different languages, it also uses different alphabets. Along with writingās language, modelling used to have to consider its writing system, or script. Fortunately, from a software perspective, Unicode solved this problem over twenty years ago.
Plain text
In the past, people talked about plain text without specifying its character encoding (how the software stores characters). This tended to either mean that you didnāt know which characters (and therefore alphabets and languages) the software supported, or that you didnāt know about (and couldnāt use) characters not used by US English.
There Aināt No Such Thing As Plain Text.
Joel Spolsky
Dylan Beattieās excellent Plain Text presentation thoroughly explores this particular rabbit hole.
Today, plain text means Unicode-encoded text that can (more or less) include any character from any of the worldās writing systems. Or, at least, it should.
Use Unicode
If your software never has encoding bugs, such as Windowsā Bush hid the facts bug, you got lucky and did something right, from the start. If you get character encodings wrong, by failing to use the same Unicode encoding everywhere in your software, and mixing encodings, your software will be plagued with annoying encoding bugs.
Encoding bugs typically result in mangled text, also known as mojibake, a Japanese neologism with colourful translations in various languages (from Wikipedia):
Language | Name | Transliteration | Meaning |
---|---|---|---|
Bulgarian | majmunica | Š¼Š°Š¹Š¼ŃŠ½ŠøŃŠ° | monkeyās [alphabet] |
Chinese | ä¹±ē / äŗē¢¼ | LuĆ n mĒ | chaotic code |
German | Buchstabensalat | Ā | letter salad |
Hungarian | betűszemĆ©t | Ā | letter garbage |
Japanese | ęååć | mojibake | character transformation |
Russian | krakozyabry | ŠŗŃŠ°ŠŗŠ¾Š·ŃĢŠ±ŃŃ | Ā |
Serbian | Äubre | ŃŃŠ±ŃŠµ | trash |
Spanish | deformaciĆ³n | Ā | deformation |
Vietnamese | chį»Æ ma / loįŗ”n mĆ£ | Ā | ghost letter / disorder |
Surprisingly, the Wikipedia page lacks an Icelandic entry, although Ćlafur Waage suggested nƦrskiljanlegt (near/close understandable).
When you have an encoding bug, and you canāt fix it quickly or the development team doesnāt understand how, then you have a bigger problem than a single bug. Of course, by now, developers have all read Joel Spolskyās classic 2003 article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Default encodings
Software components and tools, such as databases and text editors, often still encouraging encoding bugs by default. Their default configurations often donāt use Unicode, for backwards compatibility with old software. That generally makes less sense now than it did twenty years ago, and probably doesnāt happen as much.
From a software development perspective, it means checking that every system component that handles text uses the same Unicode encoding by default, or that you have configured to do so. On a more positive note, this becomes less necessary over time. Personally, I look forward to when we donāt have to talk about Unicode any more. Or PIKE MATCHBOX.