Writing by Peter Hilton

Use Unicode

Why the only bug you should see in text is 🐛 2022-10-11 #data

Ryoma Onita

Writing doesn’t only use different languages, it also uses different alphabets. Along with writing’s language, modelling used to have to consider its writing system, or script. Fortunately, from a software perspective, Unicode solved this problem over twenty years ago.

Plain text

In the past, people talked about plain text without specifying its character encoding (how the software stores characters). This tended to either mean that you didn’t know which characters (and therefore alphabets and languages) the software supported, or that you didn’t know about (and couldn’t use) characters not used by US English.
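As a minimal sketch of what "how the software stores characters" means, the following Python snippet encodes the same text under several encodings and compares the results. The helper names `byte_lengths` and `fits_in_ascii` are hypothetical, chosen only for this illustration.

```python
def byte_lengths(text: str) -> dict:
    """Return how many bytes the same text occupies in a few encodings."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "latin-1")
    }


def fits_in_ascii(text: str) -> bool:
    """Report whether every character can be stored as US-ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# "naïve café", written with escapes so the source file stays pure ASCII.
sample = "na\u00efve caf\u00e9"

print(byte_lengths(sample))
print(fits_in_ascii("plain text"), fits_in_ascii(sample))
```

The same ten characters take a different number of bytes in each encoding, and the ASCII check fails as soon as the text leaves the US English repertoire, which is why "plain text" without a named encoding is ambiguous.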

There Ain’t No Such Thing As Plain Text.
Joel Spolsky

Dylan Beattie’s excellent Plain Text presentation thoroughly explores this particular rabbit hole.

Today, plain text means Unicode-encoded text that can (more or less) include any character from any of the world’s writing systems. Or, at least, it should.

Use Unicode

If your software never has encoding bugs, such as Windows’ Bush hid the facts bug, you got lucky and did something right, from the start. If you get character encodings wrong, by failing to use the same Unicode encoding everywhere in your software, and mixing encodings, your software will be plagued with annoying encoding bugs.

Encoding bugs typically result in mangled text, also known as mojibake, a Japanese neologism with colourful translations in various languages (from Wikipedia):

| Language | Name | Transliteration | Meaning |
|---|---|---|---|
| Bulgarian | majmunica | маймуница | monkey’s [alphabet] |
| Chinese | 乱码 / 亂碼 | Luàn mǎ | chaotic code |
| German | Buchstabensalat | | letter salad |
| Hungarian | betűszemét | | letter garbage |
| Japanese | 文字化け | mojibake | character transformation |
| Russian | krakozyabry | кракозя́бры | |
| Serbian | đubre | ђубре | trash |
| Spanish | deformación | | deformation |
| Vietnamese | chữ ma / loạn mã | | ghost letter / disorder |

Surprisingly, the Wikipedia page lacks an Icelandic entry, although Ólafur Waage suggested nærskiljanlegt (near/close understandable).
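To make the mechanism concrete, here is a minimal Python sketch of how mixing encodings produces mojibake: text is written as UTF-8 and then read back as if it were a legacy single-byte encoding. The function names `mangle` and `unmangle` are hypothetical, and Windows-1252 (`cp1252`) is only an assumed example of the wrong encoding.

```python
def mangle(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode with the wrong encoding (the bug)."""
    return text.encode("utf-8").decode(wrong_encoding)


def unmangle(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse the mistake: recover the original bytes, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# "doesn’t" with a typographic apostrophe (U+2019), written as an escape.
original = "doesn\u2019t"

garbled = mangle(original)
print(garbled)            # the apostrophe becomes three unrelated characters
print(unmangle(garbled))  # the round trip restores the original text
```

Reversing the damage like this only works when every byte survives the wrong decoding step. Some byte values have no character assigned in `cp1252`, so Python raises `UnicodeDecodeError` for them, and in general it is more reliable to prevent the mismatch than to repair it afterwards.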

When you have an encoding bug, and you can’t fix it quickly or the development team doesn’t understand how, then you have a bigger problem than a single bug. Of course, by now, developers have all read Joel Spolsky’s classic 2003 article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Default encodings

Software components and tools, such as databases and text editors, often still encourage encoding bugs by default. Their default configurations often don’t use Unicode, for backwards compatibility with old software. That generally makes less sense now than it did twenty years ago, and probably doesn’t happen as much.

From a software development perspective, it means checking that every system component that handles text uses the same Unicode encoding by default, or that you have configured it to do so. On a more positive note, this becomes less necessary over time. Personally, I look forward to when we don’t have to talk about Unicode any more. Or PIKE MATCHBOX.
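One way to apply the configuration advice in application code is to name the encoding explicitly wherever text crosses a boundary, rather than trusting a platform default. The following Python sketch assumes UTF-8 as the chosen encoding. The function name `write_and_read` and the file name `example.txt` are hypothetical.

```python
import locale
import tempfile
from pathlib import Path


def write_and_read(text: str, directory: str) -> str:
    """Write and read a file with the encoding stated explicitly both times."""
    path = Path(directory) / "example.txt"  # hypothetical file name
    path.write_text(text, encoding="utf-8")
    return path.read_text(encoding="utf-8")


# What the platform would pick if the encoding were left out.
# The value varies between systems, which is the reason not to rely on it.
print(locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmp:
    # Latin, Cyrillic and Japanese text, written as escapes.
    sample = "caf\u00e9 \u043c\u0430\u0439\u043c\u0443\u043d\u0438\u0446\u0430 \u6587\u5b57\u5316\u3051"
    print(write_and_read(sample, tmp) == sample)
```

The same idea carries over to the other components mentioned above, such as database connections and editor settings, each of which has its own way of declaring an encoding.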
