Writing by Peter Hilton

Use Unicode

Why the only bug you should see in text is 🐛 2022-10-11 #data

Ryoma Onita

Writing doesn’t only use different languages; it also uses different alphabets. Along with writing’s language, modelling used to have to consider its writing system, or script. Fortunately, from a software perspective, Unicode solved this problem over twenty years ago.

Plain text

In the past, people talked about plain text without specifying its character encoding (how the software stores characters). This tended to either mean that you didn’t know which characters (and therefore alphabets and languages) the software supported, or that you didn’t know about (and couldn’t use) characters not used by US English.
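To make ‘how the software stores characters’ concrete, here is a minimal Python sketch that stores the same short text under three encodings. The sample string is an arbitrary choice for illustration.

```python
# The same text, stored as bytes under three different character encodings.
text = "café"

utf8 = text.encode("utf-8")        # 'é' takes two bytes
latin1 = text.encode("latin-1")    # 'é' takes one byte
utf16 = text.encode("utf-16-le")   # every character here takes two bytes

for name, data in [("UTF-8", utf8), ("Latin-1", latin1), ("UTF-16LE", utf16)]:
    print(f"{name:9} {len(data)} bytes: {data.hex(' ')}")

# US-ASCII cannot store this text at all.
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII:", error.reason)
```

The bytes alone don’t say which of these encodings produced them, which is why ‘plain text’ without a stated encoding leaves the reader guessing.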

There Ain’t No Such Thing As Plain Text.
Joel Spolsky

Dylan Beattie’s excellent Plain Text presentation thoroughly explores this particular rabbit hole.

Today, plain text means Unicode-encoded text that can (more or less) include any character from any of the world’s writing systems. Or, at least, it should.

Use Unicode

If your software never has encoding bugs, such as Windows’ Bush hid the facts bug, you got lucky and did something right from the start. If you get character encodings wrong, by mixing encodings instead of using the same Unicode encoding everywhere in your software, your software will be plagued with annoying encoding bugs.
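The Bush hid the facts bug came from Notepad guessing that a short ASCII file contained UTF-16. The following rough Python sketch shows the effect only: it decodes the bytes as UTF-16 directly, and doesn’t attempt to reproduce Windows’ detection heuristic.

```python
# 18 ASCII bytes, as a text editor might have saved them.
data = "Bush hid the facts".encode("ascii")

# Misreading the bytes as little-endian UTF-16 pairs them up into 9 code
# units, each of which happens to land in the CJK ideograph range.
misread = data.decode("utf-16-le")
print(misread)

# Decoding with the encoding that actually wrote the bytes recovers the text.
print(data.decode("ascii"))
```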

Encoding bugs typically result in mangled text, also known as mojibake, a Japanese neologism with colourful translations in various languages (from Wikipedia):

| Language | Name | Transliteration | Meaning |
|---|---|---|---|
| Bulgarian | маймуница | majmunica | monkey’s [alphabet] |
| Chinese | 乱码 / 亂碼 | Luàn mǎ | chaotic code |
| German | Buchstabensalat | | letter salad |
| Hungarian | betűszemét | | letter garbage |
| Japanese | 文字化け | mojibake | character transformation |
| Russian | кракозя́бры | krakozyabry | |
| Serbian | ђубре | đubre | trash |
| Spanish | deformación | | deformation |
| Vietnamese | chữ ma / loạn mã | | ghost letter / disorder |

Surprisingly, the Wikipedia page lacks an Icelandic entry, although Ólafur Waage suggested nærskiljanlegt (near/close understandable).
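Mangled text of this kind usually appears when one component writes bytes in one encoding and another reads them as a different one. Here is a minimal Python sketch, assuming a UTF-8 writer and a Windows-1252 reader; other pairings produce different garbage.

```python
original = "文字化け"

# Written as UTF-8 by one component...
stored = original.encode("utf-8")

# ...and read as Windows-1252 by another. errors="replace" covers the few
# byte values that Windows-1252 leaves undefined.
mangled = stored.decode("cp1252", errors="replace")
print(mangled)

# Latin-1 maps every byte to a character, so that particular mix-up can be
# reversed, as long as nothing has altered the text in between.
round_trip = stored.decode("latin-1").encode("latin-1").decode("utf-8")
print(round_trip == original)
```

Repairing text after the fact like this only works in the lucky cases. Using one encoding everywhere avoids the problem in the first place.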

When you have an encoding bug, and you can’t fix it quickly or the development team doesn’t understand how, then you have a bigger problem than a single bug. Of course, by now, developers have all read Joel Spolsky’s classic 2003 article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Default encodings

Software components and tools, such as databases and text editors, often still encourage encoding bugs by default. Their default configurations often don’t use Unicode, for backwards compatibility with old software. That generally makes less sense now than it did twenty years ago, and probably doesn’t happen as much.

From a software development perspective, this means checking that every system component that handles text uses the same Unicode encoding by default, or that you have configured it to do so. On a more positive note, this becomes less necessary over time. Personally, I look forward to when we don’t have to talk about Unicode any more. Or PIKE MATCHBOX.
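In application code, one way to avoid depending on a default is to name the encoding wherever text crosses a boundary. Here is a small Python sketch for file I/O, where the file name and sample text are made up for illustration. Databases, HTTP headers and editors have their own settings, which this doesn’t cover.

```python
import tempfile
from pathlib import Path

text = "Ólafur writes 文字化け and đubre"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample.txt"

    # Name the encoding on both sides; without it, Python falls back to a
    # platform-dependent default that may not be UTF-8.
    path.write_text(text, encoding="utf-8")
    read_back = path.read_text(encoding="utf-8")

print(read_back == text)
```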
