Use Unicode

Why the only bug you should see in text is 🐛 2022-10-11 #data

Writing doesn’t only use different languages; it also uses different alphabets. Along with a text’s language, modelling used to have to consider its writing system, or script. Fortunately, from a software perspective, Unicode solved this problem over twenty years ago.

Plain text

In the past, people talked about plain text without specifying its character encoding (how the software stores characters). This tended to either mean that you didn’t know which characters (and therefore alphabets and languages) the software supported, or that you didn’t know about (and couldn’t use) characters not used by US English.

There Ain’t No Such Thing As Plain Text.
Joel Spolsky

Dylan Beattie’s excellent Plain Text presentation thoroughly explores this particular rabbit hole.

Today, plain text means Unicode-encoded text that can (more or less) include any character from any of the world’s writing systems. Or, at least, it should.
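To make “character encoding” concrete, here is a minimal Python sketch. The sample string is an arbitrary illustration, not taken from any real system. It shows that the same characters become different bytes under different encodings, and that ASCII cannot store them at all.

```python
# One piece of text that mixes Latin, Chinese and Cyrillic characters.
text = "naïve 文字 кот"

# The same characters stored under two different Unicode encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# The byte counts differ, even though the characters are identical.
print(len(text), len(utf8_bytes), len(utf16_bytes))

# Decoding with the encoding used for writing gives the original text back.
assert utf8_bytes.decode("utf-8") == text

# ASCII only covers the characters of US English, so encoding fails here.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

The bytes alone don’t say which encoding produced them, which is why “plain text” without a stated encoding is ambiguous.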

Use Unicode

If your software never has encoding bugs, such as Windows’ “Bush hid the facts” bug, you got lucky and did something right from the start. If you get character encodings wrong, by mixing encodings instead of using the same Unicode encoding everywhere in your software, annoying encoding bugs will plague it.
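The “Bush hid the facts” bug is commonly described as Notepad guessing that a short ASCII file was UTF-16. The following Python sketch reproduces only the effect of that wrong guess, not Notepad’s detection logic, by deliberately decoding the bytes with the wrong encoding.

```python
# ASCII bytes as a program might have saved them, with no encoding recorded.
raw = "Bush hid the facts".encode("ascii")

# Misreading those 18 bytes as little-endian UTF-16 pairs them up into
# 9 two-byte code units, each of which happens to be a CJK character.
misread = raw.decode("utf-16-le")
print(misread)

# Reading with the encoding that was actually used restores the text.
correct = raw.decode("ascii")
print(correct)
```

Nothing in the bytes changed between the two reads; only the assumed encoding did.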

Encoding bugs typically result in mangled text, also known as mojibake, a Japanese neologism with colourful translations in various languages (from Wikipedia):

Language	Name	Transliteration	Meaning
Bulgarian	маймуница	majmunica	monkey’s [alphabet]
Chinese	乱码 / 亂碼	Luàn mǎ	chaotic code
German	Buchstabensalat		letter salad
Hungarian	betűszemét		letter garbage
Japanese	文字化け	mojibake	character transformation
Russian	кракозя́бры	krakozyabry
Serbian	ђубре	đubre	trash
Spanish	deformación		deformation
Vietnamese	chữ ma / loạn mã		ghost letter / disorder

Surprisingly, the Wikipedia page lacks an Icelandic entry, although Ólafur Waage suggested nærskiljanlegt (near/close understandable).
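Mojibake is easy to produce on purpose. This short Python sketch, using an arbitrary sample word, writes text as UTF-8 and reads it back as Latin-1, which is the kind of mismatch that happens when two components disagree about the encoding.

```python
original = "café"

# One component stores the text as UTF-8...
stored = original.encode("utf-8")

# ...and another reads the same bytes assuming Latin-1 (ISO 8859-1).
mangled = stored.decode("latin-1")
print(mangled)  # cafÃ©

# If nothing else has touched the text, reversing the wrong step repairs it.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The repair only works while the mangled text is still intact. Once it has been saved, truncated or re-encoded again, the original characters may no longer be recoverable.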

When you have an encoding bug, and you can’t fix it quickly or the development team doesn’t understand how, then you have a bigger problem than a single bug. Of course, by now, developers have all read Joel Spolsky’s classic 2003 article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Default encodings

Software components and tools, such as databases and text editors, often still encourage encoding bugs by default. Their default configurations often don’t use Unicode, for backwards compatibility with old software. That generally makes less sense now than it did twenty years ago, and probably doesn’t happen as much.

From a software development perspective, this means checking that every system component that handles text uses the same Unicode encoding by default, or that you have configured it to do so. On a more positive note, this becomes less necessary over time. Personally, I look forward to when we don’t have to talk about Unicode any more. Or PIKE MATCHBOX.
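As one small illustration of not relying on defaults, this Python sketch inspects the platform’s preferred encoding and then names UTF-8 explicitly when writing and reading a file. The file name and sample text are hypothetical, and other components (databases, HTTP headers, editors) need their own equivalent settings.

```python
import locale
import tempfile
from pathlib import Path

# The platform default may or may not be UTF-8, so inspect it rather than assume.
print(locale.getpreferredencoding(False))

text = "Ólafur wrote 文字化け"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"  # hypothetical file name
    # Name the encoding on both sides instead of trusting the default.
    path.write_text(text, encoding="utf-8")
    round_tripped = path.read_text(encoding="utf-8")

print(round_tripped == text)
```

Passing the encoding explicitly keeps the behaviour the same on machines whose default differs.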

Writing by Peter Hilton
