Modelling text as writing

Software design problems with the simplest of four kinds of text

2022-08-30 #data #DDD

Data modelling unfairly restricts us to modelling some values as text (strings). Hillel Wayne explains:

The problem with strings is they represent too many things. They’re the lingua franca of arbitrary data that all systems can read and write, and that means we use them in many distinct ways that can be confused with each other.

Wayne lists four kinds of text: writing, symbols, data and languages. Ideally, you wouldn’t have to care about different kinds of text, but each kind turns out to have its own potential for bad software user experience and bugs.

Writing

English prose doesn’t have the kind of structure that makes it easy for computers to understand it, so we call anything that can parse this AI. Treating written text as an opaque value reduces it to an unstructured sequence of characters, which software handles relatively easily. However, text in natural languages, or actual languages if you like to separate them from programming languages, still causes problems.

One problem with writing arises from not knowing its potential length. While this blog’s article word limit makes lengths predictable, both novels and people’s names vary widely in length, by at least a factor of ten. And while modelling a text value’s length as unlimited doesn’t pose a conceptual problem, it complicates software implementations.

When you fail to model text lengths properly in advance, your software will probably break visibly, and you can fix it. That assumes, of course, that you can change your data model to remove length limits, and that you actually find out, rather than silently losing customers whose content your software rejects. In practice, you can mitigate these risks during data modelling, unlike the next problem, which you may try to ignore at first.
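A minimal sketch of making a length limit part of the data model, in Python, so that over-long text gets rejected visibly instead of being silently truncated. The names `BoundedText` and `TextTooLong`, and the limit of 200 characters, are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


class TextTooLong(ValueError):
    """Raised when text exceeds its modelled limit (hypothetical name)."""


@dataclass(frozen=True)
class BoundedText:
    """Written text with an explicit maximum length (hypothetical type)."""

    value: str
    max_length: int = 200  # assumed limit, chosen only for this example

    def __post_init__(self) -> None:
        # len() counts Unicode code points, not bytes or displayed characters.
        if len(self.value) > self.max_length:
            raise TextTooLong(
                f"{len(self.value)} characters exceeds the limit of {self.max_length}"
            )


name = BoundedText("Peter Hilton")
print(len(name.value))  # 12
```

Raising an error at construction time is one way to make the failure visible; whether to reject, warn or raise the limit remains a modelling decision.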

Languages

Front page of the 3 February 2009 edition of the Luxemburger Wort newspaper, with articles in both German and French

These days, many software categories require multiple language support. Software applications therefore need to make a text value’s language explicit, for two reasons. First, most people don’t understand most languages, which means that multilingual software should support language preferences, to identify which text a reader can understand. Second, even fewer people understand pairs of languages, so software generally avoids mixing content languages on the same page. Note: Luxembourg newspaper Luxemburger Wort used to provide an exception, with German and French articles on the front page (pictured).
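As a sketch of what supporting language preferences could look like, assuming each piece of content is stored as a mapping from language code to text: the function name `pick_translation` and the sample headline are hypothetical.

```python
from typing import Optional


def pick_translation(
    translations: dict[str, str], preferences: list[str]
) -> Optional[str]:
    """Return the text in the reader's most preferred available language."""
    for language in preferences:
        if language in translations:
            return translations[language]
    return None  # no text in any language the reader has asked for


# Hypothetical content, available in German and French only.
headline = {"de": "Guten Morgen", "fr": "Bonjour"}

print(pick_translation(headline, ["en", "fr"]))  # Bonjour
```

Returning `None` rather than an arbitrary fallback keeps the decision about what to show a reader who understands none of the available languages explicit.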

Software generally standardises language and region identification by using language tags, such as en-GB for British English. These tags combine an ISO 639-1 two-letter language code with an ISO 3166-1 alpha-2 country code. Apparently, the regional language lists in ISO 639-2 (487 languages) and ISO 639-3 (8279 languages) never caught on.
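A minimal sketch of modelling such a tag as structured data instead of an opaque string. It only handles the two-part form described here (a two-letter language code with an optional two-letter region), which is an assumption of this example; real language tags allow further parts such as scripts and variants. `LanguageTag` and `parse_language_tag` are hypothetical names.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str          # two-letter language code, e.g. "en"
    region: Optional[str]  # two-letter country code, e.g. "GB", or None


def _is_two_letters(code: str) -> bool:
    return len(code) == 2 and code.isascii() and code.isalpha()


def parse_language_tag(tag: str) -> LanguageTag:
    """Split a simple 'language' or 'language-REGION' tag into its parts."""
    language, separator, region = tag.partition("-")
    if not _is_two_letters(language):
        raise ValueError(f"expected a two-letter language code in {tag!r}")
    if separator and not _is_two_letters(region):
        raise ValueError(f"expected a two-letter region code in {tag!r}")
    # Normalise case: lower-case language, upper-case region.
    return LanguageTag(language.lower(), region.upper() if separator else None)


print(parse_language_tag("en-GB"))  # LanguageTag(language='en', region='GB')
```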

Of course, content localisation only works if you model content languages in advance, and tag all content with its language. If you don’t, and fail to notice that you now have content in multiple languages, you have a data quality problem: lots of text in an unknown language. At least you can use an ISO 639-2 code for that: und (undetermined).
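One way to make that data quality problem measurable, sketched under the assumption that every text value carries a language field that defaults to und. The type `TaggedText`, the helper `untagged` and the sample articles are hypothetical.

```python
from dataclasses import dataclass

UNDETERMINED = "und"  # ISO 639-2 code for an undetermined language


@dataclass(frozen=True)
class TaggedText:
    """Written text together with its language (hypothetical type)."""

    value: str
    language: str = UNDETERMINED  # explicit marker instead of a missing value


def untagged(texts: list[TaggedText]) -> list[TaggedText]:
    """Return the texts whose language nobody has recorded yet."""
    return [text for text in texts if text.language == UNDETERMINED]


articles = [
    TaggedText("Guten Morgen", "de"),
    TaggedText("Bonjour"),  # language was never recorded
    TaggedText("Good morning", "en"),
]

print(len(untagged(articles)))  # 1
```

Defaulting to und does not fix the missing data, but it turns unknown-language text into something you can count and query.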

Other (technical) problems

Text length and language issues already make text hard enough for software to get right. You had better hope that your software doesn’t also have technical design problems caused by character encoding issues and embedding one kind of text in another.

Writing by Peter Hilton
