Modelling text as data and code

Getting your formats mixed up 2022-09-13 #data #DDD

Recently on Twitter, Hillel Wayne complained about mixing up four kinds of text: writing, symbols, data, and code. Software developers typically model all of these things as text, but also have different ways of identifying them, and mix them together in different ways.

This article continues on from writing and symbols, and explores what goes wrong with data and code.

Data

In this context, data refers to structured computer-readable formats, such as spreadsheets, binary rich text files, or text-based data formats like CSV and JSON. Data’s structure makes it computer readable, by software that knows what to expect.

But sometimes data doesn’t load. This happens when a missing or imprecise specification results in incompatible variations between what some software produces, and what other software understands. We mitigate these problems by standardising the data formats.

Sometimes people design data formats in advance, without knowing whether software will use them successfully. SVG was in that position until Adobe SVG Viewer took up the challenge. More often, a standard only appears after years or decades of incompatibility, as with RFC 4180-compliant CSV.

In practice, some standards remain imperfect, if not actually missing, so software also applies a second mitigation: Postel’s law, from RFC 761 Transmission Control Protocol (TCP), section 2.10, Robustness Principle:

TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
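Applied to a data format like CSV, the principle might look like the following minimal Python sketch. The function names `read_rows` and `write_rows` are hypothetical, and the delimiter guess is a deliberately crude assumption, not a general-purpose detector: the reader tolerates a byte order mark, mixed line endings, stray spaces and semicolon delimiters, while the writer only ever emits comma-separated rows with CRLF line endings, in the style of RFC 4180.

```python
import csv
import io


def read_rows(text: str) -> list[list[str]]:
    """Liberal reader: tolerate a BOM, mixed newlines, padding and ';' or ','."""
    text = text.lstrip("\ufeff")
    first_line = text.splitlines()[0] if text else ""
    # Assumption: the delimiter is whichever of ';' or ',' is more frequent
    # in the first line. Real-world input may need a better heuristic.
    delimiter = ";" if first_line.count(";") > first_line.count(",") else ","
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    # Skip empty lines and trim whitespace around each value.
    return [[cell.strip() for cell in row] for row in reader if row]


def write_rows(rows: list[list[str]]) -> str:
    """Conservative writer: commas, minimal quoting and CRLF line endings."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n"
    )
    writer.writerows(rows)
    return buffer.getvalue()


messy = '\ufeffname; city\r\nAda ; London\nGrace;"New York; NY"\n'
rows = read_rows(messy)
clean = write_rows(rows)
```

Reading the messy input and writing it back out normalises it, so downstream software only has to understand one variation.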

Meanwhile, we usually handle code differently.

Code

Text as code, which Wayne calls languages, also models a kind of structured data according to a specification, with the critical difference that it does something. Computers execute code. In practice, we don’t think about modelling text as code much, because we usually store code in files, or at least pretend that we do.

If you do store code in a database, don’t forget to also make its language explicit. In general, when you store or transmit data or code, use a standard type identifier.

Just as standard language codes identify the language of a piece of writing, you can use standard codes to identify data formats and code languages. On the Internet, you use a MIME type to identify examples like application/json (data), text/javascript (code), and text/html (markup). These also make better type identifiers in databases than invented codes or flags that indicate whether text values contain HTML, say.
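As a rough illustration, here is a minimal Python sketch that stores each text value alongside its MIME type. The `snippet` table, its columns and the helper functions are all hypothetical names invented for this example, and it uses an in-memory SQLite database for brevity.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create a hypothetical table that keeps a MIME type next to each text value."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snippet ("
        " id INTEGER PRIMARY KEY,"
        " media_type TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def save(conn: sqlite3.Connection, media_type: str, body: str) -> int:
    """Store text together with the standard identifier for its format."""
    cursor = conn.execute(
        "INSERT INTO snippet (media_type, body) VALUES (?, ?)", (media_type, body)
    )
    return cursor.lastrowid


def load_as_data(conn: sqlite3.Connection, snippet_id: int):
    """Parse a stored value as JSON, but only if it is labelled as JSON."""
    media_type, body = conn.execute(
        "SELECT media_type, body FROM snippet WHERE id = ?", (snippet_id,)
    ).fetchone()
    if media_type != "application/json":
        raise ValueError(f"expected application/json, found {media_type}")
    return json.loads(body)


conn = open_store()
data_id = save(conn, "application/json", '{"title": "Data"}')
code_id = save(conn, "text/javascript", "console.log('hi')")
```

With the type stored explicitly, `load_as_data(conn, data_id)` returns a dictionary, while `load_as_data(conn, code_id)` fails with a clear error instead of guessing what the text contains.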

Mixing data and code

We usually keep data and code separate. The Lisp programming language provides a notable exception: Lisp programs can treat Lisp code as data, rather than as text that they would have to parse. This hasn’t caught on in more modern languages.
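To give a flavour of what treating code as data means, the following Python sketch imitates the idea with nested lists. This is only an approximation under stated assumptions, not how Lisp itself works: the `evaluate` and `swap_operator` functions and the two-operator table are hypothetical, invented for illustration.

```python
import operator
from functools import reduce

# Assumption: a tiny language with only two operators.
OPERATORS = {"+": operator.add, "*": operator.mul}


def evaluate(expression):
    """Run a program represented as nested lists, such as ["+", 1, 2]."""
    if not isinstance(expression, list):
        return expression  # a plain number evaluates to itself
    name, *arguments = expression
    values = [evaluate(argument) for argument in arguments]
    return reduce(OPERATORS[name], values)


def swap_operator(expression, old, new):
    """Rewrite a program as ordinary list data, without parsing any text."""
    if not isinstance(expression, list):
        return expression
    name, *arguments = expression
    renamed = new if name == old else name
    return [renamed] + [swap_operator(a, old, new) for a in arguments]


program = ["+", 1, ["*", 2, 3]]           # 1 + (2 * 3)
rewritten = swap_operator(program, "*", "+")  # 1 + (2 + 3)
```

Here the same list is both a program that `evaluate` can run and a data structure that `swap_operator` can transform, with no parsing step between the two.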

At the other end of the popularity scale, we find HTML, the most popular markup language. Markup languages live on the boundary between kinds of text. Depending on how you look at it, markup either embeds data in prose, or embeds prose in code.

In general, in software development, you should avoid mixing the four different kinds of text together by embedding one in another. Embedding writing or symbols in data and code works, but only up to a point, which is why content management systems exist to keep them separate. Meanwhile, embedding one kind of data or code in another kind of data or code may have seemed like a good idea at the time, but ends badly.

Writing by Peter Hilton
