Mixing kinds of text

Where embedding one kind of text in another goes bad 2022-09-20 #data #DDD

Data modelling usually separates text from numbers and dates, but ignores differences between four kinds of text: writing, symbols, data and code. You probably don’t usually worry about this, unless you make a habit of overthinking modelling. You should think about them, though, whenever you decide to mix them.

Mostly harmless embedding

Four kinds of embedding cause the least harm.

Embedding writing in other writing - using quotation marks and quotation blocks
Embedding writing in code or data - literal text values
Embedding symbols in writing - often using quotation marks or emphasis
Embedding symbols in code or data - atomic text values

More sophisticated systems separate writing instead of embedding it. Topic-based authoring allows technical writers to embed writing in other writing, to assemble content from smaller blocks. Content management systems and localisation systems make it easier to embed writing in code, in web sites and other software, to separate writing and coding environments.

Avoid embedding data in different kinds of text

Embedding structured data in other kinds of text causes character escaping and content type identification issues. Escaping causes ugly mixed data formats, and many escaped characters.

Having seen both, I find it hard to say whether embedding XML in JSON offends me more than embedding JSON in XML, or the other way around. For example, a system built during the XML age might include this:

<greeting><message lang="en" style="formal">Hello, world!</message></greeting>

Later, in the JSON age, lazy integration with another system might fail to replace the XML model with JSON:

{
   "type": "log",
   "level": "info",
   "body": "<greeting><message lang=\"en\" style=\"formal\">Hello, world!</message></greeting>"
}

The worst example of this I’ve seen appeared in an HTTP API for validating XML that required the XML document embedded in a JSON request body. Needless to say, the API didn’t do HTTP content negotiation correctly.

For the other way around, you can choose between two ugly options. The first option requires escaping quotes, and a few other XML characters:

<greeting>
   <message lang="en" style="formal">Hello, world!</message>
   <notification>
      { &quot;type&quot;: &quot;log&quot;, &quot;level&quot;: &quot;info&quot; }
   <notification>
</greeting>

The second approach embeds an unparsed character data (CDATA) block:

<greeting>
   <message lang="en" style="formal">Hello, world!</message>
   <notification><![CDATA[ { "type": "log", "level": "info" } ]]><notification>
</greeting>

When you do this, your application requires special processing to handle a different content type for part of the data. Making the content type explicit would solve the wrong problem; instead, avoid mixing types when combining data. In most cases, do it all in JSON.

Note that making the content type explicit does make sense when embedding data in writing. For example, the first XML fragment (above), appears in this article’s Markdown source as:

```xml
<greeting><message lang="en" style="formal">Hello, world!</message></greeting>
``` 

Use MIME types to identify embedded data

Unfortunately, Markdown doesn’t have a standard way to indicate the embedded content type, and GitHub-Flavoured Markdown’s Fenced code blocks use xml instead of a text/xml MIME type. This matters, because without a standard way to identify the embedded data, you need a custom software implementation to handle the combination.

Avoid embedding code in data

The same applies to embedding code in data, as when saving code in a database. Store code separately from other types of code and data, and store the MIME type separately. As for embedding code in code, in the same file, let’s pretend that never happens.

Don’t embed text in symbols

Identifiers and other symbols have atomic structure, so you can’t embed anything in them. Good identifiers, that is. Software designers sometimes embed data in identifiers, such as putting the year in a registration number, or a gender in a personal number. Don’t do this.

Writing by Peter Hilton

Mostly harmless embedding

Avoid embedding data in different kinds of text

Use MIME types to identify embedded data

Avoid embedding code in data

Don’t embed text in symbols