Localise names with the CLDR

Using the Unicode Common Locale Data Repository’s standard translations
2022-01-25 #data #DDD

Language localisation, part of the broader internationalisation and localisation topic, includes translating text to a local language for a particular locale. In this context, a locale identifier refers to a language associated with a specific geographic region, such as British English.

The Unicode Common Locale Data Repository (CLDR) contains a wide variety of locale-specific names, data formats, and validation rules, as well as details of various languages and scripts. You can use this data for standard localisations in your software.

CLDR lists

CLDR includes a number of standard lists, translated into each language. For example, each of the following is an entry in a (sometimes long) list.

	English (`en`)	French (`fr`)	Russian (`ru`)	Japanese (`ja`)	Thai (`th`)
Languages	Russian	russe	русский	ロシア語	รัสเซีย
Scripts	Cyrillic	cyrillique	кириллица	キリル文字	ซีริลลิก
Regions	Russia	Russie	Россия	ロシア	รัสเซีย
Months	January	janvier	января	1月	มกราคม
Days	Monday	lundi	понедельник	月曜日	วันจันทร์
Quarters	Q1	1er trimestre	1-й квартал	第1四半期	ไตรมาส 1
Time zones	Moscow Time	heure de Moscou	Москва	モスクワ時間	เวลามอสโก
Currencies	Russian Ruble	rouble russe	российский рубль	ロシアルーブル	รูเบิลรัสเซีย
Units	meters	mètres	метры	メートル	เมตร
Typography	italic	italique	курсив	イタリック	ตัวเอียง

CLDR translations - excluding the more obscure lists - include the names of:

languages
scripts (writing systems)
territories, including countries
calendar names - quarters, months and weekdays, including abbreviations
time zones
currencies
units of measurement
typographic styles

This means that you can use CLDR as a source for a list of countries, translated into different languages, where each country is identified by its ISO 3166-1 two-letter country code.

In general, if you display these kinds of lists or selections in software, and you want to localise your software into multiple languages, you can get the translations from CLDR. Each entry has a code that all localisations share for looking up entries, sometimes a standard code as for countries, and sometimes a simple numeric code. You can also use lists of these codes to include or exclude sub-lists.
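As a minimal sketch of looking up entries by a shared code, the following uses a hand-copied dictionary that holds only the Regions row from the table above. The `REGION_NAMES` structure and the `region_name` helper are illustrative names, not part of CLDR or of any library.

```python
# Hand-copied sample data: locale -> region code -> localised name.
# A real application would load this from CLDR instead.
REGION_NAMES = {
    "en": {"RU": "Russia"},
    "fr": {"RU": "Russie"},
    "ru": {"RU": "Россия"},
    "ja": {"RU": "ロシア"},
    "th": {"RU": "รัสเซีย"},
}


def region_name(code: str, locale: str, fallback: str = "en") -> str:
    """Look up a region name by its shared code for the given locale."""
    names = REGION_NAMES.get(locale, {})
    if code in names:
        return names[code]
    # No name for this locale: try the fallback locale, then the bare code.
    return REGION_NAMES.get(fallback, {}).get(code, code)


for locale in ("en", "fr", "ja", "de"):
    print(locale, region_name("RU", locale))
```

Because every localisation shares the same code, the software only needs to store `RU`, and can choose the display language at the last moment.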

Filtered lists

To get a list of currency names, you first need to filter the list to exclude what you don’t consider proper currencies. You can include most of the ISO 4217 currencies whose three-letter currency codes follow the pattern of a two-letter country code followed by the currency name’s initial, such as USD (US Dollar). However, you should exclude the X currencies such as XAU (gold), deprecated currencies such as RUR (Russian Ruble 1991-1998), and the unknown currency XXX.
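The currency filter described above might be sketched as follows. The sample names and the `DEPRECATED` set are hand-written for illustration, and `proper_currencies` is a hypothetical helper. The blanket rule for codes starting with X follows the text and is an assumption to revisit for your own list.

```python
# Hand-written sample: currency code -> English name.
SAMPLE_CURRENCIES = {
    "USD": "US Dollar",
    "RUB": "Russian Ruble",
    "RUR": "Russian Ruble (1991-1998)",
    "XAU": "Gold",
    "XXX": "Unknown Currency",
}

# Assumption: in real use this set would be derived from CLDR data
# rather than maintained by hand.
DEPRECATED = {"RUR"}


def proper_currencies(names: dict, deprecated: set) -> dict:
    """Drop X-prefixed codes (which include XXX) and deprecated codes."""
    return {
        code: name
        for code, name in names.items()
        if not code.startswith("X") and code not in deprecated
    }


print(sorted(proper_currencies(SAMPLE_CURRENCIES, DEPRECATED)))
```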

Similarly, you also need to filter the CLDR territories to get a countries list. You can exclude large regions with three-digit codes, such as 151 (Eastern Europe), and the two-letter codes that do not identify countries: EU (European Union), EZ (Eurozone), QO (Outlying Oceania), and UN (United Nations). After that, it gets complicated.
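A first pass at the territory filter could look like this. It covers only the two rules named above, so the result is a list of candidates rather than a finished countries list. The sample data and the `candidate_countries` helper are illustrative.

```python
import re

# Hand-written sample: territory code -> English name.
SAMPLE_TERRITORIES = {
    "151": "Eastern Europe",
    "EU": "European Union",
    "EZ": "Eurozone",
    "QO": "Outlying Oceania",
    "UN": "United Nations",
    "RU": "Russia",
    "FR": "France",
}

# Two-letter codes from the text that do not identify countries.
NON_COUNTRY_CODES = {"EU", "EZ", "QO", "UN"}


def candidate_countries(territories: dict) -> dict:
    """Drop three-digit region codes and the known non-country codes."""
    return {
        code: name
        for code, name in territories.items()
        if not re.fullmatch(r"[0-9]{3}", code) and code not in NON_COUNTRY_CODES
    }


print(sorted(candidate_countries(SAMPLE_TERRITORIES)))
```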

In general, depending on which list you want, you may need to filter its contents. CLDR helps with this by including validity data that divides these lists’ entries into categories:

regular
special
deprecated
reserved
private use
unknown
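To illustrate how these categories can drive a filter, here is a sketch that groups codes by category from a small inline XML sample. The element and attribute names (`idValidity`, `id`, `type`, `idStatus`) are an assumption about how the validity data is laid out, so check them against the release you use. The published files may also abbreviate runs of codes, which this sketch does not expand.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the layout is assumed, not copied from a release.
SAMPLE_VALIDITY = """
<supplementalData>
  <idValidity>
    <id type="currency" idStatus="regular">RUB USD</id>
    <id type="currency" idStatus="deprecated">RUR</id>
    <id type="currency" idStatus="unknown">XXX</id>
  </idValidity>
</supplementalData>
"""


def codes_by_status(xml_text: str, id_type: str) -> dict:
    """Group the codes of one list (e.g. 'currency') by validity category."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for element in root.iter("id"):
        if element.get("type") == id_type:
            # Codes are whitespace-separated in the element text.
            codes = (element.text or "").split()
            result.setdefault(element.get("idStatus"), set()).update(codes)
    return result


statuses = codes_by_status(SAMPLE_VALIDITY, "currency")
print(sorted(statuses["regular"]))
```

With the categories in hand, a currency selection list can keep the regular codes and ignore the rest, instead of relying on a hand-maintained exclusion list.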

Published data

The CLDR releases page publishes the data in XML format, whose source resides in the cldr GitHub Project. The source includes one XML file per locale, e.g. en.xml. These XML files use the Unicode Locale Data Markup Language (LDML).
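Reading names from an LDML file takes only a standard XML parser. The sketch below parses a heavily abbreviated, hand-written sample shaped like a locale file such as en.xml. Skipping entries that carry an `alt` attribute is one possible choice for handling alternative names, and `territory_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written sample in the shape of a locale file.
SAMPLE_LDML = """
<ldml>
  <localeDisplayNames>
    <languages>
      <language type="ru">Russian</language>
    </languages>
    <territories>
      <territory type="RU">Russia</territory>
      <territory type="HK">Hong Kong SAR China</territory>
      <territory type="HK" alt="short">Hong Kong</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def territory_names(ldml_text: str) -> dict:
    """Map territory codes to names, ignoring alternative (alt) forms."""
    root = ET.fromstring(ldml_text.strip())
    names = {}
    for element in root.findall("./localeDisplayNames/territories/territory"):
        if element.get("alt") is None:
            names[element.get("type")] = element.text
    return names


print(territory_names(SAMPLE_LDML))
```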

The cldr-json GitHub Project generates JSON representations from the XML source. These JSON files are also available via npm. Finally, various software libraries make CLDR available directly via their own APIs.
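The JSON form needs even less code. This sketch assumes the nesting `main`, then the locale, then `localeDisplayNames`, then `territories`, which is how the territory names appear to be laid out in cldr-json. The inline sample is hand-written and abbreviated, and `load_territories` is a hypothetical helper, so verify the layout against the package version you install.

```python
import json

# Abbreviated, hand-written sample; the real file has many more entries.
SAMPLE_JSON = """
{
  "main": {
    "fr": {
      "localeDisplayNames": {
        "territories": {
          "FR": "France",
          "RU": "Russie"
        }
      }
    }
  }
}
"""


def load_territories(json_text: str, locale: str) -> dict:
    """Return the territory code -> name mapping for one locale."""
    data = json.loads(json_text)
    return data["main"][locale]["localeDisplayNames"]["territories"]


territories = load_territories(SAMPLE_JSON, "fr")
print(territories["RU"])
```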

Essential complexity

You can easily get lost in the CLDR’s complexity, which reflects the world’s messiness. However, when you internationalise and localise software, you will find accurate locale data both valuable and satisfying.

Writing by Peter Hilton
