Writing by Peter Hilton

Localise names with the CLDR

Using the Unicode Common Locale Data Repository’s standard translations

2022-01-25 #software #data #design

[Image: Banknotes. Credit: Jason Leung]

Language localisation, part of the broader internationalisation and localisation topic, includes translating text to a local language for a particular locale. In this context, a locale identifier refers to a language associated with a specific geographic region, such as British English.

The Unicode Common Locale Data Repository (CLDR) contains a wide variety of locale-specific names, data formats, and validation rules, as well as details of various languages and scripts. You can use this data for standard localisations in your software.

CLDR lists

CLDR includes a number of standard lists, translated into each language. For example, each of the following is an entry in a (sometimes long) list.

| | English (en) | French (fr) | Russian (ru) | Japanese (ja) | Thai (th) |
|---|---|---|---|---|---|
| Languages | Russian | russe | русский | ロシア語 | รัสเซีย |
| Scripts | Cyrillic | cyrillique | кириллица | キリル文字 | ซีริลลิก |
| Regions | Russia | Russie | Россия | ロシア | รัสเซีย |
| Months | January | janvier | января | 1月 | มกราคม |
| Days | Monday | lundi | понедельник | 月曜日 | วันจันทร์ |
| Quarters | Q1 | 1er trimestre | 1-й квартал | 第1四半期 | ไตรมาส 1 |
| Time zones | Moscow Time | heure de Moscou | Москва | モスクワ時間 | เวลามอสโก |
| Currencies | Russian Ruble | rouble russe | российский рубль | ロシア ルーブル | รูเบิลรัสเซีย |
| Units | meters | mètres | метры | メートル | เมตร |
| Typography | italic | italique | курсив | イタリック | ตัวเอียง |

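CLDR translations, excluding the more obscure lists, include the names of languages, scripts, regions, months, days, quarters, time zones, currencies, units, and typographic styles.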

This means that you can use CLDR as a source for a list of countries, with translations to different languages, where each country is identified by its ISO 3166-1 two-letter country code.

In general, if you display these kinds of lists or selections in software, and you want to localise your software into multiple languages, you can get the translations from CLDR. Each entry has a code that all localisations share for looking up entries, sometimes a standard code as for countries, and sometimes a simple numeric code. You can also use lists of these codes to include or exclude sub-lists.
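As a minimal sketch of looking up entries by a shared code, the following Python keeps one name table per locale, keyed by the same ISO 3166-1 code. The `REGION_NAMES` data comes from the table above; the `localised_name` and `sub_list` helpers and the English fallback are illustrative assumptions, not part of CLDR.

```python
# Region names keyed by locale, then by the shared ISO 3166-1 code.
# In real use these tables would be loaded from CLDR data, not hard-coded.
REGION_NAMES = {
    "en": {"RU": "Russia"},
    "fr": {"RU": "Russie"},
    "ru": {"RU": "Россия"},
    "ja": {"RU": "ロシア"},
    "th": {"RU": "รัสเซีย"},
}


def localised_name(names, locale, code, fallback="en"):
    """Return the name for `code` in `locale`, then in `fallback`, else the code itself."""
    for candidate in (locale, fallback):
        name = names.get(candidate, {}).get(code)
        if name is not None:
            return name
    return code


def sub_list(names, locale, include):
    """Build a code-to-name mapping for only the codes in `include`."""
    return {code: localised_name(names, locale, code) for code in include}


print(localised_name(REGION_NAMES, "ja", "RU"))  # ロシア
print(sub_list(REGION_NAMES, "fr", ["RU"]))      # {'RU': 'Russie'}
```

Because every locale shares the same codes, your application can store only the code and resolve the display name when it renders.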

Filtered lists

To get a list of currency names, you first need to filter the list to exclude what you don’t consider proper currencies. You can include most of the ISO 4217 currencies whose three-letter currency codes follow the pattern of a two-letter country code followed by the currency name’s initial, such as USD (US Dollar). However, you should exclude the X currencies such as XAU (gold), deprecated currencies such as RUR (Russian Ruble 1991-1998), and the unknown currency XXX.
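The currency rules above might be sketched like this in Python. The `listable_currencies` function and the hard-coded `DEPRECATED` set are hypothetical; in practice the set of deprecated codes would come from CLDR data rather than a literal.

```python
# Deprecated codes to drop. Hard-coded here for illustration only.
DEPRECATED = {"RUR"}


def listable_currencies(codes, deprecated):
    """Keep codes that are neither X codes nor deprecated, preserving order."""
    return [
        code
        for code in codes
        if not code.startswith("X") and code not in deprecated
    ]


sample = ["USD", "XAU", "RUR", "RUB", "XXX"]
print(listable_currencies(sample, DEPRECATED))  # ['USD', 'RUB']
```

A blanket X rule is crude: it also drops codes such as XOF (West African CFA franc), so a real filter may need an explicit allow-list.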

Similarly, you also need to filter the CLDR territories to get a countries list. You can exclude large regions with three-digit codes such as 151 (Eastern Europe), and regions with two-letter codes: EU (European Union), EZ (Eurozone), QO (Outlying Oceania), and UN (United Nations). After that, it gets complicated.
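The first, simple step of the territory filter could look like the following Python sketch. The function name `is_country_candidate` is an assumption, and the sample data only contains entries mentioned in this article. It deliberately stops before the complicated cases.

```python
import re

# Two-letter codes named above that do not identify countries.
NON_COUNTRY_CODES = {"EU", "EZ", "QO", "UN"}


def is_country_candidate(code):
    """Reject three-digit region codes and the known non-country two-letter codes."""
    if re.fullmatch(r"[0-9]{3}", code):
        return False
    return code not in NON_COUNTRY_CODES


territories = {
    "151": "Eastern Europe",
    "EU": "European Union",
    "EZ": "Eurozone",
    "QO": "Outlying Oceania",
    "UN": "United Nations",
    "RU": "Russia",
}

countries = {
    code: name
    for code, name in territories.items()
    if is_country_candidate(code)
}
print(countries)  # {'RU': 'Russia'}
```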

In general, depending on which list you want, you may need to filter its contents. CLDR helps with this by including validity data that divides these lists’ entries into categories such as regular, deprecated, and unknown.

Published data

The CLDR releases page publishes the data in XML format, whose source resides in the cldr GitHub Project. The source includes one XML file per locale, e.g. en.xml. These XML files use the Unicode Locale Data Markup Language (LDML).

The cldr-json GitHub Project generates JSON representations from the XML source. This is also available via npm. Finally, various software libraries make CLDR available directly via their own APIs.
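As a rough sketch of consuming the JSON form, the following Python reads territory names from a document shaped like a cldr-json `territories.json` file. The nesting shown (`main`, then the locale, then `localeDisplayNames`, then `territories`) is an assumption about that layout, so check it against the release you use. The embedded sample is trimmed to a few entries from this article, and `territory_names` is a hypothetical helper.

```python
import json

# Trimmed stand-in for a cldr-json territories.json file (layout assumed).
SAMPLE = """
{
  "main": {
    "en": {
      "localeDisplayNames": {
        "territories": {
          "151": "Eastern Europe",
          "EU": "European Union",
          "RU": "Russia"
        }
      }
    }
  }
}
"""


def territory_names(document, locale):
    """Return the code-to-name mapping for one locale from a parsed document."""
    return document["main"][locale]["localeDisplayNames"]["territories"]


names = territory_names(json.loads(SAMPLE), "en")
print(names["RU"])  # Russia
```

With a real file, you would replace `json.loads(SAMPLE)` with `json.load` on the opened file.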

Essential complexity

You can easily get lost in the CLDR’s complexity, which reflects the world’s messiness. However, when you internationalise and localise software, you will find accurate locale data both valuable and satisfying.
