Unicode Collation Algorithm v11
The Unicode Collation Algorithm (UCA) is a specification (Unicode Technical Report #10) that defines a customizable method of collating and comparing Unicode data. Collation determines how data is sorted, as with a SELECT … ORDER BY clause. Comparison is relevant for searches that use ranges with less than, greater than, or equal to operators.
Customizability is an important factor for various reasons such as the following.
- Unicode supports a vast number of languages. Letters that may be common to several languages may be expected to collate in different orders depending upon the language.
- Characters that appear with letters in certain languages such as accents or umlauts have an impact on the expected collation depending upon the language.
- In some languages, combinations of several consecutive characters should be treated as a single character with regard to its collation sequence.
- There may be certain preferences as to the collation of letters according to case. For example, should the lowercase form of a letter collate before the uppercase form of the same letter, or vice versa?
- There may be preferences as to whether punctuation marks such as underscore characters, hyphens, or space characters should be considered in the collating sequence, or whether they should simply be ignored as if they did not exist in the string.
Given all of these variations and the vast number of languages supported by Unicode, a method is needed to select the specific criteria for determining a collating sequence. This is what the Unicode Collation Algorithm defines.
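To make the case and punctuation preferences listed above concrete, the following Python sketch shows how two switches can change the resulting order. It is a toy illustration only: the function simple_key and its parameters upper_first and ignore_punctuation are hypothetical names invented for this example, not ICU collation attributes, and the logic is far simpler than the Unicode Collation Algorithm.

```python
import string

def simple_key(text, upper_first=False, ignore_punctuation=False):
    """Toy sort key with two switches (hypothetical names, not ICU attributes)."""
    if ignore_punctuation:
        # Drop ASCII punctuation and spaces as if they were not in the string.
        skipped = string.punctuation + " "
        text = "".join(ch for ch in text if ch not in skipped)
    # The letters themselves are compared first, without regard to case.
    letters = text.lower()
    # Case is consulted only when the letters are equal.
    if upper_first:
        case = [0 if ch.isupper() else 1 for ch in text]
    else:
        case = [1 if ch.isupper() else 0 for ch in text]
    return (letters, case)

words = ["co-op", "coop", "Coop", "cool"]

# Punctuation considered, lowercase first.
print(sorted(words, key=simple_key))
# ['co-op', 'cool', 'coop', 'Coop']

# Punctuation ignored, lowercase first.
print(sorted(words, key=lambda w: simple_key(w, ignore_punctuation=True)))
# ['cool', 'co-op', 'coop', 'Coop']

# Punctuation ignored, uppercase first.
print(sorted(words, key=lambda w: simple_key(w, upper_first=True,
                                             ignore_punctuation=True)))
# ['cool', 'Coop', 'co-op', 'coop']
```

The same four strings sort three different ways depending on the chosen preferences, which is why a collation must state its criteria explicitly.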
Note
Another advantage of using ICU collations (the implementation of the Unicode Collation Algorithm) is performance. Sorting tasks, including B-tree index creation, can complete in less than half the time they take with a non-ICU collation. The exact performance gain depends on your operating system version, the language of your text data, and other factors.
The following sections provide a brief, simplified explanation of the Unicode Collation Algorithm concepts. As the algorithm and its usage are quite complex with numerous variations, refer to the official documents cited in these sections for complete details.
Basic Unicode Collation Algorithm Concepts
The official information for the Unicode Collation Algorithm is specified in Unicode Technical Report #10, which can be found on The Unicode Consortium website at:
http://www.unicode.org/reports/tr10/
The International Components for Unicode (ICU) project also provides much useful information. An explanation of the collation concepts can be found on its website at:
http://userguide.icu-project.org/collation/concepts
The basic concept behind the Unicode Collation Algorithm is the use of multilevel comparison. This means that a number of levels are defined, which are listed as level 1 through level 5 in the following bullet points. Each level defines a type of comparison. Strings are first compared using the primary level, also called level 1.
If the order can be determined based on the primary level, then the algorithm is done. If the order cannot be determined based on the primary level, then the secondary level, level 2, is applied. If the order can be determined based on the secondary level, then the algorithm is done; otherwise the tertiary level is applied, and so on. There is typically a final tie-breaking level to determine the order if it cannot be resolved by the prior levels.
- Level 1 – Primary Level for Base Characters. The order of basic characters such as letters and digits determines the difference, such as A < B.
- Level 2 – Secondary Level for Accents. If there are no primary level differences, then the presence or absence of accents and other such characters determines the order, such as a < á.
- Level 3 – Tertiary Level for Case. If there are no primary level or secondary level differences, then a difference in case determines the order, such as a < A.
- Level 4 – Quaternary Level for Punctuation. If there are no primary, secondary, or tertiary level differences, then the presence or absence of white space characters, control characters, and punctuation determines the order, such as -A < A.
- Level 5 – Identical Level for Tie-Breaking. If there are no primary, secondary, tertiary, or quaternary level differences, then some other difference such as the code point values determines the order.
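The level-by-level comparison described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ICU implementation: the function toy_sort_key is a hypothetical name, the weights are invented for the example rather than taken from the Unicode collation tables, level 4 (punctuation) is omitted, and the original string stands in for the final tie-breaking level.

```python
import unicodedata

def toy_sort_key(text):
    """Build a simplified multilevel sort key (levels 1-3 plus a tie-breaker)."""
    primary, secondary, tertiary = [], [], []
    # Decompose so that 'á' becomes 'a' followed by a combining accent.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Level 2: an accent contributes only a secondary weight.
            secondary.append(ord(ch))
        else:
            # Level 1: the base character, ignoring case.
            primary.append(ch.lower())
            # Level 2: unaccented placeholder weight for this base character.
            secondary.append(0)
            # Level 3: lowercase sorts before uppercase.
            tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare element by element, so a later level is consulted
    # only when every earlier level is equal.
    return (primary, secondary, tertiary, text)

print(sorted(["B", "á", "A", "a"], key=toy_sort_key))
# ['a', 'A', 'á', 'B']

print(sorted(["rôle", "Role", "role"], key=toy_sort_key))
# ['role', 'Role', 'rôle']
```

In both results an accent difference (level 2) outweighs a case difference (level 3), and neither matters when the base letters (level 1) already differ.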
International Components for Unicode
The Unicode Collation Algorithm is implemented by open source software provided by the International Components for Unicode (ICU). The software is a set of C/C++ and Java libraries.
When Advanced Server is used to create a collation that invokes the ICU components to produce the collation, the result is referred to as an ICU collation.
Locale Collations
When creating a collation for a locale, a predefined ICU short form name for the given locale is typically provided.
An ICU short form is a method of specifying collation attributes, which are the properties of a collation. See Collation Attributes for additional information on collation attributes.
There are predefined ICU short forms for locales. The ICU short form for a locale incorporates the collation attribute settings typically used for the given locale. This simplifies the collation creation process by eliminating the need to specify the entire list of collation attributes for that locale.
The system catalog pg_catalog.pg_icu_collate_names contains a list of the names of the ICU short forms for locales. The ICU short form name is listed in column icu_short_form.