Skip to content

DBAzine.com

Sections
Personal tools
You are here: Home » DB2 » DB2 Distributed (UDB) Articles Archive » Tricky European Collations

Tricky European Collations

by Peter Gulutzan

This resource page accompanies Peter Gulutzan's article "Collations" in dbazine.com. It explains the points that need considering for German, Spanish and Scandinavian collations. There are two goals for this page:

      • Explain all the essential rules so that if you're trying to build your own collation (e.g. for IBM DB2) you'll understand what actually goes on with the dictionaries and/or the telephone books;
      • Explain the popular choices so that if you're trying to choose a built-in collation (e.g. for MS SQL Server or Oracle) you'll be aware of significant factors, such as multiple collations for one language, or inadequate vendor support for accent weights.

For most languages you only need to know a set of general rules that apply whenever there are accented characters, so I'll start with the general rules. Then come separate sections for the three areas that have special rules: German, Spanish, and Scandinavian.

Some of the description that follows might surprise even native speakers, and some of it contradicts or criticizes vendors' DBMS implementations. Nevertheless everything here is based on facts, gleaned from web sites, usenet discussions, language guides, and the author's wide reading of telephone books. In any case, any "issues" that you find are mere grist to the argument, since they prove that you must become aware of the action behind the scenes.

General Rules

There are two general rules:

      • Characters with accents have the same character weights as characters without accents; however, they have higher accent weights. This would be "accent insensitivity with non-accent preference" in Microsoft's language.
      • Uppercase letters have the same character weights and accent weights as lowercase letters; however, they have higher case weights. This would be "case insensitivity with case preference" in Microsoft's language.

For example, the LCMapString weights of the characters 'e' and 'é' are as follows (in Finnish_Swedish with Windows API):

Character's Unicode Name Character Weight Accent Weight Case Weight
e Small Letter E 14,33 2 18
é Small Letter E With Acute 14,33 14 #

(Note: LCMapString generates different weights for Finnish_Swedish characters than it generates for other collations.)

In this document the shorthand syntax "e = é" means that 'e' has the same character weight as 'é' so the search condition 'e'='é' should be True. Search conditions should ignore accent and case weights.

The two General Rules are all you need to know for Dutch, English, French, and Italian.

German

German collations have special rules for the "sharp s" letter 'ß' and the three "umlauted" letters 'Ä'and 'Ö' and 'Ü'. There are three German collations, DIN-1, DIN-2, and Austrian. The letters DIN stand for Deutsches Institut Für Normung, the name of the German body that issued the DIN-5007 standard for German sorting.

In all three collations, 'ß' has the same character weight as 'ss'.

In the DIN-1 collation, the umlauted letters have the same character weights as their unaccented equivalents, that is 'Ä' = 'A', 'Ö' = 'O', and 'Ü' = 'U'. For example these words are in order:
'Mueller' 'Muffler' 'Muller' 'Mü ller'

DIN-1 is for lists of words; it's the sort order for dictionaries. It's also the sort order for phone books in Switzerland.

In the DIN-2 collation, the umlauted letters have the same character weights as their unaccented equivalents combined with the letter 'E', that is 'Ä' = 'AE', 'Ö' = 'OE', and 'Ü' = 'UE'. For example these names are in order: 'Mueller' 'Mü ller' 'Muffler' 'Muller'

DIN-2 is for lists of names. It's the sort order for phone books in Germany, so Microsoft calls it the GERMAN_PHONE_BOOK sort style. With Oracle, the name is "Xgerman". The letter X is supposed to stand for "eXtended" and Oracle's extended linguistic sorts begin with X if there is a complexity, such as the one-to-two mapping of DIN-2 characters.

In the Austrian collation, the umlauted letters have character weights slightly greater than their unaccented equivalents, that is 'Ä' > 'AZ', 'Ö' > 'OZ', and 'Ü' > 'UZ'. For example these names are in order: 'Mueller' 'Muffler' 'Muller' 'Mü ller'

Austrian is rarely used; even the online edition of the Vienna (Austria) phone book is now in DIN-2 order. So the large DBMSs don't have built-in support for Austrian.

Microsoft's Windows collation won't take into account the accent weights for the umlauted characters in DIN-2 or for 'ß' in DIN-1, so a sorted list could have 'Müller' before 'Mueller'. As the German DIN-5007 standards document puts it: that's "falsch." But perhaps this is a temporary phenomenon that will be fixed in later versions of Windows.

Summary: You can't simply select something called "German" - to get it right, you have to know which German-speaking country you're coding for, and whether you have a list of words or a list of names.

Spanish

In the traditional Spanish collation there are three special rules:

      • 'CH' is a separate letter between 'C' and 'D'.
      • 'LL' is a separate letter between 'L' and 'M'.
      • 'Ñ' is a separate letter between 'N' and 'O'.

You probably learned those rules in high school Spanish. Well, start unlearning them. There is a modern collation from the Association of Spanish Language Academies [la Asociación de Academias de la Lengua Española] in which:

      • 'CH' is nothing special.
      • 'LL' is nothing special.
      • 'Ñ' is still a separate letter between 'N' and 'O'.

Both Microsoft and Oracle support traditional Spanish rules for legacy reasons. It isn't nice to make users change existing databases. But over time more Spanish-speaking users will be expecting the modern Spanish collation.

In the online versions of the telephone books for Madrid and for Buenos Aires, 'LL' comes before 'LM'. For example 'Allende' appears before 'Almeida' and 'Abella' appears before 'Abels'. The same is true for The Collins Spanish Dictionary, Sixth Edition. In other words, the switch to modern Spanish is well advanced.

Scandanavian

In the Scandinavian collations there are additional letters after 'Z'. The last letters of the alphabet are:

Danish/Norwegian: U V W X Y Z Æ Ø Å

      • Icelandic: U V W X Y Z Þ Æ Ö
      • Swedish/Finnish: U V X Y Z Å Ä Ö

Danish/Norwegian special rules are:

      • 'Ü' = 'Y'
      • 'Ä' = 'Æ'
      • 'AA' = 'Å'

The 'AA' = 'Å' rule is often broken. Originally it was for Danish names only, so that 'Aachen' (Germany) would appear at the start of a list but 'Aalborg' (Denmark) would appear at the end. Such distinctions are too difficult for a computer, so LCMapString gives 'AA' and 'Å' the same character weight. Oracle always ignores the rule.

Icelandic special rules are:

      • 'ETH' (Ð) is between 'D' and 'E'.

Swedish/Finnish special rules are:

      • 'Ø' = 'Ö'
      • 'Õ' = 'Ö'
      • 'Ü' = 'Y'
      • 'Æ' = 'AE' (at both character and accent level)
      • 'W' = 'V'

The 'W' = 'V' rule is often broken. Here is a snapshot of the situation:

      • Follows Swedish/Finnish 'W' = 'V' Rule:

Microsoft "phone" SQL collation 184 (SQL_SwedishPhone_Pref_Cp1_CI_AS)
Microsoft Windows collation
Oracle
Gothenburg (Sweden) printed telephone book

      • Follows Swedish/Finnish 'W' > 'V' Rule:

Microsoft "standard" SQL collation 185 (SQL_SwedishStd_Pref_Cp1_CI_AS)

MySQL
InterBase
Helsinki (Finland) printed telephone book
Stockholm (Sweden) online telephone book

Microsoft SQL Server has two types of collations. An SQL collation is legacy from Microsoft SQL Server 7 and its use may be discouraged. A Windows collation, on the other hand, is new in Microsoft SQL Server 2000. The Windows collations should generate exactly what the Windows API functions generate. There is some disagreement whether the "phone" collation is intended for "phonetic Swedish/Finnish" or for "Swedish/Finnish telephone books" -- in any case it is neither, so the matter is moot. The "standard" SQL collation corresponds to the Posix locale descriptor.

It is impossible to produce a single collation that handles all the rules. So if you see a collation named "Nordic" (as in Microsoft PDS Basic) or a collation named "Scandinavian" (as in Sybase), you can be sure it is incorrect for some characters.

--

Peter Gulutzan is co-author of SQL-99 Complete Really (CMP Books 1999) and SQL Performance Tuning, which Addison-Wesley will publish in September 2002.


Contributors : Peter Gulutzan
Last modified 2005-06-22 01:06 PM

Free to download, develop, deploy.

Best Practice: Free eBooks, articles and videos on IT service management for the enterprise.

 
 

Powered by Plone