All About Collation

Collation is one of the easiest settings to overlook when configuring SQL Server but it’s one that can come back to bite you if it isn’t properly set. In this post I intend to cover the basics of collations, common issues resulting from collation misconfiguration, and basic resolution of those common issues.

So what is a collation? One way to think of a collation is as a dialect of a language. For example, in the U.S. we use the word color. Other English speaking countries use the word colour instead. Who is right? Well, that would depend entirely on where you are from. If you are in the U.S., color is probably going to be considered correct. If you had never even seen the word colour, would you know it means the same thing as color?

Collations can be thought of in a similar way. There are many variations of a language, each with different rules associated. These rules are set by the collation and govern things like sorting, case sensitivity, and accent sensitivity. In one collation, ‘ABC’ may be the same thing as ‘abc’ while in others that may not be the case. The “language” associated with a collation is going to be a code page.

Single byte (8 bit) characters (char, varchar, text) can make a total of 256 different combinations with their bits being flipped on and off. Since there are more than 256 characters in the entire world, there is no way to represent every possible character in a single byte. This is where code pages come in. Code pages exist at the Windows level and keep track of which characters the 256 combinations represent.

For example, English speaking countries are going to use the Latin code page (1252). The 256 characters on that code page are going to be characters used in the English language. We wouldn’t be able to fit every Greek character in that same code page, so Greek has it’s own code page as well (1253). All collations, their description, and their code page can be found with the query below.

Unicode data types (nchar, nvarchar, ntext) are stored double byte so they can actually store multiple languages without having to reference a code page. Our Unicode data types will still play by the sorting and comparison rules of the collation, so you still don’t get to ignore collation if you are using Unicode data types.

Collation is set at the instance level, the database level, and the column level. When a new database is created it will use the instance collation unless otherwise specified. Your system databases will also match that instance collation. Any character columns created in your user databases will use the database collation when they are created unless a collation is defined in the create statement.

When problems with collation occur it’s typically because something was overlooked in the planning stages. At some point somebody realizes the wrong collation was used and they decide to change it at the database level. Problem solved right? Unfortunately that is not the case as the database collation is simply what SQL Server looks to when new objects are created. Changing this setting will have no impact on existing objects. So in this scenario, the change is made and time goes on and new tables are added to the database using this new collation. What happens at this point is that you have two collations actively used in your database and if there is ever a need to compare columns in different collations you will encounter errors.

The other issues commonly seen are TempDB collation mismatches. We know that TempDB’s default collation is going to match the instance collation so any objects (such as temp tables) created in TempDB will use that collation by default. So what happens if your user database uses a different collation and you have to compare it to data in a temp table created with TempDB’s default collation? You guessed it, you get collation mismatch errors.

The proper fix for these issues really depends on the scenario. With collation mismatches within a user database you can simply alter the columns if there are only a few tables impacted. Unfortunately these issues, in my experience, are rarely caught that early. I have worked with databases that have had multiple collations throughout the database and manually altering them was more work than it was worth. In this scenario I would just migrate all the data to a new database that was created in the correct collation.

Your other option if there are only a few tables is to collate the columns in question within the query. This is done by adding a COLLATE clause to the column. Unless this is a report or some other ad-hoc query, this probably isn’t going to be a very viable option.

With TempDB collation mismatches the fix is somewhat similar. You can take the sledgehammer approach and either change the collation of the database and all existing objects to match the instance (and therefore TempDB) or change the instance collation. Neither is an easy task and can introduce even more issues if done incorrectly. Alternatively, you could modify the code like in the other scenario. This could be accomplished by specifying a collation when creating the temporary object, or modifying the queries to collate the columns on the fly.

Not scared yet? Well the above are best case scenarios assuming that the collations in question share the same code page. When they are using different code pages a potential for data loss exists when changing between collations, and you probably won’t know when it happens. Your best bet is to avoid these issues altogether by making sure the proper collation is used in the first place. As a general rule, your instance and all databases on that instance should use the same collation. This is the best way to be sure that you will never encounter issues within your user databases or in TempDB.

Let’s create a test database and step through a few examples.

Now let’s take a look at the collation at the instance, object, and table levels.

CollationInfo

We can see that our instance collation is SQL_Latin1_General_CP1_CI_AS. As we learned, our database collation will match that along with our columns since we did not specify otherwise. Now let’s change the database collation of our database and create a new table with that collation. If we select from the two tables, we will see that the data looks identical.

CollationLooksSame

Both sets of data look identical, but watch what happens when we try to join them together.

CollationError

And now we have the dreaded “cannot resolve the collation conflict” error. This is why collation mismatches can be a silent killer. You generally don’t know it’s in issue until it jumps out at you. As we discussed earlier, there are two ways to fix this. Alter the collation of the columns themselves, or modify the query. First we will alter the columns with success.

JoinSuccess

The other option is to use the COLLATE clause within our query. I have set our columns back to the wrong collation for this example.

JoinSuccess

Let’s now look at TempDB collation mismatches. As it stands currently, our person1 table is using the SQL_Latin1_General_CP1_CI_AS collation and person2 is using Latin1_General_100_CI_AI. Here we will create and populate a temp table and then try to join to it.

CollationError

Once again, our join fails because our temp table was created using the default collation of TempDB. If we were to include a collation in our create table statement, this would work without issue. Alternatively we could modify our select query to use the COLLATE clause like in the previous example.

JoinSuccess

Unfortunately, most real world examples I have encountered aren’t always this simple to resolve. If thousands of objects are using multiple collations the solution becomes a little less clear. It becomes even more difficult when different code pages are in play as data loss is a possibility. With proper planning, however, these situations can be avoided entirely.

Leave a Reply

Your email address will not be published. Required fields are marked *