Displaying articles in the databases category

Character encoding hell: Converting MySQL latin1 to utf8 for use in Django

This is one of those problems that developers like to pretend doesn't exist. Until we actually have to deal with it of coarse. What I'm talking about is Character encoding hell. Lots of programmers (me included) have many times spent endless hours trying to figure out the mysteries of garbled text. So the reason I'm writing this blog post is to maybe help one or two people out there that are trying to do the same thing as I did today. Converting a MySQL database full of Swedish text (latin1) to a more manageable utf8 format.

We basically have different character encodings* because we have lots of different spoken and written languages in the world, and it was hard for the brilliant computer scientists back in the day to get it all right the first time.

*Note: Please read A tutorial on character code issues for a fuller explanation on character encodings.

The problem

I've got a MySQL database dump containing some 5000 products with descriptions all written in Swedish. As you might know Swedish contains characters like å, ä and ö. The database was also using latin1 as it's character set. So when firing it up in the Django admin I got something looking like this:

Mysql latin1

I spent about 5 hours trying to get this to display correctly. Django rightfully uses utf8 throughout, so converting my database to utf8 should take care of the problem?

alter database DATABASE_NAME charset=utf8;

No, it's not as easy as that, because when we convert the database to utf8, the data within stays the same. So what if we convert the table and the field?

alter table TABLE_NAME charset utf8;
alter table TABLE_NAME modify FIELD_NAME varchar(255) character set utf8;

No no, nothing changes, it's all the same garbled mess. So how can we get MySQL to actually convert the content?

The solution

Convert the field to BLOB. That's right, the first thing we need to do is converting the field type to a binary type. Then we change the character set of the database to utf8, and lastly we convert our field back to text or varchar.

alter table TABLE_NAME modify FIELD_NAME blob;
alter database DATABASE_NAME charset=utf8;
alter table TABLE_NAME modify FIELD_NAME varchar(255) character set utf8;

This forces MySQL to convert the binary format into varchar using character set utf8.

Mysql utf8

This trick should work regardless if you're using Django or not, that's just what I happened to be developing in. And telling Django to use a different character encoding is, from what I've heard, a nightmare.