This is one of those problems that developers like to pretend doesn't exist. Until we actually have to deal with it of coarse. What I'm talking about is Character encoding hell. Lots of programmers (me included) have many times spent endless hours trying to figure out the mysteries of garbled text. So the reason I'm writing this blog post is to maybe help one or two people out there that are trying to do the same thing as I did today. Converting a MySQL database full of Swedish text (latin1) to a more manageable utf8 format.
We basically have different character encodings* because we have lots of different spoken and written languages in the world, and it was hard for the brilliant computer scientists back in the day to get it all right the first time.
*Note: Please read A tutorial on character code issues for a fuller explanation on character encodings.
The problem
I've got a MySQL database dump containing some 5000 products with descriptions all written in Swedish. As you might know Swedish contains characters like å, ä and ö. The database was also using latin1 as it's character set. So when firing it up in the Django admin I got something looking like this:
I spent about 5 hours trying to get this to display correctly. Django rightfully uses utf8 throughout, so converting my database to utf8 should take care of the problem?
alter database DATABASE_NAME charset=utf8;
No, it's not as easy as that, because when we convert the database to utf8, the data within stays the same. So what if we convert the table and the field?
alter table TABLE_NAME charset utf8;
alter table TABLE_NAME modify FIELD_NAME varchar(255) character set utf8;
No no, nothing changes, it's all the same garbled mess. So how can we get MySQL to actually convert the content?
The solution
Convert the field to BLOB. That's right, the first thing we need to do is converting the field type to a binary type. Then we change the character set of the database to utf8, and lastly we convert our field back to text or varchar.
alter table TABLE_NAME modify FIELD_NAME blob;
alter database DATABASE_NAME charset=utf8;
alter table TABLE_NAME modify FIELD_NAME varchar(255) character set utf8;
This forces MySQL to convert the binary format into varchar using character set utf8.
This trick should work regardless if you're using Django or not, that's just what I happened to be developing in. And telling Django to use a different character encoding is, from what I've heard, a nightmare.
By: Nilesh
Nice work man! I achieved similar using iconv's command line tools, but your method is so much more simple.
By: Penny Lane
Dude, thanks so much for your post, you've totally saved my life (well, at least several frustrating shaking-fist-at-the-world hours of it)! Can't thank you enough, seriously.
By: Hanky
Stabilt! :)
By: Ben
Thanks, thanks, thanks ! I was already imagining hours of trouble ahead... But your 3 lines made the whole process so easy! I still can't believe it :)
By: Kim
Cool! Also, I have some questions about POKE. If you have time please drop me an email so that I get yours :)
By: Bastiaan
you can do it quicker:
ALTER TABLE `name` MODIFY COLUMN `title` VARCHAR(255) CHARACTER SET binary;
ALTER TABLE `name` MODIFY COLUMN `title` VARCHAR(255) CHARACTER SET utf8;
By: antonio
wow.. great solution!
By: Mattias
@Bastiaan That is surely a quicker solution. Thanks for sharing.
By: marjan
Hi,
I'm still struggling to find the solution to convert the data in MySQL but with no success. This method doesn't solving my problems, although effect from this update is now characters are shown as question marks ???????? //
Any other hint is more than welcome
marjan
By: patrick
@Bastiaan
Possible to wildcard tables/columns to run it in one go?*
Thanks!
By: Jeroen Tomas
Thanks as lot!