Convert a mysql latin1 Database to UTF-8 with data
Recently I had to convert a database of a system that used UTF-8 encoding but the database was default created with latin1.
All worked fine until greek letters was needed… than all greek characters were showing question marks (????)
main problem was that I already had data in the tables, so just recreating the database wouldnt cut it…
Research everywhere got me to a lot of different information but only one was close enough to help me solve my problem, so I decided to share this here with you.
This is under the assumption that you application connection to the database use UTF8 (set names)…
The article that had me start this was: http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
The simple steps there were (BACKUP YOUR DATABASE BEFORE!!):
1. Dump the database:
mysqldump -p --default-character-set=latin1 -c \ --insert-ignore --skip-set-charset crm > dump.sql
2. Convert the dump file to utf8 using iconv
iconv -f ISO-8859-1 -t UTF-8 dump.sql > dump_utf8.sql
3. Change table definition from latin1 to utf8 using perl
perl -pi -w -e 's/CHARSET=latin1/CHARSET=utf8/g;' dump_utf8.sql
4. make sure you didnt forget anything
grep latin1 dump_utf8.sql
5. Create a new UTF8 Database
mysql --user=foo -p --execute="CREATE DATABASE NEWDB_utf8 CHARACTER SET utf8 COLLATE utf8_general_ci;"
6. Restore converted utf8 dump to the database
mysql --max_allowed_packet=16M -p --default-character-set=utf8 NEWDB_utf8< dump_utf8.sql
7. dont forget to grant privileges to application user
mysql --user=foo -p --execute="GRANT ALL PRIVILEGES on NEWDB_utf8.* to 'some_user'@'localhost' identified by 'some_pass';"
Now you are basically done. change connection configuration to the new DB name and see it working, HOWEVER you might see some characters were not encoded correctly and now showing as gibberish. here comes dirty work…
you need to check which character is screwed in your table and change it, for example:
character (’) is screwed up, you need to check where it appears in the tables and change it’s unicode value(C292):
To find:
select count(*) from some_table WHERE subject REGEXP UNHEX('C296');
To Replace with utf8 equivelent E28093:
UPDATE some_table SET subject= REPLACE(subject, UNHEX('C292'),UNHEX('E28099') ) WHERE subject REGEXP UNHEX('C292');
same routine for (–) C296 to E28093
hope you won’t find too many of thes at least now you have a solution without losing your data.









