Convert MySQL database from latin1 to utf8
You are going to see many blog posts stating that you can just dump a MySQL database via mysqldump – globally replace “latin1” (or some other character set) in the dump file – and then import that into a utf8 database and it’ll just work.
This appears, however, to be WRONG. It does not force MySQL to convert the text; it only fools you into believing that your latin1 characters have been converted. Unless you actually convert the text yourself, the columns will just be unconverted latin1 sitting in a utf8 table.
One way to do this is to convert the column in question to binary and back again – assuming your database/table is set to utf8, this will force MySQL to convert the character set correctly.
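For instance, a minimal sketch of the binary round trip (my_table and my_column are hypothetical; match the type and length to your actual schema). This applies where a column is declared latin1 but actually holds UTF-8 bytes; a blob has no character set, so MySQL keeps the bytes as-is and simply relabels them:
-- round-trip through a blob so MySQL reinterprets the bytes
-- instead of transcoding them
ALTER TABLE my_table MODIFY my_column BLOB;
ALTER TABLE my_table MODIFY my_column VARCHAR(255) CHARACTER SET utf8;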
Another, better, way is to use iconv to convert during the dump process. This converts the latin1 characters to utf8 properly:
mysqldump --add-drop-table database_to_correct | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | mysql database_to_correct
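If the replace utility isn't available (it shipped with MySQL but was removed in 8.0), a two-step sketch with sed and placeholder file names accomplishes the same thing:
# dump with the schema still marked latin1
mysqldump --add-drop-table database_to_correct > dump.sql
# swap the declared charset in the table definitions
sed -i 's/CHARSET=latin1/CHARSET=utf8/g' dump.sql
# transcode the actual data bytes, then reload
iconv -f latin1 -t utf8 dump.sql > dump-utf8.sql
mysql database_to_correct < dump-utf8.sql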
A longer, step-by-step explanation follows. If you have an existing MySQL database that is already encoded in latin1, here's how to convert it to UTF-8:
1. Make sure you've made all the modifications to the configuration settings in your my.ini file, as described above (see the sketch below).
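These are typically along the following lines (a sketch only; the option names are standard MySQL settings, but your values may differ):
[client]
default-character-set = utf8

[mysql]
default-character-set = utf8

[mysqld]
character-set-server = utf8
collation-server = utf8_general_ci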
2. Execute the following command:
ALTER SCHEMA `your-db-name` DEFAULT CHARACTER SET utf8;
3. Via the command line, verify that everything is properly set to utf8:
mysql> show variables like 'char%';
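The output varies by version and platform, but it should look roughly like this, with utf8 everywhere except character_set_filesystem (binary) and character_sets_dir (a path):
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+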
4. Create a dump file with latin1 encoding for the table you want to convert:
mysqldump -u USERNAME -pDB_PASSWORD --opt --skip-set-charset --default-character-set=latin1 --skip-extended-insert DATABASENAME --tables TABLENAME > DUMP_FILE_TABLE.sql
e.g.:
mysqldump -u root --opt --skip-set-charset --default-character-set=latin1 --skip-extended-insert artists-database --tables tbl_artist > tbl_artist.sql
5. Do a global search and replace of the charset in the dump file from latin1 to utf8 (note that MySQL's name for the character set is utf8, not UTF-8):
e.g., using Perl:
perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' DUMP_FILE_TABLE.sql
Note to Windows users: this charset string replacement (from latin1 to utf8) can also be done using find-and-replace in WordPad (or some other text editor, such as vim). Be sure to save the file just as it is, though (don't save it as a Unicode text file!).
6. From this point on, we will be messing with the database data, so it would be prudent to back up the database if you haven't already done so. Then restore the dump into the database:
mysql> source DUMP_FILE_TABLE.sql;
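Alternatively, the dump can be restored straight from the shell, pinning the client character set so the session doesn't re-interpret the bytes (a sketch with placeholder credentials):
mysql -u USERNAME -p --default-character-set=utf8 DATABASENAME < DUMP_FILE_TABLE.sql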
7. Search for any records that may not have converted properly and correct them. Since non-ASCII characters are multi-byte by design, we can find them by comparing the byte length (LENGTH) to the character length (CHAR_LENGTH) and thereby identify rows that may hold double-encoded UTF-8 characters that need to be fixed.
- See if there are any records with multi-byte characters (if this query returns zero, then there don’t appear to be any records with multi-byte characters in your table and you can proceed to Step 8).
mysql> select count(*) from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD);
- Copy rows with multi-byte characters into a temporary table:
create table temptable ( select * from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD));
- Convert double-encoded UTF-8 characters to proper UTF-8 characters
This is actually a bit tricky. A double-encoded string is one that was properly encoded as UTF-8, but MySQL then did us the erroneous favor of converting it (from what it thought was latin1) to UTF-8 again when we set the column to UTF-8 encoding. Resolving this therefore requires a two-step process through which we “trick” MySQL in order to preclude it from doing us this “favor”.
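To make the double encoding concrete, here is a sketch that can be run from a utf8mb4 connection (the HEX values shown are what MySQL should report):
-- Properly encoded, 'é' is the two bytes C3 A9:
SELECT HEX(CONVERT('é' USING utf8mb4));  -- C3A9
-- Double encoding treats those two bytes as the latin1 characters
-- 'Ã' and '©' and encodes each of them again, yielding four bytes:
SELECT HEX(CONVERT(CONVERT(BINARY 'é' USING latin1) USING utf8mb4));  -- C383C2A9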
First, we set the encoding type for the column back to latin1, thereby removing the double encoding:
e.g.:
alter table temptable modify ArtistName varchar(128) character set latin1;
Note: Be sure to use the correct field type for your table. In the example above, for our table, the correct field type for ‘ArtistName’ was varchar(128), but the field in your table could be text or any other type. Be sure to specify it properly!
The problem is that now, if we set the column encoding back to UTF-8, MySQL will run the latin1 to UTF-8 data encoding for us again and we’ll be back to where we started. To avoid this, we change the column type to blob and THEN we set it to UTF-8. This exploits the fact that MySQL will not attempt to encode a blob. We are thereby able to “fool” the MySQL charset conversion to avoid the double encoding issue.
e.g.:
alter table temptable modify ArtistName blob;
alter table temptable modify ArtistName varchar(128) character set utf8;
(Again, as noted above, be sure to use the proper field type for your table.)
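Incidentally, the same one-layer fix can be sketched as a single in-place UPDATE, where the BINARY cast keeps MySQL from transcoding a second time (ArtistName as in the example above; adjust to your own column):
-- encode the characters back to latin1 bytes, then relabel those
-- bytes as UTF-8 without transcoding them again
update temptable
set ArtistName = CONVERT(BINARY CONVERT(ArtistName USING latin1) USING utf8);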
- Remove rows with only single-byte characters from the temporary table:
delete from temptable where LENGTH(MY_FIELD) = CHAR_LENGTH(MY_FIELD);
- Re-insert fixed rows back into the original table (before doing this, you may want to run some selects on the temptable to verify that it appears to be properly corrected, just as a sanity check).
replace into MY_TABLE (select * from temptable);
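Note that REPLACE matches rows via the table's primary or unique key, so this step assumes the original table has one. For the sanity check mentioned above, a quick spot check might be as simple as (hypothetical column name):
select ArtistName from temptable limit 20;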
8. Verify the remaining data and, if necessary, repeat the process in step 7 (this could be necessary, for example, if the data was triple-encoded). Further errors, if any, may be easiest to resolve manually.