Wednesday, November 20, 2019

MySQL : Converting Table Character Sets From latin1 to utf8

Let's assume we were using latin1 for the database and client character set. Even though latin1 is a single-byte character set, we can still insert multi-byte characters because of double-encoding.
mysql --user=root --password


CREATE DATABASE char_test_db;
USE char_test_db;

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (
  id          INT(11) NOT NULL AUTO_INCREMENT,
  description VARCHAR(50),
  PRIMARY KEY(id)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;

SET NAMES latin1;

INSERT INTO t1 (description) VALUES ('¡Volcán!');

SELECT id, description, HEX(description) FROM t1;
+----+-------------+----------------------+
| id | description | HEX(description)     |
+----+-------------+----------------------+
|  1 | ¡Volcán!    | C2A1566F6C63C3A16E21 |
+----+-------------+----------------------+
1 row in set (0.00 sec)

mysql>
Due to the amount of multi-byte information coming in, we now decide we need to switch to utf8 as the character set for the database and client. That's a simple change.
SET NAMES utf8;

ALTER TABLE t1 CONVERT TO CHARACTER SET utf8;

SELECT id, description, HEX(description) FROM t1;
+----+----------------+------------------------------+
| id | description    | HEX(description)             |
+----+----------------+------------------------------+
|  1 | ¡Volcán!     | C382C2A1566F6C63C383C2A16E21 |
+----+----------------+------------------------------+
1 row in set (0.01 sec)

mysql>
Unfortunately, we've mangled the data. If we switch the client back to latin1, the data looks OK though.
SET NAMES latin1;

SELECT id, description, HEX(description) FROM t1;
+----+-------------+------------------------------+
| id | description | HEX(description)             |
+----+-------------+------------------------------+
|  1 | ¡Volcán!    | C382C2A1566F6C63C383C2A16E21 |
+----+-------------+------------------------------+
1 row in set (0.00 sec)

mysql>

The Solution

Recreate the table in its original state.
DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (
  id          INT(11) NOT NULL AUTO_INCREMENT,
  description VARCHAR(50),
  PRIMARY KEY(id)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;

SET NAMES latin1;

INSERT INTO t1 (description) VALUES ('¡Volcán!');

SELECT id, description, HEX(description) FROM t1;
+----+-------------+----------------------+
| id | description | HEX(description)     |
+----+-------------+----------------------+
|  1 | ¡Volcán!    | C2A1566F6C63C3A16E21 |
+----+-------------+----------------------+
1 row in set (0.00 sec)

mysql>
A better way to convert the character set of the table is to first convert the description column to a BLOB. BLOB data has no associated character set, so it is unchanged by the conversion of the table character set. We can then safely convert the character set of the table and convert the description column back to its original data type.
SET NAMES utf8;
ALTER TABLE t1 CHANGE description description BLOB;
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8, CHANGE description description VARCHAR(50);
Now the data looks fine when viewed from a utf8 client.
SET NAMES utf8;

SELECT id, description, HEX(description) FROM t1;
+----+-------------+----------------------+
| id | description | HEX(description)     |
+----+-------------+----------------------+
|  1 | ¡Volcán!    | C2A1566F6C63C3A16E21 |
+----+-------------+----------------------+
1 row in set (0.01 sec)

mysql>
As you might expect, the data will look a little mangled from a latin1 client though!
SET NAMES latin1;

SELECT id, description, HEX(description) FROM t1;
+----+-------------+----------------------+
| id | description | HEX(description)     |
+----+-------------+----------------------+
|  1 | �Volc�n!    | C2A1566F6C63C3A16E21 |
+----+-------------+----------------------+
1 row in set (0.00 sec)

mysql>

No comments:

Post a Comment