... And how to go out from it.
Why this article?
The scope of the article is to light up what happens behind the scene when an application push UTF data on a MySQL server using Latin1 encoding.
I will show how the characters are store, what exactly happen if you mix UTF8 and latin1, what is transformed to what and why.
Finally I will show the safer way to use to convert a table or simply a table field, without exporting the full dataset.
I will also show what is a good way to do it when using master-slave scenario, with minimal impact on the live dataset.
What I am not going to discuss here, and I assume you already have familiarity with is:
- How to assign a character set or a collation to a table or fields;
- How collation works in in sorting;
- How to set the default character set in server and connecting clients.
If any of this point is not clear to you, please read first here (http://dev.mysql.com/doc/refman/5.6/en/charset.html)
For my exercise I will use two tables, two MySQL server instances, and Arabic text.
Please note that I have copy this text from Official UN nation page, as such I hope they do not contain any offensive text.
Step one check the status of our severs:
Master
(root@localhost) [(none)]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1
Connection id: 2
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.5.27-log MySQL Community Server (GPL)
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: latin1
Conn. characterset: latin1
UNIX socket: /opt/mysql_instances/instances/derekutfm/mysql.sock
Uptime: 2 min 34 sec
Threads: 3 Questions: 16 Slow queries: 0 Opens: 34 Flush tables: 1 Open tables: 26 Queries per second avg: 0.103
Slave
------------
(root@localhost) [(none)]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1
Connection id: 4
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.5.27-log MySQL Community Server (GPL)
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: utf8
Db characterset: utf8
Client characterset: latin1
Conn. characterset: latin1
UNIX socket: /opt/mysql_instances/instances/derekutfs/mysql.sock
Uptime: 10 min 43 sec
Threads: 2 Questions: 8 Slow queries: 0 Opens: 34 Flush tables: 1 Open tables: 26 Queries per second avg: 0.012
As you can see, I have define the MASTER to use latin1 as default also for the server, while for the SLAVE I choose to use UTF8.
In this exercise I am going to use only UTF8 for the examples never UTF8mb4.
First create the table.
CREATE TABLE `utf_test` ( `ID` integer AUTO_INCREMENT PRIMARY KEY, `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL, `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL, `binutf` varbinary(250) DEFAULT NULL, `different` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1
Now let start insert data again and see what happens:
insert into utf_test values(null,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا',
' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same');
Master ------------- (root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same'); *************************** 1. row *************************** ID: 1 notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8 yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا binutf: this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا different: Am I the same 1 row IN SET (0.00 sec) Slave --------------- (root@localhost) [test]>set names latin1; Query OK, 0 rows affected (0.00 sec) (root@localhost) [test]>select * FROM utf_test\G *************************** 1. row *************************** ID: 1 notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8 yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا binutf: this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا different: Am I the same 1 row IN SET (0.00 sec)
All good given the value are passed consistently
Now let add some text to the "different" fields and see IF we can still handle it correctly:
(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same? لصدأ القمح إلى كينيا'); Master -------------- (root@localhost) [test]>select * FROM utf_test\G ID: 3 notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8 yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا binutf: this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا different: Am I the same? لصدأ القمح إلى كينيا 2 rows IN SET (0.00 sec) Slave ---------------------------- ID: 3 notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8 yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا binutf: this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا different: Am I the same? لصدأ القمح إلى كينيا
Ok once more all good.
Is this real UTF8 or as we have seen sometimes fake UTF thanks to Latin1 push?
let see inserting another row with UTF field "yesutf" and "different" containing the same apparent data.
insert into utf_test values(null,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا');
Master ----------- (root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert same text everywhere', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا'); ID: 5 notutf: simple INSERT same text everywhere yesutf: ح إلى كينيا binutf: ح إلى كينيا different: ح إلى كينيا 3 rows IN SET (0.00 sec) Slave ------------------ ID: 5 notutf: simple INSERT same text everywhere yesutf: ح إلى كينيا binutf: ح إلى كينيا different: ح إلى كينيا 3 rows IN SET (0.00 sec)
Checking the hex:
Master ----- (root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 Slave ------------- (root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec)
As expected the hex code is different between the fields define as UTF8 and the one as Latin1,and I see them correctly because... ?
Look what happens if I change the Names to UTF8...:
(root@localhost) [test]>set names utf8; (root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** ID: 5 notutf: simple INSERT same text everywhere yesutf: Ø Ø¥Ù„Ù‰ كينيا binutf: ح إلى كينيا different: Ø Ø¥Ù„Ù‰ كينيا 1 row IN SET (0.00 sec)
Now only the binutf (varbinary) is shown correctly.
So just to recap what happen if I use latin1....
Using latin1
I see all fine
(root@localhost) [test]>set names latin1; (root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** ID: 5 notutf: simple INSERT same text everywhere yesutf: ح إلى كينيا binutf: ح إلى كينيا different: ح إلى كينيا 1 row IN SET (0.00 sec)
The hex values are not matching as:
(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec)
If I do a reverse conversion, also using UNHEX
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,unhex(hex(yesutf)) AS yesutf_hex,unhex(hex(binutf)) AS binutf_hex,unhex(hex(different)) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** id: 5 noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: Ø Ø¥Ù„Ù‰ كينيا binutf_hex: ح إلى كينيا different_hex: ح إلى كينيا 1 row IN SET (0.00 sec)
Again I see something different, is it confusing?
No.
It is quite clear that varbinary and varchar store the string using the same hexadecimal values, while the yesutf is different because it was expecting a UTF string.
Let us analyze JUST the first character coming from the arabic string: ح
This letter correspond to:
Unicode code point U+062D
character: ح
UTF8 characters: Ø
UTF8 hex: d8 ad
Meaning: ARABIC LETTER HAH
What happen is that I am inserting the correct codes but the UTF8 field read it as LATIN1 char per byte, as such it transform the value to Ø.
The value of that character in the UTF-8 encoding table and Unicode characters is:
Unicode code point U+00D8
character: Ø
UTF8 characters: Ã
UTF8 hex: c8 98
Meaning: LATIN CAPITAL LETTER O WITH STROKE
Comparing the values above with the HEX value we had before, you can see that they match perfectly
yesutf_hex: C398
binutf_hex: D8AD
different_hex: D8AD
So far I hope everything is clear.
The fields using LATIN1 and UT8 are correctly shown when using NAMES latin1 because the latin1 interpretation of the UTF8 hex values in the case of the fields with latin1 encoding.
While in the case of fields with UTF8 encoding, the server perform a double conversion of the codes, conversion that IS NOT executed when doing UNHEX, and in that case what it is shown is the REAL value of the stored codes.
Again the only SAFE one during these operation is the varbinary which store the code not bounded to any encoding.
So for instance if you have all your database using LATIN1 and you want to convert to UTF8 you cannot just do :
(root@localhost) [test]>alter table utf_corrupt modify different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
Because your text in the tables will be converted and become unusable if using UTF8 clients, see below:
Encoding set as LATIN1:
Client characterset: latin1
Conn. characterset: latin1
I am going to do the conversion:
(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL; ID: 5 notutf: simple INSERT same text everywhere yesutf: ح إلى كينيا binutf: ح إلى كينيا different: ح إلى كينيا 3 rows IN SET (0.00 sec) SET NAMES UTF8; Client characterset: utf8 Conn. characterset: utf8 (root@localhost) [test]>select * FROM utf_corrupt\G ID: 5 notutf: simple INSERT same text everywhere yesutf: Ø Ø¥Ù„Ù‰ كينيا binutf: ح إلى كينيا different: Ø Ø¥Ù„Ù‰ كينيا
The text is a mess again, this because the text is transformed:
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G *************************** 1. row *************************** id: 5 noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 <--------- 1 row IN SET (0.00 sec)
Starting from the beginning so the table is again:
(root@localhost) [test]>show CREATE TABLE utf_corrupt\G *************************** 1. row *************************** TABLE: utf_corrupt CREATE TABLE: CREATE TABLE `utf_corrupt` ( `ID` int(11) NOT NULL AUTO_INCREMENT, `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL, `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL, `binutf` varbinary(250) DEFAULT NULL, `different` varbinary(250) DEFAULT NULL, PRIMARY KEY (`ID`) ) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1 1 row IN SET (0.00 sec)
data is:
*************************** 3. row *************************** ID: 5 notutf: simple INSERT same text everywhere yesutf: ح إلى كينيا binutf: ح إلى كينيا different: ح إلى كينيا 3 rows IN SET (0.00 sec)
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G *************************** 1. row *************************** id: 5 noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec)
And encoding: Client characterset: latin1 Conn. characterset: latin1
To do properly you have to do a 3 steps conversion, mainly varchar -> varbinary -> varchar.As follow:
(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varbinary(250) DEFAULT NULL; Query OK, 3 rows affected (0.01 sec) Records: 3 Duplicates: 0 Warnings: 0 (root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G *************************** 1. row *************************** id: 5 noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <-------------- 1 row IN SET (0.00 sec)
Now all is fine and the string was not transformed.Now change the encoding with NAMES:
Set NAMES utf8;
Convert back to varchar and UTF ...
(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL; Query OK, 3 rows affected (0.02 sec) Records: 3 Duplicates: 0 Warnings: 0
And now yes you have the text stored correctly.
ID: 5 notutf: simple INSERT same text everywhere yesutf: Ø Ø¥Ù„Ù‰ كينيا binutf: ح إلى كينيا different: ح إلى كينيا 3 rows IN SET (0.00 sec) (root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G *************************** 1. row *************************** id: 5 noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265 yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec)
All good! My data is correctly moved from a fields base on varchar LATIN1 to the new varchar UTF8.The most important aspect is to ensure to use the right encoding and to do not allow the server to convert you text to wrong hex values.
In case of significant migration if you have a master slave setup,it is possible to play with them, to minimize the impact of the migration, following these steps:
- all latin1
- convert slave to varbinary
- insert data from master to slave keeping all as latin1
- swap the server
- convert ex-master to varbinary
- when done change all application connection to utf8 master_host
- convert tables back to varchar utf8 on slave
- swap master slave again
- convert slave table to UTF8
It seems cumbersome, but if you cannot afford to export the data change the table encoding and reload the data, and you cannot stop the write for long time, this is the safer way.The negative aspect of it is ... that varbinary use a different approach for sorting, it is based on the value of the binaries stored and you cannot use collation to modify it.
Let's go through the process:
Step 1) Check that all is latin1
Master ------------ (root@127.0.0.1:5510) [test]>\s -------------- /opt/mysql_templates/mysql-55o/bin/mysql Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1 Connection id: 11 Current DATABASE: test Current user: root@localhost SSL: NOT IN USE Current pager: stdout USING OUTFILE: '' USING delimiter: ; Server version: 5.5.27-log MySQL Community Server (GPL) Protocol version: 10 Connection: Localhost via UNIX socket Server characterset: latin1 Db characterset: latin1 Client characterset: latin1 Conn. characterset: latin1 UNIX socket: /opt/mysql_instances/instances/derekutfm/mysql.sock Uptime: 2 days 3 hours 3 min 49 sec Threads: 3 Questions: 198 Slow queries: 0 Opens: 89 FLUSH TABLES: 1 Open TABLES: 27 Queries per second avg: 0.001 -------------- Slave --------------------- (root@127.0.0.1:5511) [test]>\s -------------- /opt/mysql_templates/mysql-55o/bin/mysql Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1 Connection id: 9 Current DATABASE: test Current user: root@localhost SSL: NOT IN USE Current pager: stdout USING OUTFILE: '' USING delimiter: ; Server version: 5.5.27-log MySQL Community Server (GPL) Protocol version: 10 Connection: Localhost via UNIX socket Server characterset: utf8 Db characterset: utf8 Client characterset: latin1 Conn. characterset: latin1 UNIX socket: /opt/mysql_instances/instances/derekutfs/mysql.sock Uptime: 2 days 3 hours 3 min 45 sec Threads: 2 Questions: 177 Slow queries: 0 Opens: 87 FLUSH TABLES: 1 Open TABLES: 27 Queries per second avg: 0.000
Insert some data on the master ... just to have a couples of rows.
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 1', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا'); Query OK, 1 row affected (0.00 sec) (root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا'); Query OK, 1 row affected (0.00 sec)
Check the values:
Master ---------------------- (root@127.0.0.1:5510) [test]>select * FROM utf_test\G *************************** 1. row *************************** ID: 1 notutf: STEP1: INSERT DATA 1 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 2. row *************************** ID: 3 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 2 rows IN SET (0.00 sec) (root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** id: 3 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec) Slave ------------------------- (root@127.0.0.1:5511) [test]>select * FROM utf_test\G *************************** 1. row *************************** ID: 1 notutf: STEP1: INSERT DATA 1 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 2. row *************************** ID: 3 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 2 rows IN SET (0.00 sec) (root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** id: 3 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec)
Step 2) Convert to varbinary the slave, keep same encoding (NAMES latin1)
(root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varbinary(250); Query OK, 2 rows affected (0.01 sec) Records: 2 Duplicates: 0 Warnings: 0 (root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** id: 3 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <------------------ 1 row IN SET (0.00 sec)
Code remain consistent.
Step 3) Production continue to work as usual, no impact.
Insert some other records on the master:
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا'); Query OK, 1 row affected (0.00 sec) (root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا'); Query OK, 1 row affected (0.00 sec) (root@127.0.0.1:5510) [test]>select * FROM utf_test\G *************************** 1. row *************************** ID: 1 notutf: STEP1: INSERT DATA 1 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 2. row *************************** ID: 3 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 3. row *************************** ID: 5 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 4. row *************************** ID: 7 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 4 rows IN SET (0.00 sec) (root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G *************************** 1. row *************************** id: 1 noutf_hex: 53544550313A20696E7365727420646174612031 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 *************************** 2. row *************************** id: 3 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 *************************** 3. row *************************** id: 5 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 *************************** 4. row *************************** id: 7 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 4 rows IN SET (0.00 sec) ON The Slave ---------------------- (root@127.0.0.1:5511) [test]>select * FROM utf_test\G *************************** 1. row *************************** ID: 1 notutf: STEP1: INSERT DATA 1 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 2. row *************************** ID: 3 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 3. row *************************** ID: 5 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا *************************** 4. row *************************** ID: 7 notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 4 rows IN SET (0.00 sec) (root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G *************************** 1. row *************************** id: 1 noutf_hex: 53544550313A20696E7365727420646174612031 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 *************************** 2. row *************************** id: 3 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 *************************** 3. row *************************** id: 5 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 *************************** 4. row *************************** id: 7 noutf_hex: 53544550313A20696E7365727420646174612032 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 4 rows IN SET (0.00 sec)
Step 4) swap the server
5511 master and 5510 slave (but stop the replication from 5511->5510 and stop also the previous replication from 5510 to 511), production will continue to work in write, order by will be affected for the moment.
On new Master 5511:
*************************** 6. row *************************** ID: 11 <---------------------------------------------- notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 6 rows IN SET (0.00 sec) ON new Slave 5510: *************************** 4. row *************************** ID: 7 <-------------------------------- notutf: STEP1: INSERT DATA 2 yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 4 rows IN SET (0.00 sec)
Step 5) modify to varbinary the tables on the new Slave:
(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varbinary(250); Query OK, 4 rows affected (0.02 sec) Records: 4 Duplicates: 0 Warnings: 0
And now restart replication on 5510, so it can replicate from the master 5511.
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** ID: 11 notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 1 row IN SET (0.00 sec)(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** id: 11 noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec)
All good.
Step 6) change all Application connections to UTF8,
in our example I will change the NAMES on new Master 5511 and on slave 5510
SET NAMES UTF8;
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8
Master -------------------------- (root@127.0.0.1:5511) [test]>set names utf8; Query OK, 0 rows affected (0.00 sec) (root@127.0.0.1:5511) [test]>select * FROM utf_test\G *************************** 1. row *************************** ID: 1 notutf: STEP1: INSERT DATA 1 yesutf: Ø§Ù„Ù‚Ù…Ø Ø¥Ù„Ù‰ كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا ... *************************** 6. row *************************** ID: 11 notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down yesutf: Ø§Ù„Ù‚Ù…Ø Ø¥Ù„Ù‰ كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 6 rows IN SET (0.00 sec)
Note how the values in the yesutf was NOT correctly converted given the initial double translation.
Slave --------------- Server characterset: utf8 Db characterset: utf8 Client characterset: utf8 Conn. characterset: utf8 (root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** ID: 11 notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down yesutf: Ø§Ù„Ù‚Ù…Ø Ø¥Ù„Ù‰ كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 1 row IN SET (0.00 sec)
Step 7) Convert table on SLAVE back to varchar but using utf8 now
(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL; Query OK, 7 rows affected (0.01 sec) Records: 7 Duplicates: 0 Warnings: 0 (root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** ID: 15 notutf: STEP7: inserting DATA FROM EX-SLAVE after swap replication IS down yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 1 row IN SET (0.00 sec) (root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** id: 15 noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20 yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec)
Step 8) given all good put now the 5510 back as master,
given NOW the 5510 has the fields set as varchar and encoding/collation using UTF also the order by will work fine again.
At this point data will continue to be pass from master to slave.
ON Master --------------------- (root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP8: inserting data from master 5510 after swap server again ', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا'); Query OK, 1 row affected (0.00 sec) (root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** ID: 17 notutf: STEP8: inserting DATA FROM master 5510 after swap server again yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 1 row IN SET (0.00 sec) (root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G *************************** 1. row *************************** id: 17 noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20 yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 1 row IN SET (0.00 sec) ON Slave ------------------------- *************************** 8. row *************************** ID: 17 notutf: STEP8: inserting DATA FROM master 5510 after swap server again yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 8 rows IN SET (0.00 sec) *************************** 8. row *************************** id: 17 noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20 yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
Step 9) Finally convert all tables also on slave to varchar UTF8:
Master --------------- (root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP9: final Slave conversions', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا'); Slave --------------- (root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL; *************************** 9. row *************************** ID: 19 notutf: STEP9: final Slave conversions yesutf: القمح إلى كينيا binutf: القمح إلى كينيا different: القمح إلى كينيا 9 rows IN SET (0.00 sec) (root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G *************************** 1. row *************************** id: 1 noutf_hex: 53544550313A20696E7365727420646174612031 yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 ... id: 19 noutf_hex: 53544550393A2066696E616C20536C61766520636F6E76657273696F6E73 yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 9 rows IN SET (0.00 sec)
SELECT USER(), CHARSET(USER()), COLLATION(USER())\G(root@127.0.0.1:5511) [test]>SELECT USER(), CHARSET(USER()), COLLATION(USER())\G *************************** 1. row *************************** USER(): root@127.0.0.1 CHARSET(USER()): utf8 COLLATION(USER()): utf8_general_ci 1 row IN SET (0.00 sec) AND (root@127.0.0.1:5511) [test]>show VARIABLES LIKE '%server%'\G *************************** 1. row *************************** Variable_name: character_set_server Value: utf8 *************************** 2. row *************************** Variable_name: collation_server Value: utf8_unicode_ci *************************** 3. row *************************** Variable_name: server_id Value: 5511 3 rows IN SET (0.00 sec)
Conclusion
Reference
Thanks
Finally a huge thanks to Derek Downey, because he raise the issue (again) to me in relation to a customer migration.
Thanks for his dedication, professionality and will to learn and have fun together.