ProxySQL and MHA integration
MHA (Master High Availability Manager and tools for MySQL) integrates almost seamlessly with ProxySQL.
This means that you can count on the standard MHA features to manage the failover, and on ProxySQL to manage the traffic and shift it from one server to another.
This is one of the main differences between using MHA with a VIP and using MHA with ProxySQL.
There is no need to move IPs or re-define DNS.
The following is an example of an MHA configuration file for use with ProxySQL:
[server default]
user=mha
password=mha
ssh_user=root
repl_password=replica
manager_log=/tmp/mha.log
manager_workdir=/tmp
remote_workdir=/tmp
master_binlog_dir=/opt/mysql_instances/mha1/logs
client_bindir=/opt/mysql_templates/mysql-57/bin
client_libdir=/opt/mysql_templates/mysql-57/lib
master_ip_failover_script=/opt/tools/mha/mha4mysql-manager/samples/scripts/master_ip_failover
master_ip_online_change_script=/opt/tools/mha/mha4mysql-manager/samples/scripts/master_ip_online_change
log_level=debug

[server1]
hostname=mha1r
ip=192.168.1.104
candidate_master=1

[server2]
hostname=mha2r
ip=192.168.1.107
candidate_master=1

[server3]
hostname=mha3r
ip=192.168.1.111
candidate_master=1

[server4]
hostname=mha4r
ip=192.168.1.109
no_master=1
The only thing you need to be sure of is to comment out the "FIX ME" lines in the sample scripts.
In the MHA installation directory, look for:
mha4mysql-manager/samples/scripts/master_ip_failover
mha4mysql-manager/samples/scripts/master_ip_online_change
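In the stock samples those placeholders typically appear as bare FIXME_xxx calls (undefined subroutines that make the script abort). A quick, hedged way to comment them out is something like:

sed -i 's/FIXME_xxx/#FIXME_xxx/' mha4mysql-manager/samples/scripts/master_ip_failover
sed -i 's/FIXME_xxx/#FIXME_xxx/' mha4mysql-manager/samples/scripts/master_ip_online_change

Check the result by hand afterwards, since the exact placeholder text can differ between MHA versions.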
After that, just install MHA as usual.
In ProxySQL, be sure to have the mha users and the servers set.
One very important thing when using ProxySQL with standard replication is that the ProxySQL monitor user needs an additional privilege.
It must have REPLICATION CLIENT, or it will fail to check the slave lag.
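For example, assuming the ProxySQL monitor account is the default one (user 'monitor') and already exists on the MySQL servers, the extra privilege can be granted with something like:

GRANT USAGE, REPLICATION CLIENT ON *.* TO 'monitor'@'%';

Adjust the user and host to match your own monitor configuration.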
Also, the servers MUST have a defined value for the attribute max_replication_lag, or the check will be ignored.
As a reminder:
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.104',600,3306,1000,0);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.104',601,3306,1000,10);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.107',601,3306,1000,10);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.111',601,3306,1000,10);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.109',601,3306,1000,10);
INSERT INTO mysql_replication_hostgroups VALUES (600,601);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

INSERT INTO mysql_query_rules (username,destination_hostgroup,active) VALUES('mha_W',600,1);
INSERT INTO mysql_query_rules (username,destination_hostgroup,active) VALUES('mha_R',601,1);
INSERT INTO mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) VALUES('mha_RW',600,1,3,'^SELECT.*FOR UPDATE');
INSERT INTO mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) VALUES('mha_RW',601,1,3,'^SELECT');
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

INSERT INTO mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent) VALUES ('mha_W','test',1,600,'test_mha',1);
INSERT INTO mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent) VALUES ('mha_R','test',1,601,'test_mha',1);
INSERT INTO mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent) VALUES ('mha_RW','test',1,600,'test_mha',1);
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL USERS TO DISK;
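A quick sanity check from the ProxySQL admin interface, just to confirm the servers and the writer/reader hostgroups made it to runtime (a sketch, using the 600/601 hostgroups defined above):

SELECT hostgroup_id, hostname, port, status, max_replication_lag FROM runtime_mysql_servers WHERE hostgroup_id IN (600,601);
SELECT * FROM mysql_replication_hostgroups;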
OK, all is ready, let us rock'n'roll.
Controlled fail-over

First of all, masterha_manager should not be running or you will get an error.
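A quick way to verify that, and stop it if needed, using the standard MHA tools:

masterha_check_status --conf=/etc/mha.cnf
masterha_stop --conf=/etc/mha.cnf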
Then let us start some traffic.
Write:
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-host=192.168.1.50 --mysql-port=3311 --mysql-user=mha_RW --mysql-password=test --mysql-db=mha_test --db-driver=mysql --oltp-tables-count=50 --oltp-table-size=5000 --max-requests=0 --max-time=900 --oltp-point-selects=5 --oltp-read-only=off --oltp-dist-type=uniform --oltp-reconnect-mode=transaction --oltp-skip-trx=off --num-threads=10 --report-interval=10 --mysql-ignore-errors=all run

Read only:
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-host=192.168.1.50 --mysql-port=3311 --mysql-user=mha_RW --mysql-password=test --mysql-db=mha_test --db-driver=mysql --oltp-tables-count=50 --oltp-table-size=5000 --max-requests=0 --max-time=900 --oltp-point-selects=5 --oltp-read-only=on --num-threads=10 --oltp-reconnect-mode=query --oltp-skip-trx=on --report-interval=10 --mysql-ignore-errors=all run
Let it run for a bit then check:
mysql> SELECT * FROM stats_mysql_connection_pool WHERE hostgroup BETWEEN 600 AND 601 ORDER BY hostgroup,srv_host DESC;
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | STATUS | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 600       | 192.168.1.104 | 3306     | ONLINE | 10       | 0        | 20     | 0       | 551256  | 44307633        | 0               | 285        | <--- current Master
| 601       | 192.168.1.111 | 3306     | ONLINE | 5        | 3        | 11     | 0       | 1053685 | 52798199        | 4245883580      | 1133       |
| 601       | 192.168.1.109 | 3306     | ONLINE | 3        | 5        | 10     | 0       | 1006880 | 50473746        | 4052079567      | 369        |
| 601       | 192.168.1.107 | 3306     | ONLINE | 3        | 5        | 13     | 0       | 1040524 | 52102581        | 4178965796      | 604        |
| 601       | 192.168.1.104 | 3306     | ONLINE | 7        | 1        | 16     | 0       | 987548  | 49458526        | 3954722258      | 285        |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
And now let us try to perform the failover.
To do so, we will instruct MHA to do a switch, and to set the OLD master as a new slave:
masterha_master_switch --master_state=alive --conf=/etc/mha.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=0
Now let us check what happened:
[ 160s] threads: 10, tps: 354.50, reads: 3191.10, writes: 1418.50, response time: 48.96ms (95%), errors: 0.00, reconnects: 0.00
[ 170s] threads: 10, tps: 322.50, reads: 2901.98, writes: 1289.89, response time: 55.45ms (95%), errors: 0.00, reconnects: 0.00
[ 180s] threads: 10, tps: 304.60, reads: 2743.12, writes: 1219.91, response time: 58.09ms (95%), errors: 0.10, reconnects: 0.00 <--- moment of the switch
[ 190s] threads: 10, tps: 330.40, reads: 2973.40, writes: 1321.00, response time: 50.52ms (95%), errors: 0.00, reconnects: 0.00
[ 200s] threads: 10, tps: 304.20, reads: 2745.60, writes: 1217.60, response time: 58.40ms (95%), errors: 0.00, reconnects: 1.00
[ 210s] threads: 10, tps: 353.80, reads: 3183.80, writes: 1414.40, response time: 48.15ms (95%), errors: 0.00, reconnects: 0.00
Check ProxySQL:
mysql> SELECT * FROM stats_mysql_connection_pool WHERE hostgroup BETWEEN 600 AND 601 ORDER BY hostgroup,srv_host DESC;
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | STATUS | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 600       | 192.168.1.107 | 3306     | ONLINE | 10       | 0        | 10     | 0       | 123457  | 9922280         | 0               | 658        | <--- new master
| 601       | 192.168.1.111 | 3306     | ONLINE | 2        | 6        | 14     | 0       | 1848302 | 91513537        | 7590137770      | 1044       |
| 601       | 192.168.1.109 | 3306     | ONLINE | 5        | 3        | 12     | 0       | 1688789 | 83717258        | 6927354689      | 220        |
| 601       | 192.168.1.107 | 3306     | ONLINE | 3        | 5        | 13     | 0       | 1834415 | 90789405        | 7524861792      | 658        |
| 601       | 192.168.1.104 | 3306     | ONLINE | 6        | 2        | 24     | 0       | 1667252 | 82509124        | 6789724589      | 265        |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
In this case the servers were not behind the master and the switch happened quite fast.
We can see that the WRITE operations, which normally are an issue given the need to move a VIP around or change name resolution, have only a very limited hiccup.
Read operations were not affected at all.
Nice eh?
Do you know how long it takes to do a switch in these conditions?

real 0m2.710s

Yes, 2.7 seconds.
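If you want to reproduce the measurement, just wrap the switch command seen above with time:

time masterha_master_switch --master_state=alive --conf=/etc/mha.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=0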
This is further evidence that most of the time taken by a switch with MHA is caused by the need to redirect traffic from A to B over the network.
Crash fail-over
What happens if, instead of a nice controlled switch, we have to cover a real failover?
First of all let us start masterha_manager:
nohup masterha_manager --conf=/etc/mha.cnf --wait_on_monitor_error=60 --wait_on_failover_error=60 >> /tmp/mha.log 2>&1 &
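While the test runs, you can follow what the manager is doing in real time from the log file defined in the configuration above:

tail -f /tmp/mha.log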
Then let us start some load again.
Finally, go to the MySQL node that is the current master (xxx.xxx.xxx.107):
ps aux|grep mysql
mysql 18755  0.0  0.0  113248   1608 pts/0 S  Aug28  0:00 /bin/sh /opt/mysql_templates/mysql-57/bin/mysqld_safe --defaults-file=/opt/mysql_instances/mha1/my.cnf
mysql 21975  3.2 30.4 4398248 941748 pts/0 Sl Aug28 93:21 /opt/mysql_templates/mysql-57/bin/mysqld --defaults-file=/opt/mysql_instances/mha1/my.cnf --basedir=/opt/mysql_templates/mysql-57/ --datadir=/opt/mysql_instances/mha1/data --plugin-dir=/opt/mysql_templates/mysql-57//lib/plugin --log-error=/opt/mysql_instances/mha1/mysql-3306.err --open-files-limit=65536 --pid-file=/opt/mysql_instances/mha1/mysql.pid --socket=/opt/mysql_instances/mha1/mysql.sock --port=3306
And kill the MySQL process.
kill -9 21975 18755
As before, let us check what happened on the application side:
[ 80s] threads: 4, tps: 213.20, reads: 1919.10, writes: 853.20, response time: 28.74ms (95%), errors: 0.00, reconnects: 0.00
[ 90s] threads: 4, tps: 211.30, reads: 1901.80, writes: 844.70, response time: 28.63ms (95%), errors: 0.00, reconnects: 0.00
[ 100s] threads: 4, tps: 211.90, reads: 1906.40, writes: 847.90, response time: 28.60ms (95%), errors: 0.00, reconnects: 0.00
[ 110s] threads: 4, tps: 211.10, reads: 1903.10, writes: 845.30, response time: 29.27ms (95%), errors: 0.30, reconnects: 0.00 <-- issue starts
[ 120s] threads: 4, tps: 198.30, reads: 1785.10, writes: 792.40, response time: 28.43ms (95%), errors: 0.00, reconnects: 0.00
[ 130s] threads: 4, tps: 0.00, reads: 0.60, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects: 0.40 <-- total stop in write
[ 140s] threads: 4, tps: 173.80, reads: 1567.80, writes: 696.30, response time: 34.89ms (95%), errors: 0.40, reconnects: 0.00 <-- writes restart
[ 150s] threads: 4, tps: 195.20, reads: 1755.10, writes: 780.50, response time: 33.98ms (95%), errors: 0.00, reconnects: 0.00
[ 160s] threads: 4, tps: 196.90, reads: 1771.30, writes: 786.80, response time: 33.49ms (95%), errors: 0.00, reconnects: 0.00
[ 170s] threads: 4, tps: 193.70, reads: 1745.40, writes: 775.40, response time: 34.39ms (95%), errors: 0.00, reconnects: 0.00
[ 180s] threads: 4, tps: 191.60, reads: 1723.70, writes: 766.20, response time: 35.82ms (95%), errors: 0.00, reconnects: 0.00
So it takes ~10 seconds to perform the failover.
To better understand, let us see what happened in MHA-land:
Tue Aug 30 09:33:33 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Aug 30 09:33:33 2016 - [info] Reading application default configuration from /etc/mha.cnf..
... Read conf and start
Tue Aug 30 09:33:47 2016 - [debug] Trying to get advisory lock..
Tue Aug 30 09:33:47 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
... Wait for errors
Tue Aug 30 09:34:47 2016 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away) <--- Error time
Tue Aug 30 09:34:56 2016 - [warning] Connection failed 4 time(s).. <--- Finally MHA decide to do something
Tue Aug 30 09:34:56 2016 - [warning] Master is not reachable from health checker!
Tue Aug 30 09:34:56 2016 - [warning] Master mha2r(192.168.1.107:3306) is not reachable!
Tue Aug 30 09:34:56 2016 - [warning] SSH is reachable.
Tue Aug 30 09:34:58 2016 - [info] Master failover to mha1r(192.168.1.104:3306) completed successfully. <--- end of the failover
MHA sees the server failing at xx:47, but because of the retries and check validation it only fully acknowledges the downtime at xx:56.
That is ~8 seconds later.
Performing the whole failover then takes only ~2 seconds (again).
No movable IP or DNS change is involved, and as such the operations were fast.
This is obviously true when the servers already have the needed binary log events; it may be a different story if MHA also has to manage and push data from the binary log to MySQL.
As you can see, ProxySQL can help a lot in reducing the time in this scenario too, by completely skipping the network-related operations which, as we know, are the ones causing the most trouble in these cases.