MHA GTID-based failover: a code walkthrough

As a supplement to the article below, this note describes the processing flow of MHA's GTID-based failover.
http://blog.chinaunix.net/uid-20726500-id-5700631.html

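MHA treats a failover as GTID-based only when the following 3 conditions are all met (see the function get_gtid_status):
gtid_mode=1 (i.e. GTID mode is enabled) on every node
Executed_Gtid_Set is non-empty on every node
Auto_Position=1 on at least one node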
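
The three conditions can be expressed as a small predicate. The following is a minimal Python sketch, not MHA's Perl implementation of get_gtid_status: the function name is_gtid_failover and the dictionary layout for a node are assumptions made for illustration, and GTID mode is assumed to be reported as "ON" or 1.

```python
from typing import Dict, List


def is_gtid_failover(nodes: List[Dict[str, object]]) -> bool:
    """Return True only when all three conditions listed above hold.

    Each node is a hypothetical dict such as
    {"gtid_mode": "ON", "Executed_Gtid_Set": "uuid:1-5", "Auto_Position": 1}.
    """
    if not nodes:
        return False
    # Condition 1: GTID mode is enabled on every node.
    if not all(n.get("gtid_mode") in ("ON", 1) for n in nodes):
        return False
    # Condition 2: every node reports a non-empty Executed_Gtid_Set.
    if not all(bool(n.get("Executed_Gtid_Set")) for n in nodes):
        return False
    # Condition 3: at least one node replicates with Auto_Position=1.
    return any(n.get("Auto_Position") == 1 for n in nodes)
```

If any condition fails, the predicate returns False, which corresponds to falling back to the non-GTID failover path.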

GTID-based MHA failover

MHA::MasterFailover::main()
->do_master_failover
Phase 1: Configuration Check Phase
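-> check_settings:
check_node_version: checks the MHA version information
connect_all_and_read_server_status: verifies that the MySQL instance on each node can be connected to
get_dead_servers/get_alive_servers/get_alive_slaves: double-checks whether each node is dead or alive
start_sql_threads_if: checks whether Slave_SQL_Running is Yes and starts the SQL thread if it is not
Phase 2: Dead Master Shutdown Phase: for our purposes, its only effect is to stop the IO threads
-> force_shutdown($dead_master):
stop_io_thread: stops the IO thread on every slave (the master is about to be stopped)
force_shutdown_internal (in practice this runs the master_ip_failover_script/shutdown_script from the configuration file; skipped if neither is set):
master_ip_failover_script: if a VIP is configured, the VIP is switched first
shutdown_script: runs the shutdown script if one is configured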
Phase 3: Master Recovery Phase
-> Phase 3.1: Getting Latest Slaves Phase (find the latest slave)
read_slave_status: gets each slave's binlog file/position
check_slave_status: calls "SHOW SLAVE STATUS" to get the following information from the slave:
Slave_IO_State, Master_Host,
Master_Port, Master_User,
Slave_IO_Running, Slave_SQL_Running,
Master_Log_File, Read_Master_Log_Pos,
Relay_Master_Log_File, Last_Errno,
Last_Error, Exec_Master_Log_Pos,
Relay_Log_File, Relay_Log_Pos,
Seconds_Behind_Master, Retrieved_Gtid_Set,
Executed_Gtid_Set, Auto_Position,
Replicate_Do_DB, Replicate_Ignore_DB, Replicate_Do_Table,
Replicate_Ignore_Table, Replicate_Wild_Do_Table,
Replicate_Wild_Ignore_Table
identify_latest_slaves:
compares Master_Log_File/Read_Master_Log_Pos across the slaves to find the latest slave(s)
identify_oldest_slaves:
compares Master_Log_File/Read_Master_Log_Pos across the slaves to find the oldest slave(s)
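
The comparison used by identify_latest_slaves and identify_oldest_slaves can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not MHA's Perl code: each slave is a hypothetical dict holding the two SHOW SLAVE STATUS fields, and binlog file names are assumed to end in a numeric sequence such as mysql-bin.000010.

```python
from typing import Dict, List, Tuple


def io_position(slave: Dict[str, object]) -> Tuple[int, int]:
    """Sort key: (binlog sequence number, Read_Master_Log_Pos)."""
    # Compare the numeric suffix rather than the raw file name string.
    sequence = int(str(slave["Master_Log_File"]).rsplit(".", 1)[1])
    return (sequence, int(slave["Read_Master_Log_Pos"]))


def identify_latest_slaves(slaves: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Slaves that have received the most events from the master."""
    best = max(io_position(s) for s in slaves)
    return [s for s in slaves if io_position(s) == best]


def identify_oldest_slaves(slaves: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Slaves that have received the fewest events from the master."""
    worst = min(io_position(s) for s in slaves)
    return [s for s in slaves if io_position(s) == worst]
```

Both functions return a list because several slaves may share the same position.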
-> Phase 3.2: Determining New Master Phase
get_most_advanced_latest_slave: finds the slave whose (Relay_Master_Log_File, Exec_Master_Log_Pos) is most advanced
select_new_master: selects the new master node
If preferred nodes are specified, one of the active preferred nodes becomes the new master.
If the latest server is too far behind (e.g. its SQL thread was stopped for an online backup),
it should not be used as the new master; its relay logs should be fetched from it instead. Even a
configured preferred master does not become the master if it is far behind.
get_candidate_masters:
the nodes configured with candidate_master>0 in the configuration file
get_bad_candidate_masters:
# The following servers can not be master:
# - dead servers
# - Set no_master in conf files (i.e. DR servers)
# - log_bin is disabled
# - Major version is not the oldest
# - too much replication delay (the gap between the slave's and the master's binlog position exceeds 100000000)
Searching from candidate_master slaves which have received the latest relay log events
if NOT FOUND:
Searching from all candidate_master slaves
if NOT FOUND:
Searching from all slaves which have received the latest relay log events
if NOT FOUND:
Searching from all slaves
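
The four-step search order of select_new_master can be sketched as below. This is a simplified Python illustration, not the MHA implementation: the slave dicts, the latest_hosts set (the result of identify_latest_slaves) and the bad_hosts set (the result of get_bad_candidate_masters) are assumed inputs introduced for this example.

```python
from typing import Dict, List, Optional, Set


def select_new_master(
    slaves: List[Dict[str, object]],
    latest_hosts: Set[str],
    bad_hosts: Set[str],
) -> Optional[Dict[str, object]]:
    """Pick a new master following the search order described above."""
    candidates = [s for s in slaves if int(s.get("candidate_master", 0)) > 0]

    def pick(pool: List[Dict[str, object]], only_latest: bool) -> Optional[Dict[str, object]]:
        for slave in pool:
            if slave["host"] in bad_hosts:
                continue  # dead, no_master, log_bin disabled, too delayed, ...
            if only_latest and slave["host"] not in latest_hosts:
                continue  # has not received the latest relay log events
            return slave
        return None

    search_order = (
        (candidates, True),   # candidate_master slaves with the latest events
        (candidates, False),  # any candidate_master slave
        (slaves, True),       # any slave with the latest events
        (slaves, False),      # any slave
    )
    for pool, only_latest in search_order:
        chosen = pick(pool, only_latest)
        if chosen is not None:
            return chosen
    return None
```

Returning None corresponds to the case where no usable slave remains, in which the failover cannot proceed.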

-> Phase 3.3: New Master Recovery Phase
recover_master_gtid_internal:
wait_until_relay_log_applied
stop_slave
If the new master is not the slave with the latest relay log:
$latest_slave->wait_until_relay_log_applied: waits until Exec_Master_Log_Pos equals Read_Master_Log_Pos on the slave with the latest relay log
change_master_and_start_slave( $target, $latest_slave)
wait_until_in_sync( $target, $latest_slave )
save_from_binlog_server:
iterates over all binlog servers and runs save_binary_logs --command=save to fetch the remaining binlog events
apply_binlog_to_master:
applies the binlog events fetched from the binlog servers (if any)
If master_ip_failover_script is set, calls $master_ip_failover_script --command=start to bring up the VIP
If skip_disable_read_only is not set, sets read_only=0
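
The order of operations in Phase 3.3 can be summarized as a plan builder. The sketch below only produces the list of steps as strings; it does not talk to MySQL, and the function name master_recovery_plan and its parameters are assumptions made for illustration. As a simplification, the apply step is emitted whenever at least one binlog server is configured.

```python
from typing import List, Optional


def master_recovery_plan(
    new_master_is_latest: bool,
    binlog_servers: List[str],
    master_ip_failover_script: Optional[str] = None,
    skip_disable_read_only: bool = False,
) -> List[str]:
    """Return the Phase 3.3 steps in the order described above."""
    steps = [
        "wait_until_relay_log_applied(new_master)",
        "stop_slave(new_master)",
    ]
    if not new_master_is_latest:
        # Catch up from the slave that holds the latest relay log.
        steps += [
            "wait_until_relay_log_applied(latest_slave)",
            "change_master_and_start_slave(new_master, latest_slave)",
            "wait_until_in_sync(new_master, latest_slave)",
        ]
    for server in binlog_servers:
        steps.append(f"save_binary_logs --command=save on {server}")
    if binlog_servers:
        steps.append("apply_binlog_to_master(new_master)")
    if master_ip_failover_script:
        steps.append(f"{master_ip_failover_script} --command=start")
    if not skip_disable_read_only:
        steps.append("set read_only=0 on new_master")
    return steps
```

For example, master_recovery_plan(True, []) yields only the two initial steps followed by the read_only change.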

Phase 4: Slaves Recovery Phase
recover_slaves_gtid_internal
-> Phase 4.1: Starting Slaves in parallel
Runs change_master_and_start_slave on every slave
If wait_until_gtid_in_sync is set, waits for each slave to catch up using "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS(?,0)"
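
Phase 4.1 can be pictured as the same two steps run concurrently for each slave. The following Python sketch uses a thread pool purely for illustration and makes no claim about how MHA itself parallelizes the work. The callbacks change_master and run_sql are hypothetical stand-ins for real database calls, and passing the new master's GTID set as the bind parameter is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

WAIT_SQL = "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS(?,0)"


def recover_slaves(
    slaves: List[str],
    new_master: str,
    master_gtid_set: str,
    change_master: Callable[[str, str], None],
    run_sql: Callable[[str, str, Tuple[str, ...]], None],
    wait_until_gtid_in_sync: bool = False,
) -> List[str]:
    """Repoint every slave at the new master, optionally waiting for sync."""

    def recover_one(slave: str) -> str:
        # Equivalent of change_master_and_start_slave for one slave.
        change_master(slave, new_master)
        if wait_until_gtid_in_sync:
            # Block until the SQL thread has applied the given GTID set.
            run_sql(slave, WAIT_SQL, (master_gtid_set,))
        return slave

    if not slaves:
        return []
    with ThreadPoolExecutor(max_workers=len(slaves)) as pool:
        # map() returns results in input order even though work runs in parallel.
        return list(pool.map(recover_one, slaves))
```

With wait_until_gtid_in_sync left at False, each worker returns as soon as replication has been repointed.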

Phase 5: New master cleanup phase
reset_slave_on_new_master
Cleaning up the new master simply means resetting its slave info, i.e. discarding its old replication settings. This completes the master failover.