Today I ran into a serious issue. My server rebooted unexpectedly in the middle of an update. I found the root cause: bad disk queue timing for the Qdisk cluster component. I forgot to disable cluster services before the update, which is the recommended action. I normally remember to do it, but this time I was in a hurry and forgot. It is a shame, but it happens ;(
The server did not start properly and I had to use a rescue CD to fix the kernel. I wasted a lot of time: it would have been enough to boot the system from the previous kernel in the GRUB menu and fix the yum transactions.
Let's fix the cluster node. We do not have much time. I know all services are up and running on the standby node, but we have no redundancy.
Once the system is up with the previous kernel, you can check how many packages are missing or duplicated.
Before you start, please create a list of the duplicated packages. It may be important for later steps. Do not start without it!
First step: cancel all incomplete yum transactions:
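One way to build that list, as a rough sketch: `package-cleanup --dupes` from yum-utils prints the duplicates directly, and the helper below is an alternative that works on plain `rpm -qa` output. The function name `list_dupes` and the output file name are my own choices for illustration, not part of any tool.

```shell
# Print every line that occurs more than once on stdin.
# Fed with "name.arch" lines, this yields packages installed in several versions.
list_dupes() {
    sort | uniq -d
}

# Usage (commented out so the snippet has no side effects; file name is hypothetical):
# rpm -qa --qf '%{NAME}.%{ARCH}\n' | list_dupes > /root/packages-list-dupes-$(hostname)-$(date +%F)
```

Kernel packages are normally installed in several versions on purpose, so expect them to show up in this list as well.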
# yum-complete-transaction --cleanup-only
Second step: remove the duplicate packages from the system:
# package-cleanup --cleandupes
Be aware that at the end of the checking phase you will see the list of packages that will be removed. Please save that list to a separate file. It will be useful if the new package versions have not been installed yet.
After the reboot I saw that some packages were missing :( It really hurts, but we have to handle it.
# mkdir /root/tmp/
# cat packages-list-erased-`hostname`-`date +%F`
Erasing : ipa-client-3.0.0-37.el6.x86_64 1/252
Erasing : sssd-1.9.2-129.el6.x86_64 2/252
Erasing : libvirt-client-0.10.2-29.el6.x86_64 3/252
.
..
...
Erasing : glibc-2.12-1.132.el6.x86_64 250/252
Erasing : tzdata-2013g-1.el6.noarch 251/252
Erasing : libgcc-4.4.7-4.el6.x86_64 252/252
# cat packages-list-erased-`hostname`-`date +%F` | awk '{ print $3 }' | sed 's/^[0-9]://' | \
  awk -F- '{ print $1 "-" $2}' | sed 's/-[0-9].*$//' > /root/tmp/packages-list-missed-`hostname`-`date +%F`
# for package in $(cat /root/tmp/packages-list-missed-$(hostname)-$(date +%F)); do yum -y install $package; done
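The name extraction above keeps only the first two hyphen-separated fields, so a package name such as java-1.7.0-openjdk would be cut short. Here is a slightly more defensive sketch that strips the trailing version and release.arch fields instead. The function names `erased_names` and `install_missing` are hypothetical helpers of mine, and the sketch assumes the saved list has the same "Erasing : package N/M" layout as shown earlier.

```shell
# Turn "Erasing : name-version-release.arch N/M" lines into bare package names.
erased_names() {
    awk '$1 == "Erasing" { print $3 }' |
        sed -e 's/^[0-9]*://' -e 's/-[^-]*-[^-]*$//' |   # drop epoch, then version and release.arch
        sort -u
}

# Install every name from the list file ($1) that rpm does not report as installed.
install_missing() {
    while read -r pkg; do
        rpm -q "$pkg" >/dev/null 2>&1 || yum -y install "$pkg" </dev/null
    done < "$1"
}

# Usage (commented out so the snippet has no side effects):
# erased_names < packages-list-erased-$(hostname)-$(date +%F) > /root/tmp/packages-list-missed-$(hostname)-$(date +%F)
# install_missing /root/tmp/packages-list-missed-$(hostname)-$(date +%F)
```

Skipping names that `rpm -q` already knows means the loop can be run again safely if it is interrupted.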
Of course you must reboot the server for all changes to take effect. After the reboot I was very pleasantly surprised: my server was up and running with all services. I checked the cluster services and they were fine too, with one small remark: please check that rgmanager is up and running after the reboot. I found that it had not started during boot.
# chkconfig --add rgmanager
# chkconfig --level=3 rgmanager on
# chkconfig --level=5 rgmanager on
# service rgmanager start
# clustat
Cluster Status for RH6cluster01 @ Sat Apr 25 20:59:20 2015
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 clrh6n01                            1 Online, rgmanager
 clrh6n02                            2 Online, Local, rgmanager
 /dev/VolGroupQdisk/lv_qdisk         0 Online, Quorum Disk

 Service Name           Owner (Last)           State
 ------- ----           ----- ------           -----
 service:appRG          clrh6n02               started
 service:dbRG           clrh6n01               started
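If you want to check this output from a script rather than by eye, something along these lines might do. It is only a sketch: `cluster_problems` is a name I made up, and it assumes the clustat layout shown above (a "Member Status:" line, member lines containing "Online" or "Offline", and service lines starting with "service:").

```shell
# Read clustat output on stdin and print one line per problem found.
# No output means the cluster looks the way it does in the listing above.
cluster_problems() {
    awk '
        /^ *Member Status:/ && $3 != "Quorate"      { print "not quorate: " $3 }
        /Offline/                                    { print "member offline: " $1 }
        /Online/ && !/Quorum Disk/ && !/rgmanager/   { print "rgmanager missing on: " $1 }
        $1 ~ /^service:/ && $NF != "started"         { print "service not started: " $1 }
    '
}

# Usage (commented out so the snippet has no side effects):
# clustat | cluster_problems
```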
Lessons learned: do not perform an update without the yum filesystem snapshot plugin (yum-plugin-fs-snapshot, which can use LVM snapshots). This functionality is valuable during system updates and maintenance tasks, and it can save a lot of time and money if an update fails. You should always take a full system backup and keep a proper back-out procedure in reserve.
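Since the whole story started with cluster services left running during the update, a small guard around yum could help on a hurried day. This is a sketch only: `safe_update`, `running_cluster_services` and the `CLUSTER_SERVICES` list are my own assumptions, and it relies on the init scripts returning exit status 0 from `service NAME status` only when the service is running.

```shell
# Init scripts to check before updating; an assumption, adjust to your cluster stack.
CLUSTER_SERVICES="rgmanager cman"

# Print each listed service whose init script reports it as running.
running_cluster_services() {
    for svc in $CLUSTER_SERVICES; do
        if service "$svc" status >/dev/null 2>&1; then
            echo "$svc"
        fi
    done
}

# Run "yum update" only when none of the listed services is running.
safe_update() {
    running=$(running_cluster_services)
    if [ -n "$running" ]; then
        echo "Refusing to update, still running: $running" >&2
        return 1
    fi
    yum update
}
```

Calling `safe_update` instead of `yum update` then fails early with a message instead of updating underneath a live cluster node.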