Document Type | Technical Information
Category | Administration
Applicable Product Version | 7FS02PS
Document Number | TADTI019
Overview
This guide explains how to protect the CM process and maintain system stability when a failure occurs while the CM Fencing feature is enabled. It also provides instructions on setting the CM_FENCE parameter and presents test scenarios.
Note
Test OS / CM Version: Rocky Linux 8.10 / TBCM 7.1.1 (Build 277758)
Set the CM_FENCE parameter to Y after the DB has been installed and started.
Method
Perform this after installing CM and DB.
Modify $CM_SID.tip
su - root
[root@tac1 ~]# vi $CM_HOME/config/$CM_SID.tip
CM_FENCE=Y
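The manual edit above can also be scripted. The following is a minimal sketch, assuming the tip file holds one KEY=VALUE setting per line; the function name set_cm_fence is hypothetical and is not part of TBCM. It keeps a backup copy, rewrites an existing CM_FENCE line, and appends the setting if the key is absent.

```shell
# Hypothetical helper (not part of TBCM): set CM_FENCE=Y in a tip file.
# Assumption: the file holds one KEY=VALUE setting per line.
set_cm_fence() {
    local tip_file="$1"
    [ -f "$tip_file" ] || { echo "tip file not found: $tip_file" >&2; return 1; }
    # Keep a backup of the original configuration.
    cp -p "$tip_file" "$tip_file.bak" || return 1
    if grep -q '^[[:space:]]*CM_FENCE[[:space:]]*=' "$tip_file"; then
        # Rewrite the existing setting, reading from the backup copy.
        sed 's/^[[:space:]]*CM_FENCE[[:space:]]*=.*/CM_FENCE=Y/' \
            "$tip_file.bak" > "$tip_file"
    else
        # Make sure the last line is newline-terminated before appending.
        if [ -s "$tip_file" ] && [ -n "$(tail -c 1 "$tip_file")" ]; then
            printf '\n' >> "$tip_file"
        fi
        printf 'CM_FENCE=Y\n' >> "$tip_file"
    fi
}
```

Example call, run as root: set_cm_fence "$CM_HOME/config/$CM_SID.tip"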
When CM starts, the message "CM-fence enabled." is displayed.
[root@tac1 ~]# tbcm -b
CM Guard daemon started up.
CM-fence enabled.
import resources from '/cm/cmresource'...
Verify that the parameter has been applied with cmrctl show param.
[root@tac1 ~]# cmrctl show param
Parameter Resource Info
================================================
CM_FENCE : Y
================================================
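For scripted checks, the verification can be reduced to an exit status. The sketch below assumes the output keeps the "CM_FENCE : Y" layout shown above; cm_fence_enabled is a hypothetical helper name. It reads the command output on standard input and succeeds only when the CM_FENCE value is Y.

```shell
# Hypothetical helper: read `cmrctl show param` output on stdin and
# succeed only if it contains a "CM_FENCE : Y" line.
# Assumption: the output uses the "NAME : VALUE" layout shown in this guide.
cm_fence_enabled() {
    awk -F':' '
        $1 ~ /^[ \t]*CM_FENCE[ \t]*$/ {
            value = $2
            gsub(/[ \t\r]/, "", value)   # strip surrounding whitespace
            found = (value == "Y")
        }
        END { exit(found ? 0 : 1) }
    '
}
```

Example call: cmrctl show param | cm_fence_enabled && echo "CM fencing is enabled"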
FENCING Feature Test Scenarios
Note
The following two cases are example scenarios performed for testing purposes.
Case 1. CM file access unavailable
Detach the disk containing the CM file from the VM. (Remove the disk in the VM settings of the console.)
CM log of the VM node with the detached disk
2025/03/26 10:35:45.987 [2] cm_fd.c :0318(05) [cls:0] cmhead read error(5, Input/output error) r_size=-1
2025/03/26 10:35:45.988 [2] cm_fd.c :0318(07) [cls:2] cmhead read error(5, Input/output error) r_size=-1
2025/03/26 10:35:45.993 [1] cm_actio:1194(04) [cls] [ERROR] Cannot access to enough CM files (1/3). SHUTDOWN...
2025/03/26 10:35:45.994 [1] cm_actio:1201(04) [cls] FENCE notify to CM_GUARD (cls=cls)
2025/03/26 10:35:46.194 [1] cm_actio:7526(04) [cls] [WARNING] VIP 'tac2_vip' status 5, global status 12, intr stat 0. Forcibly clearing...
2025/03/26 10:35:46.194 [1] cm_actio:6011(04) [cls] [VIP] (tac1_vip) prework start! vip:192.168.56.11, port:8629 (svc: tac)
2025/03/26 10:35:46.194 [1] cm_actio:5415(04) [cls] Executing command for the instance(tas1) with default environment variable profile
2025/03/26 10:35:46.195 [1] cm_actio:5443(04) [cls] EXECUTE CMD: dbctl_for_cm.sh down abnormal
2025/03/26 10:35:46.194 [1] cm_actio:5415(04) [cls] Executing command for the instance(tac1) with default environment variable profile
2025/03/26 10:35:46.195 [1] cm_actio:5443(04) [cls] EXECUTE CMD: dbctl_for_cm.sh down abnormal
2025/03/26 10:35:46.198 [1] cm_actio:3525(04) [cls] [INST] VIP_LOST msg
2025/03/26 10:35:46.198 [1] cm_actio:3943(04) [cls] [VIP] (tac1_vip) prework done! vip:192.168.56.11, port:8629 (svc: tac)
2025/03/26 10:35:46.241 [2] cm_netwo:0386(00) connection closed. fd:19
2025/03/26 10:35:46.244 [1] cm_util.:0395(04) [cls] start exec ifconfig ens33:1 down
2025/03/26 10:35:46.247 [1] cm_util.:0415(04) [cls] exec ifconfig ens33:1 down success. exit status 255
2025/03/26 10:35:46.248 [1] cm_actio:5039(00) cmd execution rc: -1 (100 < rc < 106)
2025/03/26 10:35:46.248 [1] cm_actio:5289(00) Resource tac1_vip down SUCCESS
2025/03/26 10:35:46.263 [2] cm_netwo:0386(00) connection closed. fd:18
2025/03/26 10:35:46.430 [1] cm_actio:5462(04) [cls] EXECUTE RESULT: 0
2025/03/26 10:35:46.430 [1] cm_actio:5039(00) cmd execution rc: 0 (100 < rc < 106)
2025/03/26 10:35:46.430 [1] cm_actio:5137(00) Resource 'tas1' down SUCCESS (mode: ABNORMAL)
2025/03/26 10:35:46.463 [1] cm_act_s:1235(04) [cls] [SERVICE] New incar no. 3 for service tas
2025/03/26 10:35:46.528 [1] cm_actio:5462(04) [cls] EXECUTE RESULT: 0
2025/03/26 10:35:46.528 [1] cm_actio:5039(00) cmd execution rc: 0 (100 < rc < 106)
2025/03/26 10:35:46.528 [1] cm_actio:5137(00) Resource 'tac1' down SUCCESS (mode: ABNORMAL)
2025/03/26 10:35:46.670 [1] cm_act_s:1281(04) [cls] [SERVICE] incar no. 3 for service tas acked by all scheduled instances
2025/03/26 10:35:46.670 [1] cm_act_s:1235(04) [cls] [SERVICE] New incar no. 3 for service tac
2025/03/26 10:35:46.672 [1] cm_actio:7526(04) [cls] [WARNING] VIP 'tac2_vip' status 6, global status 12, intr stat 0. Forcibly clearing...
2025/03/26 10:35:46.672 [1] cm_actio:7551(04) [cls] all cluster resource down
2025/03/26 10:35:46.675 [2] cm_netwo:0386(00) connection closed. fd:16
2025/03/26 10:35:46.675 [2] cm_netwo:0497(00) delayed close done. fd:16
2025/03/26 10:35:46.988 [1] cm_file.:1022(05) [cls:0] Exit FILEIO thread for +0
2025/03/26 10:35:46.995 [1] cm_file.:1022(07) [cls:2] Exit FILEIO thread for +2
2025/03/26 10:35:47.004 [1] cm_file.:1007(06) [cls:1] Write file down notify!
2025/03/26 10:35:47.005 [2] cm_fd.c :0318(06) [cls:1] cmhead read error(0, Success) r_size=0
2025/03/26 10:35:47.005 [1] cm_file.:0778(06) [cls:1] [ERROR] File hb write failed! size_write = -1, hb_size = 1024
2025/03/26 10:35:47.005 [1] cm_file.:1022(06) [cls:1] Exit FILEIO thread for +1
2025/03/26 10:35:47.687 [1] cm_actio:9346(04) [cls] [ACTION] finish loop
CM Guard log
2025/03/26 10:35:45.994 [1] cm_guard:0589(00) [CM_GUARD] FENCE notify from CM (cls=cls)
2025/03/26 10:35:45.997 [1] cm_guard:0459(00) [CM_GUARD] resource release on behalf of CM
2025/03/26 10:35:45.999 [1] cm_util.:0393(00) start exec ifconfig ens33:1 down
2025/03/26 10:35:46.002 [1] cm_util.:0412(00) exec ifconfig ens33:1 down success. exit status 0
2025/03/26 10:35:46.003 [2] cm_vip.c:0661(00) VIP 192.168.56.11 removed from ens33:1
2025/03/26 10:35:48.004 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2254
2025/03/26 10:35:48.010 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2261
2025/03/26 10:35:48.012 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2262
2025/03/26 10:35:48.014 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2263
2025/03/26 10:35:48.016 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2264
2025/03/26 10:35:48.017 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2265
2025/03/26 10:35:48.018 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2266
2025/03/26 10:35:48.019 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2267
2025/03/26 10:35:48.020 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2268
2025/03/26 10:35:48.021 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2423
2025/03/26 10:35:48.022 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2437
2025/03/26 10:35:48.023 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2438
2025/03/26 10:35:48.025 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2439
2025/03/26 10:35:48.026 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2440
2025/03/26 10:35:48.027 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2441
2025/03/26 10:35:48.028 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2442
2025/03/26 10:35:48.029 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2443
2025/03/26 10:35:48.029 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2444
2025/03/26 10:35:48.030 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2445
2025/03/26 10:35:48.031 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2446
2025/03/26 10:35:48.032 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2447
2025/03/26 10:35:48.033 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2448
2025/03/26 10:35:48.034 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2449
2025/03/26 10:35:48.035 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2450
2025/03/26 10:35:48.147 [1] cm_guard:0490(00) [CM_GUARD] kill CM process with pid '1842'
2025/03/26 10:35:48.148 [1] cm_guard:0504(00) [CM_GUARD] is going to reboot system.
Afterwards, the server reboots.
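When reviewing a CM Guard log after a fencing test, it can help to condense the sequence into a few counters. The sketch below is one way to do that, assuming the log has one entry per line and uses the message texts shown in this guide; summarize_guard_log is a hypothetical helper name, and the log file location is not specified here.

```shell
# Hypothetical helper: summarize a CM Guard log read from stdin.
# Assumption: one log entry per line, with the message texts shown in this guide.
summarize_guard_log() {
    awk '
        /\[CM_GUARD\] FENCE notify from CM/      { notify++ }
        /\[CM_GUARD\] Heartbeat from CM expired/ { expired++ }
        /\[CM_GUARD\] kill process with pid/     { killed++ }
        /\[CM_GUARD\] kill CM process with pid/  { cm_killed++ }
        /\[CM_GUARD\] is going to reboot system/ { reboot++ }
        END {
            # Counters that never matched print as 0.
            printf "fence_notify=%d hb_expired=%d killed=%d cm_killed=%d reboot=%d\n",
                   notify, expired, killed, cm_killed, reboot
        }
    '
}
```

Example call, with the path as a placeholder: summarize_guard_log < /path/to/cm_guard_log_file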
Case 2. CM process hang (File H/B expired)
Attach to the CM process using gdb. While gdb is attached, the CM process is stopped, which simulates a hang.
$ ps -ef | grep -v grep | grep -v guard|grep $CM_SID | awk '{print $2}'
$ gdb
(gdb) attach [CM PID]
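The PID lookup above can be wrapped in a small function so that the result is easy to reuse. This is a minimal sketch that mirrors the same filters (drop grep and guard lines, keep lines that mention the CM SID); find_cm_pid is a hypothetical helper name, and it assumes the standard `ps -ef` column layout with the PID in the second column.

```shell
# Hypothetical helper: pick CM PIDs out of `ps -ef` output read from stdin.
# Mirrors the pipeline above: skip grep and guard lines, keep lines that
# mention the given CM SID, and print column 2 (the PID).
find_cm_pid() {
    awk -v sid="$1" '
        NR == 1 && $2 == "PID" { next }   # skip the ps header line
        /grep/ || /guard/      { next }
        index($0, sid) > 0     { print $2 }
    '
}

# Usage:
#   pid=$(ps -ef | find_cm_pid "$CM_SID")
#   gdb -p "$pid"    # attaching stops the process until gdb detaches or quits
```

Check that exactly one PID is returned before attaching, because any other process whose command line contains the SID would also match.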
CM Guard log of the node with the expired H/B
2025/03/27 10:52:24.119 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 25% expire count
2025/03/27 10:52:57.472 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 50% expire count
2025/03/27 10:53:28.802 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 75% expire count
2025/03/27 10:53:49.022 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 90% expire count
2025/03/27 10:54:01.162 [1] cm_guard:0863(00) [EXPIRE] Heartbeat from CM was expired! (last HB: 3914.211721, current time: 4044.616027)
2025/03/27 10:54:01.166 [1] cm_guard:0935(00) [CM_GUARD] Heartbeat from CM expired...
2025/03/27 10:54:03.175 [1] cm_guard:0459(00) [CM_GUARD] resource release on behalf of CM
2025/03/27 10:54:03.179 [1] cm_util.:0393(00) start exec ifconfig ens33:1 down
2025/03/27 10:54:03.186 [1] cm_util.:0412(00) exec ifconfig ens33:1 down success. exit status 0
2025/03/27 10:54:03.189 [2] cm_vip.c:0661(00) VIP 192.168.56.11 removed from ens33:1
2025/03/27 10:54:05.192 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8944
2025/03/27 10:54:05.196 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8957
2025/03/27 10:54:05.199 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8958
2025/03/27 10:54:05.202 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8959
2025/03/27 10:54:05.204 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8960
2025/03/27 10:54:05.206 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8961
2025/03/27 10:54:05.208 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8962
2025/03/27 10:54:05.209 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8963
2025/03/27 10:54:05.211 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8964
2025/03/27 10:54:05.212 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9312
2025/03/27 10:54:05.213 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9348
2025/03/27 10:54:05.221 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9349
2025/03/27 10:54:05.233 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9350
2025/03/27 10:54:05.286 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9352
2025/03/27 10:54:05.295 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9353
2025/03/27 10:54:05.301 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9354
2025/03/27 10:54:05.308 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9355
2025/03/27 10:54:05.240 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9351
2025/03/27 10:54:05.310 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9356
2025/03/27 10:54:05.312 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9357
2025/03/27 10:54:05.314 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9358
2025/03/27 10:54:05.315 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9359
2025/03/27 10:54:05.318 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9360
2025/03/27 10:54:05.320 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9361
2025/03/27 10:54:05.599 [1] cm_guard:0490(00) [CM_GUARD] kill CM process with pid '1684'
Afterwards, the server reboots.
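The expiry entry in the CM Guard log reports two clock readings, "last HB" and "current time". Their difference shows how long the heartbeat was missing when CM Guard declared it expired. The sketch below extracts both values and subtracts them; hb_gap_seconds is a hypothetical helper name, and it assumes both readings are in seconds, which is consistent with the wall-clock spacing of the 25%, 50%, 75%, and 90% warnings in the log.

```shell
# Hypothetical helper: print the gap between "last HB" and "current time"
# from "[EXPIRE] Heartbeat from CM was expired!" lines read from stdin.
# Assumption: both readings are expressed in seconds.
hb_gap_seconds() {
    sed -n 's/.*last HB: \([0-9.]*\), current time: \([0-9.]*\)).*/\1 \2/p' |
        awk '{ printf "%.3f\n", $2 - $1 }'
}
```

For the expiry entry in this test (last HB: 3914.211721, current time: 4044.616027), the function prints 130.404, that is, the heartbeat had been missing for about 130 seconds.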