Document Type | Technical Information
Category | Administration
Applicable Product Versions | Tibero6, Tibero7
Document Number | TADTI200
Overview
This document describes the TAC Split Brain that occurs when users do not follow the startup procedure in a TAC environment that uses the Veritas Cluster File System (hereafter: VxCFS).
Method
VxCFS
- VxCFS is Veritas's cluster file system solution.
- With VxCFS, raw devices can be clustered and used as a shared file system.
- It plays a role similar to Oracle ASM or Tibero TAS; TAS, however, cannot be used as a file system.
TAC/VxCFS
- When configuring TAC through the VxCFS shared file system, TAC startup must proceed only after confirming that VxCFS has started normally.
- The VxCFS mount (mount_vxfs) is performed automatically by the Veritas solution; it is mentioned here only for explanation purposes.
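The confirmation described above can be sketched as a small pre-start check. This is a minimal sketch, not part of any Tibero or Veritas tooling: it assumes a Linux host whose mount table follows the /proc/mounts layout and reports VxCFS mounts with the type `vxfs`, and the device and CMFILE paths in the usage note are hypothetical.

```python
def fstype_for_path(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is assumed to use the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    best_point, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # The longest matching mount point is the one that really holds the path.
            if len(mount_point) > len(best_point):
                best_point, best_type = mount_point, fstype
    return best_type


def cmfile_is_on_vxfs(cm_path, mounts_text):
    """True only if the CMFILE path currently resolves to a vxfs mount (assumed type name)."""
    return fstype_for_path(cm_path, mounts_text) == "vxfs"
```

On a real host the mount table could be read with `open("/proc/mounts").read()`. If the check returns False for a path such as the hypothetical `/tac_shared/cm/cmfile`, the path still resolves to the local file system and the Cluster Manager should not be started yet.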
Startup Sequence in VxCFS Usage Environment
1) Start the operating system
2) Connect the operating system device
3) Perform Veritas startup
4) Complete Veritas startup
5) Start Tibero Cluster Manager
6) Confirm Tibero Cluster Manager
7) Start Tibero Database Instance
8) Confirm Tibero Database Instance
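To illustrate why the order matters, the following sketch checks a recorded sequence of step numbers against the eight steps above. It is only an illustration under a simple assumption: every step requires all lower-numbered steps to have been performed first. The function name is hypothetical.

```python
def out_of_order_steps(executed):
    """Return (step, missing_prerequisite) pairs for steps that ran too early.

    `executed` is the list of step numbers (1 to 8) in the order they were run.
    """
    violations = []
    done = set()
    for step in executed:
        for prereq in range(1, step):
            if prereq not in done:
                violations.append((step, prereq))
                break  # report only the first missing prerequisite
        done.add(step)
    return violations
```

For example, `out_of_order_steps([1, 2, 5, 6, 3, 4, 7, 8])` flags steps 5 and 6 as having run before step 3, while the sequence 1 through 8 in order yields an empty list.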
TAC/VxCFS Split Brain
- When the startup procedure is followed, VxCFS does not cause Split Brain.
- It can occur when the OS of each node is rebooted one at a time while all TAC nodes are running.
TAC Node 1 OS/Veritas/TAC Shutdown
Shutdown Sequence in VxCFS Usage Environment
1) Stop Tibero Database Instance
2) Confirm Tibero Database Instance stop
3) Stop Tibero Cluster Manager
4) Confirm Tibero Cluster Manager stop
5) Stop Veritas
6) Confirm Veritas stop
7) Stop the operating system
TAC Node 1 OS/Veritas/TAC Startup
- Split Brain can occur even when the operating system has started and the OS-related devices are properly connected.
Split Brain Occurrence Sequence
- Split Brain occurs when startup steps 1) through 8) are not performed in order.
- If steps "5)" and "6)" are executed before step "3)", the Tibero Cluster Manager behaves abnormally.
1) Start the operating system
- Starts normally
2) Connect the operating system device
- Confirmed normally
5) Start Tibero Cluster Manager
- The Cluster Manager needs to access the CMFILE on VxCFS, but because Veritas has not started, it accesses the CMFILE path on the local file system instead
- Because it cannot read the CMFILE used by node 2, node 1 starts as the Master node
6) Confirm Tibero Cluster Manager
- CMFILE is created normally and the cluster is UP
3) Perform Veritas startup
- Veritas starts and the VxCFS path where the CMFILE was originally used becomes active
- Although the original CMFILE path is now active, the local CMFILE from step "5)" remains in use and the Cluster Manager stays UP
4) Complete Veritas startup
- Veritas reports that it has started normally
7) Start Tibero Database Instance
- When tbboot is performed on node 1, nodes 1 and 2 are looking at different CMFILEs
- In this situation, the TAC nodes cannot exchange normal signals with each other
8) Confirm Tibero Database Instance
- Both nodes start in normal mode, but as soon as I/O occurs on either node, an internal error occurs and an instance goes down
- The TSNs in the REDO of each node differ, so the TSN information in the CF and DF no longer matches (this can occur on node 1 or node 2)
- Nodes 1 and 2 cannot share CMFILE and consider the other node down
- I/O on node 1 updates REDO logs and TSNs of CF and DF
- Node 1 sees node 2 as down and does not share information
- Node 2 encounters internal error because TSNs of CF and DF differ from its REDO
Internal Error [Code:RV_0ZQNII] with condition '((tsnval_t)((((uint64_t)(uint16_t)(tsn_wrap(low_cache->tsn))) << 32) | \ ((uint64_t)(uint32_t)(tsn_base(low_cache->tsn))))) >= ((tsnval_t)((((uint64_t)(uint16_t)(tsn_wrap(org->low_cache.tsn))) << 32) | \ ((uint64_t)(uint32_t)(tsn_base(org->low_cache.tsn)))))' (2 args) (cf_ckpt.c:105:cf_put_ckpt_progress) (pid=20413, sessid=284, tid=284)
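The arithmetic in the failed condition can be read as follows: each TSN is turned into a single 64-bit value from a 16-bit wrap part and a 32-bit base part, and the left-hand value must be greater than or equal to the right-hand value. The sketch below only mirrors that arithmetic as it appears in the message text. The function and parameter names are hypothetical, and which side belongs to which node is not stated in the message.

```python
def tsn_value(wrap, base):
    """Combine a TSN's wrap and base parts the way the condition text does.

    Mirrors ((uint64_t)(uint16_t)wrap << 32) | (uint64_t)(uint32_t)base.
    """
    return ((wrap & 0xFFFF) << 32) | (base & 0xFFFFFFFF)


def condition_holds(left_tsn, right_tsn):
    """Evaluate the '>=' comparison from the message on two (wrap, base) pairs."""
    return tsn_value(*left_tsn) >= tsn_value(*right_tsn)
```

With illustrative values, `condition_holds((0, 500), (0, 900))` is False, which corresponds to the situation where one side holds a lower TSN than the value already recorded and the internal error is raised.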
Countermeasures
This phenomenon occurs when the shutdown/startup procedures are not followed.
During startup, it is essential to confirm that each solution has started normally before proceeding to the next step.
If a TAC Split Brain caused by not following the startup procedure has damaged database consistency, the following procedures can be performed:
- Verify consistency of FULL SCAN COUNT for all tables
- If damaged tables are found, perform recovery using database backup
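The full scan check above could be prepared with a small helper that generates one counting statement per table. This is a sketch under assumptions: it assumes the Oracle-style `FULL` hint is accepted so that the count reads the table blocks rather than an index, and the owner and table names in the usage note are hypothetical. How the statements are executed and what they are compared against is left to the administrator.

```python
def full_scan_count_sql(owner, table):
    """Build a COUNT(*) statement that requests a full table scan (assumed FULL hint)."""
    return f'SELECT /*+ FULL(t) */ COUNT(*) FROM "{owner}"."{table}" t'


def full_scan_statements(tables):
    """Build one statement per (owner, table) pair, preserving the input order."""
    return [full_scan_count_sql(owner, table) for owner, table in tables]
```

For example, `full_scan_statements([("TIBERO", "EMP"), ("TIBERO", "DEPT")])` returns two statements. A table whose statement fails, or whose count differs from a known-good value, would be a candidate for recovery from backup.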