Document Type | Troubleshooting
Category | Monitoring/Inspection
Applicable Product Versions | 5SP1FS01, 5SP1FS02, 5SP1FS03, 5SP1FS04, 5SP1FS06, 6FS01, 6FS02, 6FS03, 6FS04, 6FS05, 6FS06, 6FS07, 6FS07PS, 7FS01, 7FS02, 7FS02PS
Document Number | TMOTS011
Issue
A server that had been operating normally becomes inaccessible or, if still accessible, shows a sudden drop in performance. The database running on the server also becomes significantly slower than usual.
Cause
When the Linux kernel detects an invalid memory access by a process, it writes a memory dump (core dump) of that process. The dump can cause a sudden spike in resource usage and degrade system performance.
As part of memory protection, the Linux kernel detects invalid accesses as they occur. When one occurs, it logs a segmentation fault or general protection fault message in /var/log/messages, terminates the process with SIGSEGV (segmentation violation), and performs a memory dump (dumping core).
Processes generally use the following two types of memory areas:
- VSS (Virtual Set Size): The entire virtual address space mapped by the process
- RSS (Resident Set Size): The portion of that space currently resident in physical memory
When forcibly terminating a process that attempted an invalid memory access, the Linux kernel does not know which memory region caused the problem, so it generates a dump of the entire VSS area.
During the memory dump, CPU, memory, and disk I/O are consumed in proportion to the size of the VSS. The larger the VSS, the larger the spike in resource usage.
Checking CPU Usage
- Use the sar command with the -q option to check the run queue length and load average.
- The ldavg-1, ldavg-5, and ldavg-15 values are the load averages over the last 1, 5, and 15 minutes, that is, the average number of tasks that are running or waiting to run (on Linux, tasks in uninterruptible sleep are also counted).
- On a 24-core server, a load average well above 24 means that more tasks are demanding CPU than there are cores. At the problem time ldavg-1 reaches 148.43, which is clearly abnormal.
$ sar -q
15:20:01   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
15:20:01        25      7690     20.82     17.63     15.00         0
15:30:01        23      7684     19.23     18.55     16.61         0
15:40:01        13      7694     14.25     14.82     15.41         0
16:26:37         8      7782    148.43    165.40    171.44         0   <- Problem time
16:30:01        10      7789    127.00    146.79    162.90         0   <- Problem time
16:40:01         6      7785     92.69    104.66    133.27         1   <- Problem time
16:50:01        16      7816     62.64     75.12    105.34         0   <- Problem time
17:00:01        15      7815     11.63     27.52     67.23         1
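The comparison between load average and core count described above can be automated. The following is a minimal sketch, assuming the 24-hour timestamp layout of the sample (locales that print AM/PM add a column, which would shift the field numbers). flag_high_load is a hypothetical helper name, not part of sysstat.

```shell
# flag_high_load CORES
# Reads `sar -q` output on stdin and prints the samples whose 1-minute
# load average (4th column) exceeds the given core count.
flag_high_load() {
    awk -v cores="$1" '
        # Data rows start with HH:MM:SS and carry a numeric ldavg-1 field;
        # the header row and the "Average:" row are skipped by this test.
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $4 ~ /^[0-9.]+$/ {
            if ($4 + 0 > cores + 0) print $1, "ldavg-1=" $4
        }'
}

# Demo on rows taken from the sample above (24-core server).
flag_high_load 24 <<'EOF'
15:20:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
15:40:01 13 7694 14.25 14.82 15.41 0
16:26:37 8 7782 148.43 165.40 171.44 0
17:00:01 15 7815 11.63 27.52 67.23 1
EOF
```

On a live system the function could be fed directly, for example with sar -q | flag_high_load "$(nproc)".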
Checking Memory (Swap Out/In)
- Use the sar command with the -W option to check swapping activity.
- When memory usage increases sharply, the swap area is used.
- Moving pages from main memory to the swap area is called swap out, and bringing them back from the swap area to main memory is called swap in.
- Swap out (pswpout/s) and swap in (pswpin/s) both show very high values at the problem time.
$ sar -W
15:30:01  pswpin/s pswpout/s
15:30:01      0.00      0.00
15:40:01      0.00      0.00
16:26:37    276.53   1627.42   <- Problem time
16:30:01     75.30   2330.37   <- Problem time
16:40:01    764.02   2450.31   <- Problem time
16:50:01    599.91   1903.95   <- Problem time
17:00:01    573.69   1108.47   <- Problem time
17:10:01     98.03      0.00
17:20:01    102.21      0.00
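To pick out the intervals with heavy swap-out from a longer sar -W report, a filter along the following lines may help. This is a sketch only: flag_swap_out is a hypothetical helper, the threshold is an arbitrary value chosen for illustration, and the column positions are assumed to match the sample above.

```shell
# flag_swap_out THRESHOLD
# Reads `sar -W` output on stdin and prints the samples whose pswpout/s
# (3rd column) is above THRESHOLD pages per second.
flag_swap_out() {
    awk -v limit="$1" '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $3 ~ /^[0-9.]+$/ {
            if ($3 + 0 > limit + 0)
                print $1, "pswpin/s=" $2, "pswpout/s=" $3
        }'
}

# Demo on rows taken from the sample above; 1000 pages/s is an assumed threshold.
flag_swap_out 1000 <<'EOF'
15:30:01 pswpin/s pswpout/s
15:40:01 0.00 0.00
16:26:37 276.53 1627.42
17:10:01 98.03 0.00
EOF
```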
Checking Linux Kernel Detection (/var/log/messages)
- The mxg_tib (PID: 70809) process receives the SIGSEGV signal and terminates abnormally.
Apr 30 15:33:08 BPOTDB01 abrt-hook-ccpp: Process 70809 (mxg_tib) of user 1002 killed by SIGSEGV - dumping core
Apr 30 15:33:09 BPOTDB01 abrt-server: Executable '/home/maxgauge/semas241/bin/mxg_tib' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Apr 30 15:33:09 BPOTDB01 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2025-04-30-15:33:08-70809' exited with 1
Apr 30 15:33:09 BPOTDB01 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2025-04-30-15:33:08-70809'
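Entries like the first log line above can be extracted from a large log file with a short filter. The sketch below assumes the abrt-hook-ccpp message format exactly as it appears in this sample; systems that do not use abrt log crashes differently. list_segv_dumps is a hypothetical helper name.

```shell
# list_segv_dumps [FILE...]
# Prints time, PID, command name, and UID for every abrt-hook-ccpp
# "killed by SIGSEGV" entry in the given files (or stdin).
list_segv_dumps() {
    awk '/abrt-hook-ccpp:/ && /killed by SIGSEGV/ {
        name = $8
        gsub(/[()]/, "", name)      # "(mxg_tib)" -> "mxg_tib"
        print $1, $2, $3, "pid=" $7, "comm=" name, "uid=" $11
    }' "$@"
}

# Demo on the log line from the sample above.
list_segv_dumps <<'EOF'
Apr 30 15:33:08 BPOTDB01 abrt-hook-ccpp: Process 70809 (mxg_tib) of user 1002 killed by SIGSEGV - dumping core
EOF
```

Against the real log this would typically be run as root, for example list_segv_dumps /var/log/messages.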
- Afterward, the kernel logs show that the database processes (tbsvr), which perform the server's key tasks, were blocked for more than 120 seconds.
Apr 30 15:52:40 BPOTDB01 kernel: INFO: task tbsvr:22059 blocked for more than 120 seconds.
Apr 30 15:52:40 BPOTDB01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 30 15:52:40 BPOTDB01 kernel: tbsvr D ffff8d91c015acc0 0 22059 21530 0x00000080
Apr 30 15:52:40 BPOTDB01 kernel: Call Trace:
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffff83f87169>] schedule+0x29/0x70
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffff83f88b55>] rwsem_down_read_failed+0x105/0x1c0
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffff83b97528>] call_rwsem_down_read_failed+0x18/0x30
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffff83f86450>] down_read+0x20/0x40
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffff83f8e8fd>] __do_page_fault+0x4bd/0x500
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffff83f8e975>] do_page_fault+0x35/0x90
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffff83f8a778>] page_fault+0x28/0x30
Apr 30 15:52:40 BPOTDB01 kernel: INFO: task tbsvr:25337 blocked for more than 120 seconds.
Apr 30 15:52:40 BPOTDB01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 30 15:52:40 BPOTDB01 kernel: tbsvr D ffff8d91c005acc0 0 25337 21530 0x00000080
Apr 30 15:52:40 BPOTDB01 kernel: Call Trace:
Apr 30 15:52:40 BPOTDB01 kernel: [<ffffffffc065ce4e>] ? bond_start_xmit+0x1be/0x420 [bonding]
... omitted ...
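The hung-task reports above can be reduced to one line per blocked task, which makes it easier to see which processes were affected and when. This is a sketch that relies only on the "task NAME:PID blocked for more than N seconds" wording seen in the sample; list_hung_tasks is a hypothetical helper name.

```shell
# list_hung_tasks [FILE...]
# Prints the timestamp and NAME:PID of every kernel hung-task report
# found in the given files (or stdin).
list_hung_tasks() {
    awk '/blocked for more than [0-9]+ seconds/ {
        # The word after "task" is NAME:PID; search for it rather than
        # hard-coding a field number.
        for (i = 1; i < NF; i++)
            if ($i == "task") { print $1, $2, $3, $(i + 1); break }
    }' "$@"
}

# Demo on two lines from the sample above.
list_hung_tasks <<'EOF'
Apr 30 15:52:40 BPOTDB01 kernel: INFO: task tbsvr:22059 blocked for more than 120 seconds.
Apr 30 15:52:40 BPOTDB01 kernel: Call Trace:
EOF
```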
- This confirms that the Linux kernel detected an invalid memory access by the mxg_tib process and forcibly terminated it while performing a memory dump.
- The process information captured at the problem time is as follows.
- The mxg_tib process running at the time had a very large VSS (VIRT 130.7g) compared with its RSS (RES 650664 KiB), so the memory dump took a long time and resource usage increased sharply.
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
70809 maxgauge  20   0  130.7g 650664 632136 S  18.2  0.2   9:20.34 mxg_tib -c semas241 -r -D
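For a process that is still running, the same two figures can be read without top from the VmSize and VmRSS fields of /proc/<pid>/status, which correspond to the VIRT and RES columns. The following is a minimal sketch; vss_rss is a hypothetical helper name.

```shell
# vss_rss STATUS_FILE
# Prints VmSize (VSS) and VmRSS (RSS), both in kB, from a
# /proc/<pid>/status file.
vss_rss() {
    awk '$1 == "VmSize:" { vss = $2 }
         $1 == "VmRSS:"  { rss = $2 }
         END { printf "VSS=%s kB RSS=%s kB\n", vss, rss }' "$1"
}

# Demo: inspect the current shell itself (Linux only).
if [ -r "/proc/$$/status" ]; then
    vss_rss "/proc/$$/status"
fi
```

A process whose VSS is orders of magnitude larger than its RSS, as in the sample above, is a candidate for a very large dump.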
Solutions
The Linux kernel performs memory dumps on abnormal memory access, but in most cases, a full dump is not necessary.
Dumping the entire VSS causes excessive resource usage (CPU, memory, disk I/O), leading to server performance degradation and disk thrashing.
Therefore, it is recommended to limit the core dump size.
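As a quick way to see such a limit in effect without editing any configuration file, the shell's ulimit builtin can set the core file size limit for a single session. The sketch below runs in a subshell so that the calling shell keeps its own limit; it illustrates the mechanism only and is not a persistent setting.

```shell
# Run in a subshell so the caller's own limit is left untouched.
(
    ulimit -c 0     # set the core file size limit to 0 for this subshell
    ulimit -c       # print the limit now in effect
)
```

Processes started from a shell inherit its limits, which is why a process must be restarted before a changed limit applies to it.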
Processes Constituting the Database
In the Tibero database, when abnormal behavior is detected internally, its own monitoring process recognizes this and automatically records BackTrace logs.
- Log path: $TB_HOME/instance/$TB_SID
- Log file: tbsvr.callstack.%PID
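To check whether any such backtrace logs exist, the directory can be listed newest first. This is a sketch that assumes only the path and file name pattern given above; list_callstacks is a hypothetical helper name.

```shell
# list_callstacks DIR
# Lists tbsvr.callstack.<PID> files in DIR, newest first, or prints a
# short message when none are present.
list_callstacks() {
    ls -t "$1"/tbsvr.callstack.* 2>/dev/null ||
        echo "no tbsvr.callstack.* files in $1"
}

# Demo: runs only when the Tibero environment variables are set.
if [ -n "${TB_HOME:-}" ] && [ -n "${TB_SID:-}" ]; then
    list_callstacks "$TB_HOME/instance/$TB_SID"
fi
```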
Note
How to Limit Core Dumps of Processes with Abnormal Memory Access in the Linux Kernel
Modify the /etc/security/limits.conf configuration file.
The changes will only take effect after the process is restarted.
Checking Memory Dump Size
Identify the PID of the process you want to check.
Check the memory dump size applied through the process PID.
The "Max core file size" item corresponds to the memory dump size. If it is set to unlimited, it means there is no limit.
$ ps -ef |grep tbsvr
tibero 1111133       1 0 May14 pts/3 00:00:26 tbsvr -t NORMAL -SVR_SID psource1
tibero 1111134 1111133 0 May14 pts/3 00:00:00 /tibero/tibero_engine/bin/tblistener -n 11 -t NORMAL -SVR_SID psource1
tibero 1111135 1111133 0 May14 pts/3 00:00:00 tbsvr_MGWP -t NORMAL -SVR_SID psource1
tibero 1111136 1111133 0 May14 pts/3 00:00:00 tbsvr_FGWP000 -t NORMAL -SVR_SID psource1
tibero 1111137 1111133 0 May14 pts/3 00:00:00 tbsvr_FGWP001 -t NORMAL -SVR_SID psource1
tibero 1111138 1111133 0 May14 pts/3 00:00:00 tbsvr_FGWP002 -t NORMAL -SVR_SID psource1
... omitted ...
$ cat /proc/1111133/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       159739               159739               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
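When only the core limit is of interest, the relevant row can be pulled out of the limits file directly. The following is a minimal sketch that assumes the column layout shown above; core_limit is a hypothetical helper name.

```shell
# core_limit LIMITS_FILE
# Prints the soft and hard "Max core file size" values from a
# /proc/<pid>/limits file.
core_limit() {
    awk '/^Max core file size/ { print "soft=" $5, "hard=" $6 }' "$1"
}

# Demo: inspect the current shell itself (Linux only).
if [ -r "/proc/$$/limits" ]; then
    core_limit "/proc/$$/limits"
fi
```

For the database process in the example above, the call would be core_limit /proc/1111133/limits.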
Applying Memory Dump Size Limit
It can be applied per user on the server.
Since the database is configured under the tibero user on the server, apply the memory dump size limit to the tibero account.
After modifying the configuration file, the change does not take effect immediately. It is applied after the process is restarted. Because limits.conf is read by pam_limits when a login session starts, log in again as the tibero user and restart the database from that session.
# cat /etc/security/limits.conf
tibero soft core 0
tibero hard core 0

!! Restart the database !!

$ ps -ef|grep tbsvr
tibero 1583929       1 14 06:49 pts/1 00:00:00 tbsvr -t NORMAL -SVR_SID psource1
tibero 1583936 1583929  0 06:50 pts/1 00:00:00 tbsvr_MGWP -t NORMAL -SVR_SID psource1
tibero 1583937 1583929  0 06:50 pts/1 00:00:00 tbsvr_FGWP000 -t NORMAL -SVR_SID psource1
tibero 1583938 1583929  0 06:50 pts/1 00:00:00 tbsvr_FGWP001 -t NORMAL -SVR_SID psource1
tibero 1583939 1583929  0 06:50 pts/1 00:00:00 tbsvr_FGWP002 -t NORMAL -SVR_SID psource1
tibero 1583940 1583929  0 06:50 pts/1 00:00:00 tbsvr_FGWP003 -t NORMAL -SVR_SID psource1
tibero 1583941 1583929  0 06:50 pts/1 00:00:00 tbsvr_FGWP004 -t NORMAL -SVR_SID psource1
tibero 1583942 1583929  0 06:50 pts/1 00:00:00 tbsvr_FGWP005 -t NORMAL -SVR_SID psource1
tibero 1583943 1583929  0 06:50 pts/1 00:00:00 tbsvr_FGWP006 -t NORMAL -SVR_SID psource1
$ cat /proc/1583929/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    0                    bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       159739               159739               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
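Before restarting the database, it can be worth confirming that the entries were written as intended. The sketch below lists the core entries for one user in a limits.conf-style file. It matches the user name literally, so wildcard (*) and group (@group) entries, as well as files under /etc/security/limits.d, are not considered. core_entries is a hypothetical helper name.

```shell
# core_entries LIMITS_CONF USER
# Prints "<type> <value>" for each non-comment "core" entry that names
# USER in a limits.conf-style file.
core_entries() {
    awk -v user="$2" '
        /^[ \t]*#/ { next }                          # skip comment lines
        $1 == user && $3 == "core" { print $2, $4 }  # e.g. "soft 0"
    ' "$1"
}

# Demo: check the tibero account on this host, if the file is readable.
if [ -r /etc/security/limits.conf ]; then
    core_entries /etc/security/limits.conf tibero
fi
```

With the two entries shown above in place, the function would print one line for the soft limit and one for the hard limit, both with the value 0.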