Document Type | Technical Information
Administration | Administration
Applicable Product Version | 7FS02PS
Document Number | TADTI176
Overview
This chapter explains sampling, GRANULARITY, statistics replication, and automatic statistics collection methods.
Methods
1. Sampling
Tibero uses a method called Row sampling for tables when performing sampling for statistics collection. In the case of Row sampling, even if the sampling percentage is low, most blocks are scanned due to its nature, so it takes a long time when dealing with large volumes.
A sample size is set to calculate statistics, and the ratio of rows to be sampled is determined according to the ESTIMATE_PERCENT ratio. Collecting statistics on all rows of a table can gather accurate statistical information, but if the sample size is large, the time and resources required for statistics collection increase accordingly. In other words, even though statistics collection involves sampling some data, if the volume is large, the sampled data itself becomes a significant amount.
Tibero determines an appropriate sample size for each table, but since the automatically determined sample size is small, it is efficient to set an appropriate sample size in situations where data distribution is uneven.
1.1 ESTIMATE_PERCENT
In operational systems, DBAs often set their own ESTIMATE_PERCENT values based on the number of rows to collect statistics. While increasing the sampling percentage results in more precise statistical information, performance degradation proportional to or greater than the sampling ratio during statistics collection is unavoidable. Therefore, it is necessary to find an appropriate sampling percentage balancing collection time and accuracy.
If set to 0, Tibero decides the value automatically. For detailed criteria on this decision, refer to โ1.3.2. Settings Items.โ
Applying the criteria for determining the ESTIMATE_PERCENT sampling ratio, the sampling ratios for tables with 500 million rows and 200,000 rows are as follows.
-- For a large table with 500 million rows, apply the calculation 100,000 / row count x 100 (%). In this case, estimate_percent is 100,000 / 500,000,000 * 100 = 0.02%. -- For a table with 200,000 rows (less than 1,000,000, 10%) The previous example, the DAILY_LOG table, consists of 20 partitions with 10,000 rows per partition, so the total row count is 200,000. Therefore, estimate_percent is 10%.
Note
Refer to the ESTIMATE_PERCENT in โSettings Itemsโ for the ESTIMATE_PERCENT criteria.
2. Dynamic Sampling
Dynamic sampling aims to improve performance by determining accurate selectivity and cardinality estimates for tables without statistical information. Tibero decides whether to perform dynamic sampling at parse time.
If dynamic sampling is decided, random sampling scans of the table blocks for the query are performed to determine selectivity for key single tables. Even if statistical information exists, there may be situations requiring dynamic sampling, but since it causes overhead during parsing and may be less accurate than using statistics, it is recommended to generate statistical information unless in special circumstances.
Dynamic sampling should be avoided in OLTP environments where bind variables are used and sampling overhead needs to be minimized, but it can be considered in DW environments where parse time is cheaper than execution time.
Dynamic sampling is not applied to normal queries. The optimizer chooses sampling in the following cases:
- When the cost-based optimizer (CBO) cannot make accurate predictions due to limited statistics and considers dynamic sampling can produce a better execution plan
- When sampling time is judged to be only a part of the total query execution time
- When the query execution time is considered long
Sampling is performed at the block level, starting with 16 blocks and up to a maximum of 300 blocks depending on the total number of table blocks.
3. Granularity
For large partitioned tables, query conditions often use partition key columns, and other columns usually have similar distribution across partitions.
If there is not expected to be a significant difference in physical information (row count, block count, B-level, leaf block count, etc.) per partition, collecting statistics using the GLOBAL granularity option can be a good method.
SQL> CALL DBMS_STATS.GATHER_TABLE_STATS (USER, 'DAILY_LOG', granularity=>'GLOBAL');
The GLOBAL option collects statistics at the entire table or index level rather than at the partition level.
Even if partition-level statistics are not collected, the optimizer performs dynamic sampling during execution to obtain some physical information, so significant errors do not occur. Even when collected with GLOBAL, column statistics such as histograms are gathered, which helps resolve the choice between index scans and table scans based on cardinality estimates to some extent.
However, since the table itself is large, even collecting only at the table or index level can take a long time due to sampling.
Partition options update min, max, distinct count, and histograms for partition key columns, improving the accuracy of collected statistics.
Tibero uses Row sampling for tables when sampling for statistics collection. Although it is faster than ALL granularity, it still requires scanning most blocks, so frequent collection can be burdensome.
If partition key column statistics are most important and most query conditions use equality on partition key columns, the tip parameter _OPT_BOUND_SELEC_ADJUST_DEGREE can help.
The following is an example of setting the _OPT_BOUND_SELEC_ADJUST_DEGREE (Default: 100) value.
SQL> ALTER SYSTEM SET _OPT_BOUND_SELEC_ADJUST_DEGREE=100;
The _OPT_BOUND_SELEC_ADJUST_DEGREE parameter adjusts selectivity to some extent even if the equality condition value is somewhat outside the histogram range.
For example, if the histogram column value range is 0 to 100, setting the parameter to 100 allows values between 100 to 200 and -100 to 0 to be recognized with some selectivity similar to values within the histogram range (selectivity decreases as the value moves further from the range).
- If set to 50: values between 100 to 150 and -50 to 0 are recognized
- If set to 200: values between 100 to 300 and -200 to 0 are recognized
Adjusting _OPT_BOUND_SELEC_ADJUST_DEGREE can reduce statistics collection frequency, but for partitioned tables that grow daily or hourly and can reach gigabyte or terabyte scales, collection itself can be a significant burden. Therefore, this method is recommended only for environments with many partitions but not large data volumes.
Note
For detailed explanations and examples of Granularity settings, refer to โPartition Table Statistics.โ
4. Statistics Replication
This function copies statistical information between partitions, offering the advantage that no separate statistics collection is needed even when partitions are added.
In operational systems where partitions are regularly added and large amounts of data are loaded, it is possible to collect statistics only on new partitions without directly collecting statistics for all data.
Histograms for the partition key column (or the first column in composite keys) are also adjusted, so cardinality for query conditions on newly added column values does not deviate significantly.
If the DAILY_LOG table partition is collected up to partition P20250120 as described in the example โTables with partitions added periodically,โ and partition P20250121 is added, statistics can be copied as follows. (stats_copy.sql)
SQL> CALL DBMS_STATS.COPY_TABLE_STATS (USER, 'DAILY_LOG', 'P20230120', 'P20230121'); SQL> ALTER SYSTEM FLUSH PPC;
After replicating statistics, verify as follows.
SQL> set autotrace traceonly exp planstat
SQL> SELECT * FROM DAILY_LOG WHERE LOG_DAY = '20250121';
Execution Plan
----------------------------------------------------------------------------------------------
1 PARTITION RANGE (SINGLE PART) (Cost:4240, %CPU:0, Rows:9944) (PS:20, PE:20)
2
TABLE ACCESS (FULL): DAILY_LOG (Cost:4240, %CPU:0, Rows:9944)
Predicate Information
----------------------------------------------------------------------------------------------
2 - filter: ("DAILY_LOG"."LOG_DAY" = '20250121') (0.050)
Execution Stat
----------------------------------------------------------------------------------------------
1 PARTITION RANGE (SINGLE PART) (Time:.02 ms, Rows:10000, CR:0, CU:0, Starts:1)
2
TABLE ACCESS (FULL): DAILY_LOG (Time:108.34 ms, Rows:10000, CR:10145, CU:203,
Starts:1)
SQL> set autotrace offSince statistics values stored in the DD table are copied without sampling the DAILY_LOG table, there is a significant performance advantage.
Because the physical statistics for partition P20250121 are not reflected, if there are large differences in physical information (row count, block count, etc.) between partitions, the optimizer may generate inaccurate plans. However, because queries on daily or hourly partitioned tables mainly use partition key columns in conditions and the distribution of other columns does not vary much between partitions, filling in physical statistics for the added partition and statistics for the key columns can secure stable plans.
Statistics collection requires scanning tables, so it takes a long time with large volumes; thus, periodic statistics collection on operational servers should be done carefully.
Due to the nature of large-scale environments, if cardinality estimates are significantly off, incorrect plans can cause excessive rowid scans or index joins, leading to excessive index scans. Therefore, appropriate statistics updates are essential, but frequent collection is difficult due to volume.
Using statistics replication in such cases allows updating essential statistics without heavy load on operational servers, enabling stable query plan generation and execution.
5. Automatic Statistics Collection
Whether to use automatic statistics collection should be decided based on the operational system environment. Generally, for medium to large business environment systems, it is recommended not to use automatic statistics collection. Database administrators managing such systems should prepare to collect accurate and stable statistics manually by establishing object-specific strategies.
The following example checks statistics collection registered as a job.
SQL> col next_date format a10;
SQL> col interval format a20;
SQL> col this_date format a10;
SQL> col last_date format a10;
SQL> col what format a50;
SQL> select next_date, interval,broken,failures, this_date,last_date,what
from dba_jobs
where what like '%dbms_stats%';
NEXT_DATE INTERVAL BROKEN FAILURES THIS_DATE LAST_DATE WHAT
---------- ---------------- ------ -------- ---------- ---------- ----------------------
2023/06/20 TRUNC(SYSDATE)+1 N 1 2023/06/19 /*+EVERY_MIDNIGHT_STAT_GATHERING*/
begin
dbms_stats.gather_database_stats
(gather_sys=>true, options
=>'GATHER AUTO',
method_opt=>'FOR ALL INDEXED COLU
MNS SIZE REPEAT'); end;Note
For detailed explanations of options, refer to "DBMS_JOB" in the "tbPSM Reference Guide"
The optimizer estimates cardinality based on object statistics, which greatly influences execution plans. Various configuration options are provided for efficient object statistics collection, but systems using default parameter values and automatic statistics collection cannot guarantee optimal performance.
When operating large-scale databases, it is recommended to use parameter options for each procedure in the DBMS_STATS package to reduce collection time and resource consumption by detailed statistics settings.
Manual statistics collection has the following pros and cons when collecting statistics for partitioned tables in large-scale systems:
-
Advantages
Reduces collection time and resource usage by setting statistics collection only for newly added partitions
Improves potential stability of execution plans when accessing partitioned tables with increasing data volume
-
Disadvantages
Direct statistics option settings do not guarantee accurate and complete maintenance of global statistics
Requires detailed knowledge of the system and data characteristics managed by the DBA
Detailed settings for composite partitions can cause input and management errors without sufficient knowledge of data characteristics
Settings for new partitions may not fit other systems