Document Type | Technical Information
Field | Administration
Applicable Product Version | 7FS02PS
Document Number | TADTI176
Overview
The Tibero Optimizer is the core engine inside the DBMS that finds the optimal (lowest cost) processing path to execute SQL as quickly and efficiently as possible, and it establishes the execution plan based on the plan with the lowest resource usage cost among executable plans. This method is called the Cost Based Optimizer (CBO) approach.
A lower cost means a more efficient execution plan can be established, and to accurately estimate the cost of an execution plan, information about all objects used by the SQL statement and performance data of the system's I/O, CPU, and memory are required. All object information, including tables and indexes, as well as system information, are called Optimizer statistics. To maintain appropriate performance, it is necessary to decide on an efficient time and method for collecting statistics. Understanding the characteristics of optimizer statistics and methods for collecting them is the most important factor in optimizing SQL execution.
Method
1. Understanding Statistics
1.1 Efficient Statistics Collection Policy
The primary method used to optimize execution plans is the collection of statistics information. However, the most important thing is the accuracy of the collected statistics information.
Even if the statistics information is recent, if it is inaccurate, incorrect execution plans may be established, which can negatively impact database performance. Therefore, database administrators (DBAs) must establish and manage efficient statistics collection policies to ensure the statistics information is accurate and stable.
The following are items that DBAs must manage to maintain the accuracy and stability of statistics information.
- Check for the existence of old statistics collection dates
If the statistics collection date is old, performance issues may occur. This problem often occurs especially with partitions when values are inserted into newly added partitions but the statistics information is not updated.
- Accuracy of collected statistics information
Even recent statistics information, if inaccurate, can negatively affect database performance.
- Management of statistics collection policy
Statistics information must be managed accurately and stably based on an efficient statistics collection policy.
1.2 Considerations for Statistics Collection
Statistics information should be collected periodically while simultaneously improving accuracy and stability. The following factors should be considered for an efficient statistics collection policy.
- Select a time with no load for faster collection.
- Minimize the sample size collected to reduce impact. However, if accuracy and stability cannot be ensured, a full scan is recommended.
- When performing a full scan, accuracy must be high to approximate the statistics values.
- If data changes are minimal, consider stability to prevent statistics values from changing due to frequent collections.
1.3 Causes of Performance Degradation
To implement an efficient statistics collection policy, DBAs must understand the considerations for statistics collection to collect accurate and stable statistics. However, they must recognize that statistics collection can be a cause of performance degradation.
In most operational systems, after statistics collection, a sudden surge in hard parses occurs, causing library cache pin contention and resulting in system performance degradation.
System performance degradation often occurs in the following situations:
- Statistics information is changed due to statistics collection.
- When statistics change, related SQL cursors are invalidated.
- When SQL cursors are invalidated, a hard parse occurs the next time the SQL is accessed.
- During hard parse, an exclusive mode PP cache pin is acquired for the PP cache object.
- Other sessions trying to access the PP cache object under hard parse wait for the PP cache pin event.
Since performance issues may arise due to cursor invalidation caused by statistics updates, control is applied using the no_invalidate option. Performance degradation may be caused by statistics collection, and there may also be other reasons such as incorrect statistics collection options.
1.4 Statistics Collection Strategy
While statistics collection is a means to improve performance, it can also cause performance degradation. Therefore, various considerations when collecting statistics must be based on policies suited to the system operated for each business.
In other words, DBAs must establish a statistics collection strategy, and Tibero proposes the following statistics collection strategies.
1) Establishing a Statistics Collection Policy is Essential
- For objects where statistics collection is unnecessary, lock options are used to fix statistics information to prevent changes.
- For some critical programs that should not be affected by statistics information, hints are applied to fix execution plans.
- Always back up the most stable recent statistics information in case serious application performance issues occur due to statistics changes.
2) Column Histogram Collection
- Histograms incur significant costs to collect and manage. Therefore, histograms should only be collected for necessary columns, primarily those frequently used in WHERE clauses with skewed data distributions.
- When calculating the final selectivity after accessing a table, the selectivity of WHERE clause columns without indexes is also used as an input, and based on this selectivity, join order and join method with other sets are determined.
3) Balance Between Accuracy and Performance of Statistics
To collect accurate statistics, a large amount of data must be collected and analyzed, which can negatively impact database performance during collection. Conversely, collecting less data for performance considerations may result in inaccurate statistics.
4) Statistics Collection Frequency
Statistics collection frequency should be chosen based on data characteristics. For tables with frequently updated data or where partitions are added frequently at short intervals, set a short collection frequency or select appropriate collection options. Conversely, if data updates are infrequent, set a longer collection frequency.
5) Statistics Collection Level
Statistics collection levels are divided into all objects and partition-level. Global statistics for all objects target the entire table, so collecting global statistics in large-scale databases takes a lot of time and load. To shorten this, partition-level statistics collection ('PARTITION' granularity) can be considered. However, while partition-level collection shortens collection time, it may create execution plans with inaccurate Number of Distinct Values (NDV), potentially causing performance issues, so the collection level must be decided carefully.
When aggregating global statistics using partition-level statistics ('APPROX_GLOBAL AND PARTITION' granularity), approximate global-level statistics are derived from partition-level data, securing collection performance and statistics accuracy. However, it should be noted that this is still less accurate than statistics collected at the global level.
6) Sampling Rate
Increasing the sampling rate improves the accuracy of statistics but takes more time to collect. Conversely, lowering the sampling rate reduces accuracy somewhat but allows faster collection.
7) Column Data Distribution
If the data distribution of index column values is not uniform, it is important to generate histogram information for the column. If no histogram info exists, the optimizer decides whether to generate histograms and collects statistics for all columns. Usually, the optimizer assigns appropriate histogram buckets, but unnecessary histograms may be generated.
8) System Statistics
Periodic statistics collection and column histograms alone are insufficient for optimization, so system statistics features are used.
9) Automatic Statistics Collection
Automatic statistics collection can be efficient for most DB objects that are updated at regular intervals. However, it is inefficient for tables or objects that are deleted or truncated during operations and recreated, or when more than 10% of the total size is loaded in bulk.
- Alternative Statistics Collection Strategies
- Statistics Setting
- Dynamic Sampling
- Incremental Global Statistics
1.5 Cost Based Optimizer (CBO)
Tibero adopts a cost-based optimizer that quantifies the expected time to execute a plan and optimizes based on the plan with the lowest cost.
When an SQL statement is executed in Tibero, the query optimizer considers various factors to select the most efficient execution plan. The execution plan greatly affects the SQL execution time, making it a very important factor.
In CBO, even the same query can have different execution plans depending on data characteristics, and it involves complex cost calculation principles, so precise knowledge is required.
Cost measurement includes I/O cost, which evaluates the number of I/O requests, and CPU cost, which converts CPU operations into time. To perform this evaluation, the optimizer estimates the cost at each execution step based on various statistics previously collected for tables, indexes, etc., and selects the execution plan with the lowest total cost. Therefore, maintaining statistics information that accurately reflects data status is essential to produce the optimal execution plan.
- CBO Execution Steps
- The optimizer creates plans for the SQL statement based on possible access paths and hints.
- The optimizer estimates the cost of the plan based on data dictionary statistics about the storage characteristics and data distribution of tables, indexes, and partitions accessed by the SQL. The cost estimates the proportion of resources needed to execute the SQL, and the optimizer calculates the cost of access paths and join orders based on expected system resources including I/O, CPU, and memory.
- The optimizer compares the costs of plans and selects the plan with the lowest cost.
Three Functions of Cost Measurement
When measuring the cost of a plan, the following methods are used to pass the SQL with the minimum cost to the plan generator for execution plan creation.
| Category | Description |
|---|---|
| Selectivity | Represents a portion of rows and depends on query conditions. The ratio of rows satisfying a specific condition applied to a row set. Closely related to query predicates and the distribution of column values. |
| Cardinality | Estimated number of rows (= record count) |
| Cost | The amount of resources required to execute the SQL statement |
Optimization tasks use disk I/O, CPU usage, memory usage, etc., and calculations are based on selectivity and cardinality.
Note
For more details on the functions of cost measurement, refer to "Selectivity and Cardinality".