Document Type | Technical Information
Category | Administration
Applicable Product Version | 7FS02PS
Document Number | TADTI176
Overview
This chapter explains histograms.
Method
1. Histogram
The query optimizer starts from the principle that data is evenly distributed. If the data is uniform, the query optimizer only needs to know the minimum value, maximum value, and distinct count object statistics in the column information to accurately estimate the number of rows filtered by the condition based on that column in the Where clause.
However, if the data distribution is not uniform and no additional information is provided, the query optimizer cannot accurately calculate an acceptable estimate. When the data distribution is uneven, the additional information required by the query optimizer is called a histogram. Two types of histograms are used: frequency histogram and height-balanced histogram.
The data types for which histograms can be collected are as follows:
- NUMBER
- CHAR
- VARCHAR
- DATE
- TIME
- TIMESTAMP
A histogram is a set of bars representing the distribution by range. In Tibero, information such as maximum value, minimum value, and distinct count (number of unique values) is stored as data distribution information by range in units called Buckets.
The distribution of specific values and the range of values play an important role later in predicting the selectivity and cardinality of the condition on the corresponding column. How the histogram is collected can change the predicted selectivity and cardinality values. Cardinality is one of the important factors in cost calculation, so column statistics information should also be collected strategically.
- Frequency Histogram
- A histogram that stores the frequency number for each value.
- Only up to 256 Buckets are allowed, and columns with distinct counts exceeding 256 cannot use this type.
- Since a single Bucket is assigned per column value, if a column has 40 distinct counts and 50 Buckets are requested, only 40 Buckets are created.
- This applies when the distinct count is low (less than or equal to the number of Buckets). One Bucket holds one value, accurately storing the frequency, so cardinality satisfying the condition can be easily and accurately calculated.
- END_POINT is the value assigned to the Bucket expressed as a NUMBER, and END_POINT_ACTUAL is the value expressed as a string for easier viewing.
- BUCKET is the cumulative count from the minimum value to the current value when sorted by END_POINT_ACTUAL.
- Height-Balanced Histogram
- Created when requesting fewer Buckets than the number of distinct values in the column (when distinct count exceeds the Bucket number specified in METHOD_OPT).
- If there are more than 256 distinct counts, a Height-Balanced histogram is created.
- The number of Buckets specified in METHOD_OPT are created.
- END_POINT is the largest value covered by the Bucket expressed as a NUMBER, and END_POINT_ACTUAL is expressed as a string.
- BUCKET is the Bucket number.
Whether the histogram is stored as frequency or height-balanced is determined by the number of Buckets in method_opt and the distinct count. Generally, when the distinct count is greater than the number of Buckets, a Height-Balanced histogram is used.
Each Bucket expresses the distribution of values by range and distinct count information. If the distinct count is small (less than or equal to the number of Buckets), such as gender or grade, a Frequency histogram is used for more precise representation, storing how many values exist per unique value.
When querying the Frequency histogram, the following information can be confirmed. (SYS._DD_HISTOGRAM)
| HIST_HEAD_ID | BUCKET (Count) | END_POINT (Actual Value) | DISTINCT_CNT |
|---|---|---|---|
| 123 | 20 | 111 | 1 |
| 123 | 3 | 222 | 1 |
| 123 | 100 | 333 | 1 |
In the table, for the corresponding column (based on sampled data), there are three types of values: 111, 222, and 333. Among the sampled 20+3+100=123 values, 111 appears 20 times, 222 appears 3 times, and 333 appears 100 times.
If the column name is C and the condition C=111 exists, the selectivity is 20/123, which is multiplied by the number of rows stored in the table statistics to produce the cardinality estimate.
In the case of Height-Balanced histograms, multiple unique values can exist within a Bucket, and the distribution of unique values inside one Bucket is unknown. Therefore, Frequency histograms are more accurate for "=" conditions.
-- Query to view SYS._DD_HISTOGRAM
SQL> SELECT HIST_HEAD_ID, BUCKET, END_POINT, DISTINCT_CNT
FROM SYS._DD_HISTOGRAM
WHERE HIST_HEAD_ID=(SELECT HIST_HEAD_ID
FROM SYS._DD_HIST_HEAD
WHERE (OBJ_ID, COL_NO)=(SELECT C.OBJ_ID, C.COL_NO
FROM SYS._DD_COL C, SYS._DD_OBJ O,
SYS._DD_USER U
WHERE O.OBJ_ID=C.OBJ_ID
AND O.OWNER_ID=U.USER_ID
AND U.NAME='TEST_USER'
AND O.NAME='TEST_TABLE'
AND C.NAME='TEST_COLUMN'))
/The query to collect table statistics basically includes aggregation functions calculating the maximum value, minimum value, distinct count per column, so when the number of Buckets is 1, the table-level statistics collection query alone fills the information. However, if the number of Buckets is 2 or more, separate statistics collection queries are executed per column to obtain the maximum value, distinct count, etc., for each Bucket.
Since histograms are used to check the data distribution of columns, if the values of columns are not evenly distributed and are skewed to one side, histograms should be created for those columns.
Although it is true that the more Buckets a histogram has, the more precise the distribution it can represent, collecting histograms for a large number of columns in a large-scale environment will take a considerable amount of time. Therefore, it is recommended to use histograms only for columns with uneven data distribution.
For detailed information on how to select histogram targets through plan analysis, refer to “Histogram Target Selection Method.”