Document Type | Technical Information

Category | Administration

Applicable Product Version | 7FS02PS

Document Number | TADTI183

Overview

This chapter explains the cost estimation method.

Method

1. Selectivity and Cardinality

Cost estimation is determined through Selectivity, Cardinality, Cost, Data Access methods, Index and Join methods (Index, Nested loop, Sort Merge, Hash), etc.

1.1 Selectivity

Selectivity is the ratio of rows satisfying a specific condition when applied to a Row set, which is a collection of query results using Base Table, View, Join, or Group by syntax. In other words, it is the expected ratio of rows for a condition.

If a histogram exists, Selectivity is calculated using the histogram.
If there is no histogram or if bind variables are used in the condition clause, the optimizer assumes a uniform data distribution and calculates Selectivity accordingly.

If the number of histogram buckets is 1, it is also assumed that no histogram exists, but basic information is actually stored when querying the DD Table.

Conditions like A='B' act as filters that select specific numbers of rows from the Row set, so the Selectivity for such conditions shows how many rows in the Row set are filtered by the restriction.

Selectivity values range from 0.0 to 1.0. A value of 0.0 means no rows are selected from the Row set, and 1.0 means all rows are selected.

If statistical information is absent or the column type cannot be represented by a histogram, an internal default Selectivity value is used, which varies depending on the condition clause format. For example, the default value for an equality operator (A='B') is smaller than that for a range operator (A<'B') because equality operators are assumed to return fewer rows than range operators.

If statistics exist, they are used to calculate the Selectivity value. For example, in an equality operator (A='B'), Selectivity is inversely proportional to the number of distinct values in the column. If a histogram exists for the column, it is used instead of distinct values.

Below is an explanation of the Selectivity calculation formulas.

Selectivity calculation formula for equality operator (=) without histogram

Selectivity = 1/distinct value (number of unique values)

Selectivity calculation formula for range operators (>,<,between) without histogram

Selectivity = Range of requested values in condition / Total value range

Below is an example of how to check num_rows, num_distinct, low_value, high_value and calculate Selectivity for a specific column matching the condition in a user query.

SQL> SELECT num_distinct, density, low_value, high_value, last_analyzed,column_name
 FROM USER_TAB_COL_STATISTICS
 WHERE table_name='CMS_CATEGORY';

-- Calculate selectivity for each column based on query conditions.
-- When the values from the query are as below, examples of condition clauses for each column are as follows.
 num_rows    num_distinct  low_value  high_value
------------ ------------- ---------- ---------------
1,000,000    100           1          1,000

* no < 500
 Selectivity = (Comparison value - low_value) / (high_value - low_value + 1)
 = (500 - 1) / (1,000 - 1 + 1) = 0.499

* no <= 500
 Selectivity = (Comparison value - low_value) / (high_value - low_value + 1) + (1 / num_distinct)
 = (500 - 1) / (1,000 - 1 + 1) + 1 / 100 = 0.509

* no > 500
 Selectivity = 1 - (Selectivity of no <= 500)
 = 1 - 0.509 = 0.491

* no >= 500
 Selectivity = 1 - (Selectivity of no < 500)
 = 1 - 0.499 = 0.501

* no between 100 and 500
 Selectivity = (Selectivity of no >= 100) x (Selectivity of no <= 500)
 = (1 - (Selectivity of no < 100)) x (Selectivity of no <= 500)
 = (1 - (100 - 1) / (1,000 - 1 + 1)) x ((500 - 1) / (1,000 - 1 + 1) + 1 / 100) = 0.459

1.2 Cardinality

Cardinality is the predicted number of rows expected to be output after the Access phase, indicating how many rows are output by the given Where condition. A higher cardinality means less overlapping data and higher distribution, which implies lower Selectivity. All Cardinalities used in execution plans are estimates.

An increase in the number of rows increases I/O cost, and if many operations are performed per row, it negatively affects CPU cost as well.

Cardinality calculation formula

Cardinality = Base Cardinality * Selectivity = num_rows * Selectivity

Base cardinality

The total number of rows in a specific table, which can be obtained by collecting statistics. If statistics are missing or invalid, it is calculated by dividing the total number of bytes (obtained by multiplying the number of table blocks by the block size) by the estimated average row length.

1.3 Selectivity and Cardinality Calculation

Below is an example of calculating Selectivity and Cardinality.

Create table (create_table_for_histogram.sql)

SQL> @create_table_for_histogram.sql

The reason for using the PCTFREE 90 option when creating the table is to sufficiently increase the number of table blocks. (Used when row size is small, so even with much data, the blocks do not increase much)

Create indexes (create_index_for_histogram.sql)

SQL> CREATE INDEX ARMY_IDX_NAME ON ARMY (NAME);
SQL> CREATE INDEX ARMY_IDX_JOB ON ARMY (JOB);
SQL> CREATE INDEX ARMY_IDX_SAL ON ARMY (SAL);
SQL> CREATE INDEX ARMY_IDX_DEPT ON ARMY (DEPT);

The ARMY table consists of ID, NAME, JOB, SAL, and DEPT columns.

Check values uniquely assigned in queries, values distributed diversely, and unique values.
ID is a value uniquely assigned per row.

-- ID is a value uniquely assigned per row
SQL> SELECT ID, COUNT(*) FROM ARMY GROUP BY ID ORDER BY ID;

ID         COUNT(*)
---------- ----------
1          
1
2          1
3          1
4          1
5          1
6          1
7          1
8          1
9          1
10         1
...

-- NAME and SAL are evenly and diversely distributed.
SQL> SELECT NAME, COUNT(*) FROM ARMY GROUP BY NAME ORDER BY NAME;

NAME       
COUNT(*)
---------- ----------
AACIJ      1
AAEYC      1
AAIAH      1
AAIYM      1
AAJIE      1
AAJTS      1
AALEZ      1
AANNQ      1
AAOKH      1
AAOQX      1
...

SQL> SELECT SAL, COUNT(*) FROM ARMY GROUP BY SAL ORDER BY SAL;
 SAL        
COUNT(*)
---------- ----------
100               78
200              102       
300              110       
400               96        
500              101     
600               90  
700               91 
800              106
900              113 
1000             102
...

-- JOB and DEPT each consist of only 5 unique values, and in JOB, INFANTRY rows occupy more than half.
SQL>  SELECT JOB, COUNT(*) FROM ARMY GROUP BY JOB ORDER BY COUNT(*) DESC;

 JOB                 
COUNT(*)
-------------------- ----------
INFANTRY                  8000
ARCHER                    1000
CAVALRY                    500
WIZARD                     300
GENERAL                    200

SQL> SELECT DEPT, COUNT(*) FROM ARMY GROUP BY DEPT ORDER BY DEPT;

 DEPT                 COUNT(*)
-------------------- ----------
DETAIL                     
1250
ENDLESS                    2500
INFIELD                    2500
PROMISE                    2500
RETHINKING                 1250

Use 100% sampling and collect statistics with default values to ensure accuracy of statistical figures. (set_default_histogram.sql)

The default at this time is FOR ALL COLUMNS SIZE 1.

SQL> CALL DBMS_STATS.GATHER_TABLE_STATS(USER,'ARMY',estimate_percent=100);

Check how histograms are stored for each column. (view_histogram.sql)

You can query the ALL_TAB_COL_STATISTICS view, but since it does not show information for each histogram bucket, check directly through the DD Table.

@view_histogram.sql


NAME        BUCKET_CNT   NULL_CNT DISTINCT_CNT LOW_VAL                       

HIGH_VAL                                    DENSITY  AVG_COL_SIZE
---------- ------------ --------- ------------ ------------------------------ 
------------------------------------------ ---------- ------------
ID            1          0        10000 1.0000000000000000000000000000000000000E+00 
1.0000000000000000000000000000000000000E+04     .0001 2.9801
NAME          1          0         9996 AAAST ZZXVC                                       
                                      .00010004      5
JOB           1          0            5 ARCHER WIZARD                                     
                                             .2   7.67
SAL           1          0           99 1.0000000000000000000000000000000000000E+02 
9.9000000000000000000000000000000000000E+03 .01010101      2
DEPT          1          0            5 DETAIL RETHINKING .2 7.2


SQL> SELECT C.COLUMN_NAME NAME, H.HIST_HEAD_ID, H.BUCKET, H.END_POINT, H.END_POINT_ACTUAL, H.DISTINCT_CNT
FROM _DD_HIST_HEAD D, DBA_TAB_COLUMNS C, _DD_HISTOGRAM H
WHERE C.TABLE_NAME = 'ARMY'
AND C.COLUMN_ID = D.COL_NO
AND D.HIST_HEAD_ID= H.HIST_HEAD_ID;

NAME                 HIST_HEAD_ID     BUCKET  END_POINT END_POINT_ACTUAL      
                   DISTINCT_CNT
------------------- -------------- ---------- ---------- -----------------------------------
-------------------------------
ID                   26 1          10000      1.0000000000000000000000000000000000000E+04 
              10000
 NAME                 27 1          2.5432E+16 ZZXVC                                       
               9996
 JOB                  28 1          2.4569E+16 WIZARD                                     
                   5
 SAL                  29 1          9900       9.9000000000000000000000000000000000000E+03 
                 99
 DEPT                 30 1          2.3157E+16 RETHINKING               
                 5

From the above results, since the Bucket count was not specified when calling GATHER_TABLE_STATS, the default value 1 was applied, so each column has only 1 bucket.

The first query returns the histogram header (_DD_HIST_HEAD) results, showing information such as the overall minimum value, maximum value, and distinct count within the column.

The second query returns the histogram bucket details (_DD_HISTOGRAM), containing the maximum value and distinct count for each histogram bucket. (Minimum values can be inferred from the header and the maximum value of the preceding bucket)

Analyzing the results, the ID column values are unique, so the distinct count equals the total number of rows, 10,000. JOB and DEPT have a few fixed unique values, and the statistics reflect that.

Use set autotrace traceonly exp plan stat to compare predicted and actual results.

Run the query to check the plan. (view_plan_detail_for_histogram.sql)

SQL> set lines 150
SQL> set autot traceonly exp plans
SQL> SELECT * FROM ARMY WHERE ID <= 100;
SQL> set autot off

SQL ID: 1434
Plan Hash Value: 2324995388
Execution Plan
--------------------------------------------------------------------------------
1 TABLE ACCESS (ROWID): ARMY (Cost:7, %CPU:0, Rows:100)
2   
INDEX (RANGE SCAN): ARMY_PK (Cost:2, %CPU:0, Rows:100)

Predicate Information
--------------------------------------------------------------------------------
2 - access: ("ARMY"."ID" <= 100) (0.010)

Execution Stat
--------------------------------------------------------------------------------
1 TABLE ACCESS (ROWID): ARMY (Time:.26 ms, Rows:100, CR:5, CU:0, Starts:1)
2   
INDEX (RANGE SCAN): ARMY_PK (Time:.06 ms, Rows:100, CR:2, CU:0, Starts:1)

The Rows shown in the execution plan are estimates, and the Rows in Execution Stat are the actual number of rows output at the plan node. Since ID values are evenly distributed, the estimate was very accurate in this case.

Actual plan with statistical information errors

Below is a query result with a condition on the DEPT column. (view_plan_detail_for_histogram.sql)

SQL> set lines 150
SQL> set autot traceonly exp plans
SQL> SELECT * FROM ARMY WHERE DEPT = 'PROMISE';

SQL ID: 1446
Plan Hash Value: 1631011846
Execution Plan
----------------------------------------------------------------------------------
1  TABLE ACCESS (FULL): ARMY (Cost:211, %CPU:0, Rows:2000)

Predicate Information
----------------------------------------------------------------------------------
1 - filter: ("ARMY"."DEPT" = 'PROMISE') (0.200)

Execution Stat
----------------------------------------------------------------------------------
1 TABLE ACCESS (FULL): ARMY (Time:5.54 ms, Rows:2500, CR:481, CU:27, Starts:1)

When there is no histogram, the optimizer assumes values within a bucket are uniformly distributed, so it predicts the same cardinality for PROMISE and RETHINKING. Since there are 5 unique values in 10,000 rows, it estimates 10,000/5 = 2,000 rows.

However, in reality, PROMISE has 2,500 rows and RETHINKING has 1,250 rows, causing estimation errors. In this example, the optimizer's plan based on statistics is not a significant problem, but the next example can lead to performance degradation.

SQL> SELECT * FROM ARMY WHERE JOB = 'INFANTRY';
SQL ID: 1452 

Plan Hash Value: 749881390 
Execution Plan
-------------------------------------------------------------------------------- 
1 TABLE ACCESS (ROWID): ARMY (Cost:197, %CPU:0, Rows:2000) 
2   INDEX (RANGE SCAN): ARMY_IDX_JOB (Cost:6, %CPU:0, Rows:2000)

Predicate Information
-------------------------------------------------------------------------------- 
2 - access: ("ARMY"."JOB" = 'INFANTRY') (0.200) 

Execution Stat
--------------------------------------------------------------------------------
1 TABLE ACCESS (ROWID): ARMY (Time:23.1 ms, Rows:8000, CR:472, CU:0, Starts:1) 
2   INDEX (RANGE SCAN): ARMY_IDX_JOB (Time:.68 ms, Rows:8000, CR:26, CU:0, Starts:1)

According to the histogram for the JOB column, there are 5 unique values and a total of 10,000 rows, so the optimizer's estimate of 2,000 rows for the INFANTRY condition was correct.

However, in reality, there are 8,000 rows, but the optimizer predicted 2,000 and used INDEX RANGE SCAN + ROWID SCAN, resulting in 8,000 ROWID SCANs. This is an example where incorrect statistics negatively affected the execution plan, which could cause critical performance degradation in a truly large-scale environment.

Related to

Search

Welcome to Tibero GTS!

Tibero Large-Scale System Statistics Collection Guide - 8. Cost Estimation Method

Overview

Method

1. Selectivity and Cardinality

1.1 Selectivity

1.2 Cardinality

1.3 Selectivity and Cardinality Calculation

업무 외 시간 안내

Search

Welcome to Tibero GTS!

Overview

Method

1. Selectivity and Cardinality

1.1 Selectivity

1.2 Cardinality

1.3 Selectivity and Cardinality Calculation