Document Type | Technical Information
Category | Administration
Applicable Product Version | 7FS02PS
Document Number | TADTI183
Overview
This chapter explains the cost estimation method.
Method
1. Selectivity and Cardinality
Cost estimation is determined through Selectivity, Cardinality, Cost, Data Access methods, Index and Join methods (Index, Nested loop, Sort Merge, Hash), etc.
1.1 Selectivity
Selectivity is the ratio of rows satisfying a specific condition when applied to a Row set, which is a collection of query results using Base Table, View, Join, or Group by syntax. In other words, it is the expected ratio of rows for a condition.
- If a histogram exists, Selectivity is calculated using the histogram.
- If there is no histogram or if bind variables are used in the condition clause, the optimizer assumes a uniform data distribution and calculates Selectivity accordingly.
If the number of histogram buckets is 1, it is also assumed that no histogram exists, but basic information is actually stored when querying the DD Table.
Conditions like A='B' act as filters that select specific numbers of rows from the Row set, so the Selectivity for such conditions shows how many rows in the Row set are filtered by the restriction.
Selectivity values range from 0.0 to 1.0. A value of 0.0 means no rows are selected from the Row set, and 1.0 means all rows are selected.
If statistical information is absent or the column type cannot be represented by a histogram, an internal default Selectivity value is used, which varies depending on the condition clause format. For example, the default value for an equality operator (A='B') is smaller than that for a range operator (A<'B') because equality operators are assumed to return fewer rows than range operators.
If statistics exist, they are used to calculate the Selectivity value. For example, in an equality operator (A='B'), Selectivity is inversely proportional to the number of distinct values in the column. If a histogram exists for the column, it is used instead of distinct values.
Below is an explanation of the Selectivity calculation formulas.
- Selectivity calculation formula for equality operator (=) without histogram
Selectivity = 1/distinct value (number of unique values)
- Selectivity calculation formula for range operators (>,<,between) without histogram
Selectivity = Range of requested values in condition / Total value range
Below is an example of how to check num_rows, num_distinct, low_value, high_value and calculate Selectivity for a specific column matching the condition in a user query.
SQL> SELECT num_distinct, density, low_value, high_value, last_analyzed,column_name FROM USER_TAB_COL_STATISTICS WHERE table_name='CMS_CATEGORY'; -- Calculate selectivity for each column based on query conditions. -- When the values from the query are as below, examples of condition clauses for each column are as follows. num_rows num_distinct low_value high_value ------------ ------------- ---------- --------------- 1,000,000 100 1 1,000 * no < 500 Selectivity = (Comparison value - low_value) / (high_value - low_value + 1) = (500 - 1) / (1,000 - 1 + 1) = 0.499 * no <= 500 Selectivity = (Comparison value - low_value) / (high_value - low_value + 1) + (1 / num_distinct) = (500 - 1) / (1,000 - 1 + 1) + 1 / 100 = 0.509 * no > 500 Selectivity = 1 - (Selectivity of no <= 500) = 1 - 0.509 = 0.491 * no >= 500 Selectivity = 1 - (Selectivity of no < 500) = 1 - 0.499 = 0.501 * no between 100 and 500 Selectivity = (Selectivity of no >= 100) x (Selectivity of no <= 500) = (1 - (Selectivity of no < 100)) x (Selectivity of no <= 500) = (1 - (100 - 1) / (1,000 - 1 + 1)) x ((500 - 1) / (1,000 - 1 + 1) + 1 / 100) = 0.459
1.2 Cardinality
Cardinality is the predicted number of rows expected to be output after the Access phase, indicating how many rows are output by the given Where condition. A higher cardinality means less overlapping data and higher distribution, which implies lower Selectivity. All Cardinalities used in execution plans are estimates.
An increase in the number of rows increases I/O cost, and if many operations are performed per row, it negatively affects CPU cost as well.
- Cardinality calculation formula
Cardinality = Base Cardinality * Selectivity = num_rows * Selectivity
- Base cardinality
The total number of rows in a specific table, which can be obtained by collecting statistics. If statistics are missing or invalid, it is calculated by dividing the total number of bytes (obtained by multiplying the number of table blocks by the block size) by the estimated average row length.
1.3 Selectivity and Cardinality Calculation
Below is an example of calculating Selectivity and Cardinality.
- Create table (create_table_for_histogram.sql)
SQL> @create_table_for_histogram.sql
The reason for using the PCTFREE 90 option when creating the table is to sufficiently increase the number of table blocks. (Used when row size is small, so even with much data, the blocks do not increase much)
- Create indexes (create_index_for_histogram.sql)
SQL> CREATE INDEX ARMY_IDX_NAME ON ARMY (NAME); SQL> CREATE INDEX ARMY_IDX_JOB ON ARMY (JOB); SQL> CREATE INDEX ARMY_IDX_SAL ON ARMY (SAL); SQL> CREATE INDEX ARMY_IDX_DEPT ON ARMY (DEPT);
The ARMY table consists of ID, NAME, JOB, SAL, and DEPT columns.
- Check values uniquely assigned in queries, values distributed diversely, and unique values.
- ID is a value uniquely assigned per row.
-- ID is a value uniquely assigned per row SQL> SELECT ID, COUNT(*) FROM ARMY GROUP BY ID ORDER BY ID; ID COUNT(*) ---------- ---------- 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 1 10 1 ... -- NAME and SAL are evenly and diversely distributed. SQL> SELECT NAME, COUNT(*) FROM ARMY GROUP BY NAME ORDER BY NAME; NAME COUNT(*) ---------- ---------- AACIJ 1 AAEYC 1 AAIAH 1 AAIYM 1 AAJIE 1 AAJTS 1 AALEZ 1 AANNQ 1 AAOKH 1 AAOQX 1 ... SQL> SELECT SAL, COUNT(*) FROM ARMY GROUP BY SAL ORDER BY SAL; SAL COUNT(*) ---------- ---------- 100 78 200 102 300 110 400 96 500 101 600 90 700 91 800 106 900 113 1000 102 ... -- JOB and DEPT each consist of only 5 unique values, and in JOB, INFANTRY rows occupy more than half. SQL> SELECT JOB, COUNT(*) FROM ARMY GROUP BY JOB ORDER BY COUNT(*) DESC; JOB COUNT(*) -------------------- ---------- INFANTRY 8000 ARCHER 1000 CAVALRY 500 WIZARD 300 GENERAL 200 SQL> SELECT DEPT, COUNT(*) FROM ARMY GROUP BY DEPT ORDER BY DEPT; DEPT COUNT(*) -------------------- ---------- DETAIL 1250 ENDLESS 2500 INFIELD 2500 PROMISE 2500 RETHINKING 1250
- Use 100% sampling and collect statistics with default values to ensure accuracy of statistical figures. (set_default_histogram.sql)
The default at this time is FOR ALL COLUMNS SIZE 1.
SQL> CALL DBMS_STATS.GATHER_TABLE_STATS(USER,'ARMY',estimate_percent=100);
- Check how histograms are stored for each column. (view_histogram.sql)
You can query the ALL_TAB_COL_STATISTICS view, but since it does not show information for each histogram bucket, check directly through the DD Table.
@view_histogram.sql
NAME BUCKET_CNT NULL_CNT DISTINCT_CNT LOW_VAL
HIGH_VAL DENSITY AVG_COL_SIZE
---------- ------------ --------- ------------ ------------------------------
------------------------------------------ ---------- ------------
ID 1 0 10000 1.0000000000000000000000000000000000000E+00
1.0000000000000000000000000000000000000E+04 .0001 2.9801
NAME 1 0 9996 AAAST ZZXVC
.00010004 5
JOB 1 0 5 ARCHER WIZARD
.2 7.67
SAL 1 0 99 1.0000000000000000000000000000000000000E+02
9.9000000000000000000000000000000000000E+03 .01010101 2
DEPT 1 0 5 DETAIL RETHINKING .2 7.2
SQL> SELECT C.COLUMN_NAME NAME, H.HIST_HEAD_ID, H.BUCKET, H.END_POINT, H.END_POINT_ACTUAL, H.DISTINCT_CNT
FROM _DD_HIST_HEAD D, DBA_TAB_COLUMNS C, _DD_HISTOGRAM H
WHERE C.TABLE_NAME = 'ARMY'
AND C.COLUMN_ID = D.COL_NO
AND D.HIST_HEAD_ID= H.HIST_HEAD_ID;
NAME HIST_HEAD_ID BUCKET END_POINT END_POINT_ACTUAL
DISTINCT_CNT
------------------- -------------- ---------- ---------- -----------------------------------
-------------------------------
ID 26 1 10000 1.0000000000000000000000000000000000000E+04
10000
NAME 27 1 2.5432E+16 ZZXVC
9996
JOB 28 1 2.4569E+16 WIZARD
5
SAL 29 1 9900 9.9000000000000000000000000000000000000E+03
99
DEPT 30 1 2.3157E+16 RETHINKING
5
From the above results, since the Bucket count was not specified when calling GATHER_TABLE_STATS, the default value 1 was applied, so each column has only 1 bucket.
The first query returns the histogram header (_DD_HIST_HEAD) results, showing information such as the overall minimum value, maximum value, and distinct count within the column.
The second query returns the histogram bucket details (_DD_HISTOGRAM), containing the maximum value and distinct count for each histogram bucket. (Minimum values can be inferred from the header and the maximum value of the preceding bucket)
Analyzing the results, the ID column values are unique, so the distinct count equals the total number of rows, 10,000. JOB and DEPT have a few fixed unique values, and the statistics reflect that.
- Use set autotrace traceonly exp plan stat to compare predicted and actual results.
Run the query to check the plan. (view_plan_detail_for_histogram.sql)
SQL> set lines 150
SQL> set autot traceonly exp plans
SQL> SELECT * FROM ARMY WHERE ID <= 100;
SQL> set autot off
SQL ID: 1434
Plan Hash Value: 2324995388
Execution Plan
--------------------------------------------------------------------------------
1 TABLE ACCESS (ROWID): ARMY (Cost:7, %CPU:0, Rows:100)
2
INDEX (RANGE SCAN): ARMY_PK (Cost:2, %CPU:0, Rows:100)
Predicate Information
--------------------------------------------------------------------------------
2 - access: ("ARMY"."ID" <= 100) (0.010)
Execution Stat
--------------------------------------------------------------------------------
1 TABLE ACCESS (ROWID): ARMY (Time:.26 ms, Rows:100, CR:5, CU:0, Starts:1)
2
INDEX (RANGE SCAN): ARMY_PK (Time:.06 ms, Rows:100, CR:2, CU:0, Starts:1)The Rows shown in the execution plan are estimates, and the Rows in Execution Stat are the actual number of rows output at the plan node. Since ID values are evenly distributed, the estimate was very accurate in this case.
- Actual plan with statistical information errors
Below is a query result with a condition on the DEPT column. (view_plan_detail_for_histogram.sql)
SQL> set lines 150
SQL> set autot traceonly exp plans
SQL> SELECT * FROM ARMY WHERE DEPT = 'PROMISE';
SQL ID: 1446
Plan Hash Value: 1631011846
Execution Plan
----------------------------------------------------------------------------------
1 TABLE ACCESS (FULL): ARMY (Cost:211, %CPU:0, Rows:2000)
Predicate Information
----------------------------------------------------------------------------------
1 - filter: ("ARMY"."DEPT" = 'PROMISE') (0.200)
Execution Stat
----------------------------------------------------------------------------------
1 TABLE ACCESS (FULL): ARMY (Time:5.54 ms, Rows:2500, CR:481, CU:27, Starts:1)When there is no histogram, the optimizer assumes values within a bucket are uniformly distributed, so it predicts the same cardinality for PROMISE and RETHINKING. Since there are 5 unique values in 10,000 rows, it estimates 10,000/5 = 2,000 rows.
However, in reality, PROMISE has 2,500 rows and RETHINKING has 1,250 rows, causing estimation errors. In this example, the optimizer's plan based on statistics is not a significant problem, but the next example can lead to performance degradation.
SQL> SELECT * FROM ARMY WHERE JOB = 'INFANTRY';
SQL ID: 1452
Plan Hash Value: 749881390
Execution Plan
--------------------------------------------------------------------------------
1 TABLE ACCESS (ROWID): ARMY (Cost:197, %CPU:0, Rows:2000)
2 INDEX (RANGE SCAN): ARMY_IDX_JOB (Cost:6, %CPU:0, Rows:2000)
Predicate Information
--------------------------------------------------------------------------------
2 - access: ("ARMY"."JOB" = 'INFANTRY') (0.200)
Execution Stat
--------------------------------------------------------------------------------
1 TABLE ACCESS (ROWID): ARMY (Time:23.1 ms, Rows:8000, CR:472, CU:0, Starts:1)
2 INDEX (RANGE SCAN): ARMY_IDX_JOB (Time:.68 ms, Rows:8000, CR:26, CU:0, Starts:1)According to the histogram for the JOB column, there are 5 unique values and a total of 10,000 rows, so the optimizer's estimate of 2,000 rows for the INFANTRY condition was correct.
However, in reality, there are 8,000 rows, but the optimizer predicted 2,000 and used INDEX RANGE SCAN + ROWID SCAN, resulting in 8,000 ROWID SCANs. This is an example where incorrect statistics negatively affected the execution plan, which could cause critical performance degradation in a truly large-scale environment.