This file is intended to explain some of the optimizer cost variables
in MariaDB 11.0

Background
==========

Most timings have come from running:

./check_costs.pl --rows=1000000 --socket=/tmp/mysql-dbug.sock --comment="--aria-pagecache-buffer-size=10G --innodb-buffer_pool_size=10G --key_buffer-size=1G --max-heap-table-size=10G"

The MariaDB server is started with the options:
--aria-pagecache-buffer-size=10G --innodb-buffer_pool_size=10G --key_buffer-size=1G --max-heap-table-size=10G

- All costs are changed to be milliseconds for engine operations and
  other calculations, like the WHERE clause. This is a big change from
  before the patch that added this file, where the basic cost unit was a
  disk seek and one index read, and we assumed they had the same cost.
- I am using Aria as the 'base' cost. This is because it caches all data,
  which most other engines also would do.
- MyISAM cannot be used as 'base' as it does not cache row data (which gives
  a high overhead when doing row lookups).
- Heap is in memory and a bit too special (no caching).
- InnoDB is a clustered engine where secondary indexes have to use
  the clustered index to find a row (not a common case among storage engines).

The old assumption in the optimizer has 'always' been that
1 cost = 1 seek = 1 index lookup = 1 row lookup = 0.10ms.
However, 1 seek != 1 index or row lookup, and this has not been reflected in
most other costs.
This document is the base of changing things so that 1 cost = 1ms.


Setup
=====

All timings are calculated based on results from this computer:
CPU: Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz
Memory: 256G
Disk: Samsung SSD 860 (not really relevant in this case)
Rows in tests: 1M

Each test is run 3 times
(one run to cache the data and 2 runs of which we take the average).
The assumption is that other computers will have somewhat proportional
timings. The timings are done with all data in memory (except MyISAM rows).
This is reflected in the costs for the test by setting
optimizer_disk_read_ratio=0.

Note that even on a single Linux computer without any notable tasks
the run time varies a bit from run to run (up to 4%), so the numbers in
this document cannot be repeated exactly but should be good enough for
the optimizer.

Timings for disk accesses on other systems can be changed by setting
optimizer_disk_read_cost (usec / 4096 bytes) to match the read speed.

Default values for check_costs.pl:
optimizer_disk_read_ratio= 0    Everything is cached
SCAN_LOOKUP_COST=1              Cost modifier for scan (for end user)
set @@optimizer_switch='index_condition_pushdown=off';


ROW_COPY_COST and KEY_COPY_COST
===============================

Regarding ROW_COPY_COST:
When calculating the cost of fetching a row, we have two alternative cost
parts (in addition to other costs):
scanning: rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)
rnd_pos:  rows * (ROW_LOOKUP_COST + ROW_COPY_COST)

In theory we could remove ROW_COPY_COST and just move the cost
to the two other variables. However, in the future there may be reason
to be able to modify row_copy_cost per table depending on the number and
type of fields (a table with 1000 fields should have a higher row copy cost
than a table with 1 field). Because of this, I prefer to keep ROW_COPY_COST
around for now.

Regarding KEY_COPY_COST:
When calculating the cost of fetching a key, we have as part of the cost:
keyread_time:  rows * KEY_COPY_COST + ranges * KEY_LOOKUP_COST +
               (rows-ranges) * KEY_NEXT_FIND_COST
key_scan_time: rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

We could remove KEY_COPY_COST by adding it to KEY_LOOKUP_COST and
KEY_NEXT_FIND_COST, but I prefer to keep it, with the same argument as
for ROW_COPY_COST.
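The two formulas above can be sketched as follows. This is a minimal
illustration, not MariaDB source code; the constant values are placeholders,
not the tuned defaults.

```python
# Sketch of the keyread_time / key_scan_time cost formulas above.
# Constants are placeholder values, not MariaDB's tuned defaults.
KEY_LOOKUP_COST = 0.000435     # cost of the initial key lookup
KEY_NEXT_FIND_COST = 0.000082  # cost of finding the next key
KEY_COPY_COST = 0.000015       # cost of copying a key to the upper level

def keyread_time(ranges, rows):
    return (rows * KEY_COPY_COST + ranges * KEY_LOOKUP_COST
            + (rows - ranges) * KEY_NEXT_FIND_COST)

def key_scan_time(rows):
    return rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

# Folding KEY_COPY_COST into the other two constants gives the same
# totals, which is why removing it is possible, just less flexible
# if the copy cost ever becomes per-table.
def keyread_time_folded(ranges, rows):
    return (ranges * (KEY_LOOKUP_COST + KEY_COPY_COST)
            + (rows - ranges) * (KEY_NEXT_FIND_COST + KEY_COPY_COST))
```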
The ratio KEY_COPY_COST / (KEY_NEXT_FIND_COST + KEY_COPY_COST)
is assumed to be 0.1577 (see the analysis in the appendix).

There is a relationship between the above costs in that for a clustered
index the cost is calculated as ha_keyread_time() + ROW_COPY_COST.


Preamble
========

I first tried to use the performance schema to get costs, but I was not
successful, as all timings I got for tables showed the total time of
executing the statement, not the timing for doing the actual reads.
Also, the overhead of the performance schema affected the results.

With --performance-schema=on

MariaDB [test]> select sum(1) from seq_1_to_100000000;
+-----------+
| sum(1)    |
+-----------+
| 100000000 |
+-----------+
1 row in set (4.950 sec)

Performance schema overhead: 30.1%

With:
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES';
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES';

Flush with:
CALL sys.ps_truncate_all_tables(FALSE);

Performance schema overhead now: 32.9%

Timings from:
select * from events_statements_current where thread_id=80;

MariaDB [test]> select 885402302809000-884884140290000;
+---------------------------------+
| 885402302809000-884884140290000 |
+---------------------------------+
|                    518162519000 |
+---------------------------------+
-> Need to divide by 1000000000000.0 to get seconds

As seen above, this gives the total statement time, not the time
spent accessing the tables.
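As a quick sanity check on the conversion above (performance schema timer
columns are in picoseconds), the delta from the query converts to seconds
like this:

```python
# The performance schema timer values above are in picoseconds.
start, end = 884884140290000, 885402302809000
delta = end - start                 # 518162519000, as shown in the query
seconds = delta / 1000000000000.0   # divide by 1e12 to get seconds
# seconds is ~0.518; the total statement time, not the table access time
```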
In the end, I decided to use ANALYZE to find out the cost of the table
actions.

For example, finding out table scan timing (and thus costs):

analyze format=json select sum(1) from seq_1_to_100000000;
"r_table_time_ms": 1189.239022


Calculating 'optimizer_where_cost'
==================================

To make the WHERE cost reasonable (not too low) we assume there are
2 simple conditions in the default 'WHERE clause'.

MariaDB [test]> select benchmark(100000000,l_commitDate >= '2000-01-01' and l_tax >= 0.0) from test.check_costs limit 1;
+--------------------------------------------------------------------+
| benchmark(100000000,l_commitDate >= '2000-01-01' and l_tax >= 0.0) |
+--------------------------------------------------------------------+
|                                                                  0 |
+--------------------------------------------------------------------+
1 row in set (3.198 sec)

Time of where in seconds: 3.198 / 100000000 (100,000,000)

Verification:

select sum(1) from seq_1_to_100000000 where seq>=0.0 and seq>=-1.0;
+-----------+
| sum(1)    |
+-----------+
| 100000000 |
+-----------+
1 row in set (8.564 sec)

MariaDB [test]> select sum(1) from seq_1_to_100000000;
+-----------+
| sum(1)    |
+-----------+
| 100000000 |
+-----------+
1 row in set (5.162 sec)

Time of where= (8.564-5.162)/100000000 = 3.402/100000000 (100,000,000)
(Result is good enough, as these are slightly different computations)

check_costs.pl provides these numbers when using heap tables and 1M rows:

simple where:  118.689 ms
complex where: 138.474 ms
no where:       83.699 ms

Which gives for the simple where:
(118.689-83.699)/1000 = 0.034990000000000007 ms
Which is in the same ballpark.

We use the result from the select benchmark run, as this has the least
overhead and is easiest to repeat and verify in a test.
Which gives:
optimizer_where_cost= 0.032 usec / WHERE.
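The three measurements above can be cross-checked against each other. A small
sketch, using the values copied from the runs above, with everything
converted to microseconds per evaluated WHERE clause:

```python
# Cross-check of the three 'where cost' measurements above,
# converted to microseconds per evaluated WHERE clause.
benchmark_us = 3.198 / 100_000_000 * 1e6         # benchmark() run (sec -> usec)
verify_us = (8.564 - 5.162) / 100_000_000 * 1e6  # sequence table diff
heap_us = (118.689 - 83.699) / 1_000_000 * 1e3   # check_costs.pl (ms -> usec)

# All three land around 0.032-0.035 usec per WHERE evaluation,
# supporting the chosen value of 0.032.
```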
HEAP TABLE SCAN & ROW_COPY_COST
===============================

We start with heap, as all rows are in memory and we don't have to take
disk reads into account.

select sum(l_partkey) from test.check_costs
table_scan ms: 10.02078736
rows: 1000000

Cost should be 10.02078736 (scan cost) + 32 (where cost)

cost= scan_time() * optimizer_cache_cost * SCAN_LOOKUP_COST +
      TABLE_SCAN_SETUP_COST +
      records * (ROW_COPY_COST + ROW_LOOKUP_COST + WHERE_COMPARE_COST);

=>
We are ignoring TABLE_SCAN_SETUP_COST (which is just to prefer index lookups
on small tables).
We can also ignore records * WHERE_COMPARE_COST, as we don't have that
in the above measured 'ms'.
row_costs= (ROW_COPY_COST + ROW_LOOKUP_COST)

cost= scan_time() * 1 * 1 +
      1000000.0 * (row_costs)
=>
cost= time_per_row*1000000 + row_costs * 1000000;
=>
time_per_row+row_cost= cost/1000000

Let's assume that for heap, finding the next row is 80 % of the time and
copying the row (a memcpy) to the upper level is the other 20 %.
(This is not really important; we could put everything in heap_scan_time,
but it's good to have the data split as it gives us more options to
experiment later.)

row_lookup_cost= 10.02078736/1000000*0.8 = 8.0166298880000005e-06
row_copy_cost=   10.02078736/1000000*0.2 = 2.0041574720000001e-06

Conclusion:
heap_scan_time= 8.0166e-06
row_copy_cost=  2.0042e-06

Heap doesn't support key-only reads, so key_copy_cost is not relevant for it.


HEAP INDEX SCAN
===============

select count(*) from test.check_costs_heap force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0
index_scan time: 79.7286117 ms

Index scans on heap tables can only happen with binary-tree indexes.
l_suppkey is using a binary tree.
cost= (ranges + rows + 1) * BTREE_KEY_NEXT_FIND_COST + rows * row_copy_cost =
(for a large number of rows):
rows * (BTREE_KEY_NEXT_FIND_COST + row_copy_cost)

BTREE_KEY_NEXT_FIND_COST= cost/rows - row_copy_cost =
79.7286117/1000000 - 2.334e-06= 0.0000773946117


HEAP EQ_REF
===========

select straight_join count(*) from seq_1_to_1000000,test.check_costs_heap where seq=l_linenumber
eq_ref_index_join time: 175.874165 ms, of which 12.57 ms is from seq_1_to_1000000

Note: This is 34% of the cost of an Aria table with index lookup and
      20% of an Aria table with full key+row lookup.

cost= rows * (key_lookup_cost + row_copy_cost)
key_lookup_cost= cost/rows - row_copy_cost =
(175.874165-12.57)/1000000 - 2.334e-06 = 0.00016097016500000002


HEAP EQ_REF on binary tree index
================================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_heap where seq=l_extra and l_partkey >= 0
eq_ref_join time: 241.350539 ms, of which 12.57 ms is from seq_1_to_1000000

rows * (tree_find_cost() + row_copy_cost) =

tree_find_cost()= cost/rows - row_copy_cost =

(241.350539-12.57)/1000000 - 2.334e-06= 0.000226446539

tree_find_cost() is defined as key_compare_cost * log2(table_rows)
->
key_compare_cost= 0.000226446539/log2(1000000) = 0.000011361200108882259;


SEQUENCE SCAN
=============

analyze format=json select sum(seq+1) from seq_1_to_1000000;
r_table_time_ms: 12.47830611

Note that for sequence, index and table scans are the same thing.
We need to have a row_copy/key_copy cost as these are used when doing
a key lookup for sequence. Setting these to 50% of the full cost
should be sufficient for now.
Calculating sequence_scan_cost:

When ignoring disk reads, the cost of a table scan is:
rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

The cost of a key scan is:
ranges * KEY_LOOKUP_COST + (rows - ranges) * KEY_NEXT_FIND_COST +
rows * KEY_COPY_COST;

As there is no search after the first key for sequence, we can set
KEY_LOOKUP_COST = KEY_NEXT_FIND_COST.

This gives us:

r_table_time_ms = (ROW_NEXT_FIND_COST + ROW_COPY_COST) * 1000000 =
                  (KEY_NEXT_FIND_COST + KEY_COPY_COST) * 1000000;

->
ROW_NEXT_FIND_COST= ROW_COPY_COST= KEY_NEXT_FIND_COST= KEY_COPY_COST=
12.47830611/1000000/2 = 0.0000062391530550


HEAP KEY LOOKUP
===============

We can use this code to find the timings of an index read in a table:

analyze format=json select straight_join count(*) from seq_1_to_1000000,check_costs where seq=l_orderkey

"query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 420.5083447,
    "table": {
      "table_name": "seq_1_to_1000000",
      "access_type": "index",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "8",
      "used_key_parts": ["seq"],
      "r_loops": 1,
      "rows": 1000000,
      "r_rows": 1000000,
      "r_table_time_ms": 12.47830611,
      "r_other_time_ms": 44.0671283,
      "filtered": 100,
      "r_filtered": 100,
      "using_index": true
    },
    "table": {
      "table_name": "check_costs",
      "access_type": "eq_ref",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["l_orderkey"],
      "ref": ["test.seq_1_to_1000000.seq"],
      "r_loops": 1000000,
      "rows": 1,
      "r_rows": 1,
      "r_table_time_ms": 160,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "seq_1_to_1000000.seq = check_costs.l_orderkey"
    }
  }

This gives the time for a key lookup on a hash key as:
160/1000000 - row_copy_cost =
160/1000000.0 - 2.0042e-06 = 0.00015799580000000002


ARIA TABLE SCAN
===============
(page format, all rows are cached)

table_scan ms: 107.315698

Cost is calculated as:

blocks= stats.data_file_length / IO_SIZE = 122888192/4096= 30002
engine_blocks (8192 is the block size in Aria) = 15001

cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      records * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

When all is in memory (optimizer_cache_cost= 0) we get:

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      records * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

To calculate INDEX_BLOCK_COPY_COST I added a temporary tracker in
ma_pagecache.cc::pagecache_read() and ran the same query.
I got the following data:
{counter = 17755, sum = 1890559}
Which gives the time for copying a block as:
1000.0*1890559/sys_timer_info.cycles.frequency/17755 = 3.558138826971332e-05 ms
And thus INDEX_BLOCK_COPY_COST= 0.035600

Replacing known constants (and ignoring TABLE_SCAN_SETUP_COST):
cost= 107.315698 = 15001 * 3.56e-5 + 1000000 * aria_row_copy_costs;

aria_row_copy_costs= (107.315698 - (15001 * 3.56e-5))/1000000 =
0.0001067816624

As ROW_COPY_COST/(ROW_COPY_COST + ROW_NEXT_FIND_COST)= 0.57 (see appendix):

ROW_COPY_COST=      0.0001067816624 * 0.57 = 0.000060865547560
ROW_NEXT_FIND_COST= 0.0001067816624 * 0.43 = 0.000045916114832


Aria, INDEX SCAN
================

Finding out the cost of reading X keys from an index (no row lookup) in Aria.
Query: select count(*) from test.check_costs_aria force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0
Table access time: 98.1427158 ms

blocks= index_size/IO_SIZE =
(rows * tot_key_length / INDEX_BLOCK_FILL_FACTOR) / IO_SIZE
->
1000000 * 19 / 0.75 / 4096 = 6184
engine_blocks (block_size 8192) = 6184/2 = 3092
(The range optimizer had calculated 3085)

keyread_time= blocks * avg_io_cost() * cache +
              engine_blocks * INDEX_BLOCK_COPY_COST +
              rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
            = engine_blocks * INDEX_BLOCK_COPY_COST +
              rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
            = 3092 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST + KEY_COPY_COST= (98.1427158 - 3092 * 3.56e-05)/1000000 =
0.0000980326406;

KEY_COPY_COST=      0.0000980326406 * 0.16 = 0.000015685222496
KEY_NEXT_FIND_COST= 0.0000980326406 * 0.84 = 0.000082347418104


Aria, RANGE SCAN (scan index, fetch a row for each index entry)
===============================================================

Query:
select sum(l_orderkey) from test.check_costs_aria force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0
range_scan ms: 309.7620909

cost= keyread_time + rnd_pos_time.
keyread_time is as above in the index scan, but without KEY_COPY_COST:
keyread_time= 98.1427158 - KEY_COPY_COST * 1000000=
98.1427158 - 0.000015685222496 * 1000000= 82.457493304000000
rnd_pos_time= 309.7620909 - 82.457493304000000 = 227.304597596000000

rnd_pos_time() = io_cost + engine_mem_cost +
                 rows * (ROW_LOOKUP_COST + ROW_COPY_COST) =
rows * avg_io_cost() * engine_block_size/IO_SIZE +
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)
= (When rows are in memory)
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)

This gives us:
227.304597596000000 = 1000000 * 3.56e-05 + 1000000*(0.000060865547560 + ROW_LOOKUP_COST)
->
ROW_LOOKUP_COST= (227.304597596000000 - 1000000 * 3.56e-05 - 1000000*0.000060865547560) / 1000000 = 0.0001308390500


Aria, EQ_REF with index_read
============================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_aria where seq=l_linenumber
eq_ref_index_join 499.631749 ms

According to analyze statement:

- Cost for SELECT * from seq_1_to_1000000: 12.57
  (From Last_query_cost after the above costs have been applied)
- Time from check_costs: eq_ref's: 499.631749 - 12.57 = 487.061749

cost= rows * (keyread_time(1,1) + KEY_COPY_COST)

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST;

cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
->
KEY_LOOKUP_COST= cost/rows - KEY_COPY_COST - INDEX_BLOCK_COPY_COST
KEY_LOOKUP_COST= 487.061749 / 1000000 - 0.000035600 - 0.000015685222496
KEY_LOOKUP_COST= 0.000435776526504


MyISAM, TABLE SCAN
==================

select sum(l_partkey) from test.check_costs_myisam
table_scan ms: 126.353364

check_costs.MYD: 109199788 bytes = 26660 IO_SIZE blocks
The row format for MyISAM is similar to Aria's, so we use the same
ROW_COPY_COST as for Aria.
cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

MyISAM is using the file system as a row cache.
Let's put the cost of accessing the row in ROW_NEXT_FIND_COST.
Everything is cached (by the file system) and optimizer_cache_cost= 0:

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

ROW_NEXT_FIND_COST=
(cost - engine_blocks * INDEX_BLOCK_COPY_COST - TABLE_SCAN_SETUP_COST)/rows -
ROW_COPY_COST
=
(126.353364 - 26660 * 3.56e-05 - 1)/1000000 - 0.000060865547560
ROW_NEXT_FIND_COST= 0.00006353872044


MyISAM INDEX SCAN
=================

select count(*) from test.check_costs_myisam force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0;
index_scan ms: 106.490584

blocks= index_size/IO_SIZE =
(rows * tot_key_length / INDEX_BLOCK_FILL_FACTOR) / IO_SIZE
->
1000000 * 19 / 0.75 / 4096 = 6184
As MyISAM has a block size of 4096 for this table, engine_blocks= 6184

cost= keyread_time= blocks * avg_io_cost() * cache +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST);
->
cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

Assuming INDEX_BLOCK_COPY_COST is the same as in Aria and the code for
key copying is identical to Aria's:
cost= 6184 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST= (106.490584 - 6184 * 3.56e-05)/1000000 - 0.000015685222496=
0.000090585211104


MyISAM, RANGE SCAN (scan index, fetch a row for each index entry)
=================================================================

select sum(l_orderkey) from test.check_costs_myisam force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0
time: 1202.0894 ms

cost= keyread_time + rnd_pos_time.
keyread_time is as above in MyISAM INDEX SCAN, but without KEY_COPY_COST:
keyread_time= 106.490584 - KEY_COPY_COST * 1000000=
106.490584 - 0.000015685222496 * 1000000= 90.805361504000000
rnd_pos_time= 1202.0894 - 90.805361504000000 = 1111.284038496000000

rnd_pos_time() = io_cost + engine_mem_cost +
                 rows * (ROW_LOOKUP_COST + ROW_COPY_COST) =
rows * avg_io_cost() * engine_block_size/IO_SIZE +
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)
= (When rows are in memory)
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)

This gives us:
1111.284038496000000 = 1000000 * 3.56e-05 + 1000000*(0.000060865547560 + ROW_LOOKUP_COST)
->
ROW_LOOKUP_COST= (1111.284038496000000 - 1000000 * (3.56e-05 + 0.000060865547560)) / 1000000
->
ROW_LOOKUP_COST= 0.001014818490936

As the row is never cached, we have to ensure that rnd_pos_time()
doesn't include an io cost (which would be affected by
optimizer_cache_hit_ratio). This is done by having a special
ha_myisam::rnd_pos_time() that doesn't include io cost but instead an
extra cpu cost.
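The derivation above can be replayed step by step. A small sketch using the
measured numbers from this section:

```python
# Replaying the MyISAM range scan derivation above.
KEY_COPY_COST = 0.000015685222496
INDEX_BLOCK_COPY_COST = 3.56e-05
ROW_COPY_COST = 0.000060865547560
rows = 1_000_000

# Index scan time minus the key copy part gives the pure key read time.
keyread_time = 106.490584 - KEY_COPY_COST * rows
# Whatever remains of the total range scan time is the row fetching part.
rnd_pos_time = 1202.0894 - keyread_time
# Per-row cost, minus the block copy and row copy parts.
row_lookup_cost = rnd_pos_time / rows - INDEX_BLOCK_COPY_COST - ROW_COPY_COST
```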
MyISAM, EQ_REF with index_read
==============================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_myisam where seq=l_linenumber;
eq_ref_join ms: 613.906777, of which 12.48 ms is for seq_1_to_1000000

According to analyze statement:

- Cost for SELECT * from seq_1_to_1000000: 12.48 (See sequence_scan_cost)
- Time from check_costs: eq_ref's: 613.906777 - 12.48 = 601.426777

cost= rows * (keyread_time(1) + KEY_COPY_COST)

keyread_time(1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST;

cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
->
KEY_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - KEY_COPY_COST;
601.426777 / 1000000 - 3.56e-05 - 0.000015685222496 = 0.00055014155451
KEY_LOOKUP_COST= 0.00055014155451



InnoDB, TABLE SCAN
==================

select sum(l_quantity) from check_costs_innodb;
table_scan 131.302492 ms
Note that InnoDB reported only 956356 rows instead of 1000000 in stats.records.
This will cause the optimizer to calculate the costs based on wrong
assumptions.

As InnoDB has a clustered index (whose cost is a combination of
KEY_LOOKUP_COST + ROW_COPY_COST), we have to ensure that the
relationship between KEY_COPY_COST and ROW_COPY_COST is close to the
real time of copying a key and a row.

I assume, for now, that the row format for InnoDB is not that
different from Aria's (in other words, the computation to unpack a row is
about the same), so let's use the same ROW_COPY_COST (0.000060865547560).

I am ignoring the fact that InnoDB can optimize row copying by only
copying the used fields, as the optimizer currently cannot take that
into account. (This would require a way to update ROW_COPY_COST per
table instance in the query.)

For now, let's also use the same value as Aria for
INDEX_BLOCK_COPY_COST (3.56e-05).
The number of IO_SIZE blocks in the InnoDB data file is 34728 (from gdb).
(For reference, MyISAM was using 26660 and Aria 30002 blocks.)
As InnoDB is using 16K blocks, the number of engine blocks= 34728/4= 8682

cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

As optimizer_cache_cost = 0:

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

ROW_NEXT_FIND_COST=
(cost - engine_blocks * INDEX_BLOCK_COPY_COST - TABLE_SCAN_SETUP_COST)/rows -
ROW_COPY_COST
= (Ignoring TABLE_SCAN_SETUP_COST, which is just 10 usec)
(131.302492 - 8682 * 3.56e-05)/1000000 - 0.000060865547560 =
0.00007012786523999997


InnoDB INDEX SCAN
=================

select count(*) from check_costs_innodb force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0;
index_scan 114.733037 ms
Note that InnoDB is reporting 988768 rows instead of 1000000.
(The number varies a bit between runs. On another run I got 956356 rows.)
With default costs (as of above), we get a query cost of 112.142. This can
still be improved a bit...
blocks= index_size/IO_SIZE =
(rows * tot_key_length / INDEX_BLOCK_FILL_FACTOR) / IO_SIZE
-> (tot_key_length is 17 in InnoDB, 19 in Aria)
1000000 * 17 / 0.75 / 4096 = 5533
engine_blocks= 5533/4 = 1383

(In reality we get 5293 blocks and 1323 engine blocks, because of the
difference in InnoDB's row count.)

cost= keyread_time= blocks * avg_io_cost() * cache +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST);
->
cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

Assuming INDEX_BLOCK_COPY_COST is the same as in Aria:
(It should probably be a bit higher, as the block size in InnoDB is 16384
compared to 8192 in Aria.)

cost= 1383 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST + KEY_COPY_COST= (114.733037 - 1383 * 3.56e-05)/1000000
->
KEY_NEXT_FIND_COST= (114.733037 - 1383 * 3.56e-05)/1000000 - 0.000015685222496
->
KEY_NEXT_FIND_COST= 0.000098998579704;

Setting this makes InnoDB calculate the cost as 113.077711 (with an estimate
of 988768 rows).
If we had the right number of rows in ha_key_scan_time, we would have
gotten a cost of:

Last_query_cost: 145.077711 (Including WHERE cost for 988768 rows)
(145.077711)/988768*1000000.0-32 = 114.72573444933


InnoDB RANGE SCAN
=================

select sum(l_orderkey) from check_costs_innodb force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0
range_scan 961.4857045 ms
Note that InnoDB was reporting 495340 rows instead of 1000000!
I added a patch to fix this and now InnoDB reports 990144 rows.

cost= keyread_time + rnd_pos_time.
keyread_time is as above in the index scan, but we want it without KEY_COPY_COST:
keyread_time= cost - KEY_COPY_COST * 1000000=
114.733037 - 0.000015685222496 * 1000000= 99.047814504000000
rnd_pos_time= 961.4857045 - 99.047814504000000 = 862.437889996000000

rnd_pos_time() = io_cost + engine_mem_cost +
                 rows * (ROW_LOOKUP_COST + ROW_COPY_COST) =
rows * avg_io_cost() * engine_block_size/IO_SIZE +
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)
= (When rows are in memory)
rows * (INDEX_BLOCK_COPY_COST + ROW_COPY_COST + ROW_LOOKUP_COST)

This gives us:
862.437889996000000 = 1000000 * 3.56e-05 + 1000000*(0.000060865547560 + ROW_LOOKUP_COST)
->
ROW_LOOKUP_COST= (862.437889996000000 - 1000000*(3.56e-05+0.000060865547560)) / 1000000
->
ROW_LOOKUP_COST= 0.000765972342436

Setting this makes InnoDB calculate the cost as 961.081050 (good enough).


InnoDB EQ_REF with index_read
=============================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_innodb where seq=l_linenumber
time: 854.980610 ms

Here the engine first has to do a key lookup and copy the key to the upper
level (index-only read).
According to analyze statement:

- Cost for SELECT * from seq_1_to_1000000: 12.57 (See sequence_scan_cost)
- Time from check_costs: eq_ref_join: 854.980610 ms
  This is the time for accessing both seq_1_to_1000000 and check_costs.
  Time for check_cost_innodb: 854.980610 - 12.57 = 842.410610 ms

cost= rows * (keyread_time(1,1) + KEY_COPY_COST)

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + ranges * KEY_LOOKUP_COST +
                   (rows-ranges) * KEY_NEXT_FIND_COST

As rows=1 and ranges=1:

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST

cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
->
KEY_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - KEY_COPY_COST;
842.410610 / 1000000 - 3.56e-05 - 0.000015685222496
->
KEY_LOOKUP_COST= 0.000791125387504;

After the above we have
last_query_cost= 918.986438;

The cost for check_costs_innodb =
last_query_cost - sequence_scan_cost - where_cost*2 =
918.986438 - 12.57 - 32*2 = 842.416438 (ok)


InnoDB EQ_REF with clustered index read
=======================================

select straight_join count(*) from seq_1_to_1000000,check_costs_innodb where seq=l_orderkey
eq_ref_cluster_join time: 972.290773 ms

According to analyze statement:
- Cost for SELECT * from seq_1_to_1000000: 12.57 (See sequence_scan_cost)
- Time from check_costs: eq_ref_cluster_join: 972.290773 ms
  This is the time for accessing both seq_1_to_1000000 and check_costs_innodb.
  Time for check_cost_innodb: 972.290773 - 12.57 = 959.790773

The estimated cost is 875.0160

cost= rows * (keyread_time(1,1) +
      ranges * ROW_LOOKUP_COST +
      (rows - ranges) * ROW_NEXT_FIND_COST +
      rows * ROW_COPY_COST)

As rows=1 and ranges=1:

cost= rows * (INDEX_BLOCK_COPY_COST + ROW_LOOKUP_COST + ROW_COPY_COST);
->
ROW_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - ROW_COPY_COST;
959.790773 / 1000000 - 3.56e-05 - 0.000060865547560
->
ROW_LOOKUP_COST= 0.0008633252254400001

From InnoDB RANGE SCAN we have ROW_LOOKUP_COST= 0.000765972342436.
From EQ_REF with index read we have KEY_LOOKUP_COST= 0.000791125387504,
which should in theory be identical to ROW_LOOKUP_COST.

For now we have to live with the difference (as I want to have the project
done for the next release).

The difference could come from the following things:

- InnoDB's estimation of rows in the range scan test is a bit off.
- Maybe the work to find a row from an internal key entry is a bit
  different from finding it through an external key (less
  checking/conversions).
- Different keys are used in the range scan test and this test, and they
  could have different costs.
- Maybe we should increase ROW_COPY_COST or ROW_LOOKUP_COST for InnoDB
  and adjust other costs.


Some background:
In a range scan, the cost is:
- Scanning over all keys
  - For each key, fetch the row using the rowid

For the EQ_REF case:
- Scan seq_1_to_1000000
  for each value in seq
    do an index_read() call


Archive scan cost
=================

table_scan time: 757.390280 ms
rows: 1000000
file size: 32260650 = 7878 IO_SIZE blocks

cost= scan_time() + TABLE_SCAN_SETUP_COST +
      records * (ROW_COPY_COST + ROW_LOOKUP_COST + WHERE_COMPARE_COST);

757.390280 = scan_time() + 10 + 1000000 * (0.060866+0.032000)/1000
->
scan_time()= 757.390280 - (10 + 1000000 * (0.060866+0.032000)/1000) = 654.52428

scan_time() is defined as:

cost.cpu= (blocks * DISK_READ_COST * DISK_READ_RATIO +
           blocks * ARCHIVE_DECOMPRESS_TIME);

Default values for the above:
blocks= 7878
DISK_READ_COST: 10.240000 usec
DISK_READ_RATIO= 0.20
->
ARCHIVE_DECOMPRESS_TIME= (654.52428 - (7878 * 10.240000/1000*0.2)) / 7878 =
0.081034543792841


MyRocksDB, TABLE SCAN
=====================

select sum(l_quantity) from check_costs_rocksdb;
table_scan 213.038648 ms

cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

Some defaults:
optimizer_cache_cost = 0
index_block_copy_cost= 0.000035600   (Assume same as InnoDB)
table_scan_setup_cost= 0             (Let's ignore it for now)
row_copy_cost= 0.000060865547560     (Assume same as InnoDB for now)

show table status tells us that data_length= 64699000 = 15795 4K blocks.
cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

ROW_NEXT_FIND_COST=
(cost - engine_blocks * INDEX_BLOCK_COPY_COST)/rows - ROW_COPY_COST
= (213.03868 - 15796 * 0.000035600 - 0)/1000000 - 0.000060865547560 =
0.00015161079484


MyRocks INDEX SCAN
==================

select count(*) from test.check_costs_rocksdb force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0
index_scan 266.80435 ms

Note that MyRocks returns 2M rows for the table when it only has 1M rows!

block_size= 8192
key_length= 18
compression= 0.25 (75 %)
blocks= (key_length * rows / 4) / 4096 * (block_size/4096) =
18 * 1000000 / 4 / 4096 * 2 = 2198 IO_BLOCKS (= 1094 engine_blocks)

cost= keyread_time= blocks * avg_io_cost() * DISK_READ_RATIO +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST);

As we assume that everything is in memory (DISK_READ_RATIO=0):
->
cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST);

Assuming INDEX_BLOCK_COPY_COST and KEY_COPY_COST are the same as in Aria
and InnoDB:

cost= 1094 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST + KEY_COPY_COST= (266.80435 - 1094 * 3.56e-05)/1000000
->
KEY_NEXT_FIND_COST= (266.80435 - 1094 * 3.56e-05)/1000000 - 0.000015685222496
->
KEY_NEXT_FIND_COST= 0.000251080181104


MyRocks EQ_REF with index_read
==============================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_rocksdb where seq=l_linenumber
time: 857.548991 ms

Here the engine first has to do a key lookup and copy the key to the upper
level (index-only read).

According to analyze statement:

- Cost for SELECT * from seq_1_to_1000000: 12.57 (See sequence_scan_cost)
- Time from check_costs: eq_ref_join: 857.548991 ms
  This is the time for accessing both seq_1_to_1000000 and
  check_costs_rocksdb.
  Time for check_costs_rocksdb: 857.548991 - 12.57 = 844.978991 ms

cost= rows * (keyread_time(1,1) + KEY_COPY_COST)

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + ranges * KEY_LOOKUP_COST +
                   (rows-ranges) * KEY_NEXT_FIND_COST

As rows=1 and ranges=1:

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST

cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
->
KEY_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - KEY_COPY_COST;
844.978991 / 1000000 - 3.56e-05 - 0.000015685222496 = 0.000793693768504


MyRocks EQ_REF with clustered index read
========================================

select straight_join count(*) from seq_1_to_1000000,check_costs_rocksdb where seq=l_orderkey
eq_ref_cluster_join 1613.5670 ms

According to analyze statement:
- Cost for SELECT * from seq_1_to_1000000: 12.57 (See sequence_scan_cost)
- Time from check_costs: eq_ref_cluster_join: 1613.5670 ms
  This is the time for accessing both seq_1_to_1000000 and
  check_costs_rocksdb.
  Time for check_costs_rocksdb: 1613.5670 - 12.57 = 1600.9970 ms

cost= rows * (keyread_time(1,1) +
              ranges * ROW_LOOKUP_COST +
              (rows - ranges) * ROW_NEXT_FIND_COST +
              rows * ROW_COPY_COST)

As rows=1 and ranges=1:

cost= rows * (INDEX_BLOCK_COPY_COST + ROW_LOOKUP_COST + ROW_COPY_COST);
->
ROW_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - ROW_COPY_COST;
1600.9970 / 1000000 - 3.56e-05 - 0.000060865547560 = 0.00150453145244


MyRocks Range scan
==================
select sum(l_orderkey) from test.check_costs_rocksdb force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0

The MyRocks engine estimates the number of rows, both for the table and
for the range, to be about 2M, double the real amount.
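Both EQ_REF derivations above follow the same pattern; this sketch replays them with the values from the text:

```python
# Replay of the two MyRocks EQ_REF derivations (values from the text).
rows = 1_000_000
seq_scan_cost = 12.57                  # cost of scanning seq_1_to_1000000
INDEX_BLOCK_COPY_COST = 3.56e-05
KEY_COPY_COST = 0.000015685222496
ROW_COPY_COST = 0.000060865547560

# Index-only read: per-row time minus block copy and key copy.
key_lookup_cost = ((857.548991 - seq_scan_cost) / rows
                   - INDEX_BLOCK_COPY_COST - KEY_COPY_COST)

# Clustered index read: per-row time minus block copy and row copy.
row_lookup_cost = ((1613.5670 - seq_scan_cost) / rows
                   - INDEX_BLOCK_COPY_COST - ROW_COPY_COST)
```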

The timing and costs from check_costs.pl are:

range_scan time: 1845.06126 ms cost-where: 3698.8919 cost: 3730.8919

As the cost is about double the measured time, this is as good as we can
do until the record count reported by MyRocks is corrected.

The issue with the wrongly estimated number of rows does not affect the
other results from check_costs.pl, as table scan estimates use the number
of rows from ANALYZE, not from the engine.


Appendix
========

Future improvements
===================

The current costs are quite good for tables of 1M rows (usually within
10% of the true cost for the test table).

For smaller tables the costs will be a bit on the high side, and for
bigger tables a bit on the low side for eq_ref joins (both with index
and with row lookup).

The only engine that takes the number of rows into account for key lookups
is HEAP with binary-tree indexes.

Ideas of how to fix this:

- Change KEY_LOOKUP_COST, INDEX_BLOCK_COPY_COST and ROW_LOOKUP_COST
  (for clustered index) to take into account the height of the B-tree.


Observations
============

Ratio between table scan and range scan

Queries used:
select sum(l_quantity) from check_costs_aria;
select sum(l_orderkey) from test.check_costs_aria force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0;

The test for Aria shows that the cost ratios of range_scan/table_scan are:
disk_read_ratio=0.00  341.745207/139.348286=  2.4524536097
disk_read_ratio=0.02  752.408528/145.748695=  5.1623688843
disk_read_ratio=0.20 4448.378423/203.352382= 21.8752216190

As we are using disk_read_ratio=0.02 by default, this means that for mtr
not to use a table scan instead of a range scan, we have to ensure that
the range does not cover more than 1/5 of the total rows.


Trying to understand KEY_COPY_COST
==================================

An index scan with 2 and 4 key parts on an Aria table.
The index has nullable key parts, so packed keys are used.

Query 1 "index_scan" (2 integer key parts, both key parts may have NULLs):
select count(*) from $table force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0

- Optimized build: Average 164 ms/query
- gprof build: Average 465 ms/query

[16]  51.2  0.00  0.21  3999987  handler::ha_index_next()
[15]  51.2  0.01  0.20  3999993  maria_rnext [15]
[22]  19.5  0.08  0.00  9658527  _ma_get_pack_key [22]

This means that for 3999987 read-next calls, the time spent in
_ma_get_pack_key to retrieve the returned key is:
0.08 * (3999987/9658527)

The relation of KEY_COPY_COST to KEY_NEXT_FIND_COST is thus for Aria:

0.08 * (3999987/9658527)/0.21 = 0.15777 parts of KEY_NEXT_FIND_COST

------

Query 2 "index_scan_4_parts" (4 integer key parts, 2 parts may have NULLs):
select count(*) from $table force index (long_suppkey) where l_linenumber >= 0 and l_extra >0

- Optimized build: 218 ms
- gprof build: Average 497 ms/query

Most costly functions:
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
13.44      0.61     0.61  48292742     0.00     0.00   _ma_get_pack_key
 8.59      1.00     0.39  28298101     0.00     0.00   ha_key_cmp
 7.27      1.33     0.33  19999951     0.00     0.00   _ma_put_key_in_record
 4.41      1.96     0.20  19999952     0.00     0.00   handler::ha_index_next(unsigned char*)

Call graph:
[13]   9.0  0.20  0.21  19999952  handler::ha_index_next(unsigned char*) [13]
[3]   21.6  0.16  0.82  19999960  _ma_search_next [3]
[18]   7.7  0.02  0.33  19999951  _ma_read_key_record [18]
            0.00  0.00  19887291/19999952  _ma_get_static_key [65]
[19]  18.4  0.10  0.64  19999936  Item_cond_and::val_int() [19]

-> KEY_COPY_COST = 1.33/1.96 = 0.6785 parts of the index_read_next

Total cost increase from 2 -> 4 key parts: 1.96 / 1.40 = 1.4, i.e. 40%.
This includes the additional work of having more key pages, more work in
finding the next key (if key parts are packed or possibly NULL), and
copying the key parts to the record.

I also did a quick comparison with NOT NULL keys, in which case
Aria can use fixed key
lengths. This gives a 39.4% speedup on index
scan, a small speedup for table scan (as 2 fields cannot have NULL),
but no notable speedup for anything else.


Trying to understand ROW_COPY_COST
==================================

A simple table scan on an Aria table.

query: select sum(l_quantity) from check_costs_aria

From gprof, running the above query 10 times with 1M rows in the table:

[14]  83.7  0.03  0.76   9999989  handler::ha_rnd_next()
[17]  51.6  0.49  0.00  10000010  _ma_read_block_record2 [17]
[18]  21.1  0.01  0.19    156359  pagecache_read [18]

The function that unpacks the row is _ma_read_block_record2().

Taking into account that all pages are cached:
(Note that the main cost of pagecache_read in this test is calculating the
page checksum.)

ROW_COPY_COST/ROW_NEXT_FIND_COST= 0.49/(0.76+0.3-0.20) = 0.56977 = 0.57


Reason for SCAN_SETUP_COSTS
===========================

One problem with the new, more exact cost model is that the optimizer
starts to use table scans much more for small tables (which is correct when
one looks only at cost). However, small tables are usually fully cached, so
it is still better to use an index scan in many cases.

This problem is especially notable in mtr, where most test cases use
tables with very few rows.

TABLE_SCAN_SETUP_COST is used to add a constant startup cost for
table and index scans. It is by default set to 10 usec, about 10 MyISAM
row reads.
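As a toy illustration of why a flat setup cost helps (the numbers below are hypothetical, not the real cost constants), adding it makes index lookups win for very small tables while leaving the large-table decision unchanged:

```python
# Toy model: scanning a row is cheaper per row than a key lookup, so
# without a setup cost the scan wins even for tiny tables.
SCAN_SETUP = 0.1       # hypothetical flat setup cost
SCAN_PER_ROW = 0.005   # hypothetical per-row scan cost
LOOKUP_PER_ROW = 0.02  # hypothetical per-lookup cost

def scan_cost(rows, setup):
    return setup + rows * SCAN_PER_ROW

def lookup_cost(rows):
    return rows * LOOKUP_PER_ROW

# Without a setup cost the scan wins even for a 2-row table ...
assert scan_cost(2, 0) < lookup_cost(2)
# ... with a setup cost the lookup wins for small tables ...
assert lookup_cost(2) < scan_cost(2, SCAN_SETUP)
# ... while for big tables the scan still wins.
assert scan_cost(1000, SCAN_SETUP) < lookup_cost(1000)
```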

The following cost calculation shows why this is needed:

explain select count(*) from t1, t2 where t1.p = t2.i
+------+-------------+-------+-------+---------------+---------+---------+-----------+------+-------------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref       | rows | Extra       |
+------+-------------+-------+-------+---------------+---------+---------+-----------+------+-------------+
|    1 | SIMPLE      | t1    | index | PRIMARY       | PRIMARY | 4       | NULL      |    2 | Using index |
|    1 | SIMPLE      | t2    | ref   | k1            | k1      | 5       | test.t1.p |    2 | Using index |
+------+-------------+-------+-------+---------------+---------+---------+-----------+------+-------------+

t1 has 2 rows
t2 has 4 rows

Optimizer trace shows, when using TABLE_SCAN_SETUP_COST=0:

index scan costs:
"read_cost": 0.00308962,
"read_and_compare_cost": 0.00321762

key read costs:
"rows": 2,
"cost": 0.00567934

CHOSEN:
Scan with join cache: "cost": 0.0038774,
"rows_after_scan": 2

Note that in the following, we are using costs in microseconds while
the above costs are in milliseconds.

select * from information_schema.optimizer_costs where engine="myisam"\G
                          ENGINE: MyISAM
        OPTIMIZER_DISK_READ_COST: 10.240000
 OPTIMIZER_INDEX_BLOCK_COPY_COST: 0.035600
      OPTIMIZER_KEY_COMPARE_COST: 0.008000
         OPTIMIZER_KEY_COPY_COST: 0.066660
       OPTIMIZER_KEY_LOOKUP_COST: 0.498540
    OPTIMIZER_KEY_NEXT_FIND_COST: 0.060210
       OPTIMIZER_DISK_READ_RATIO: 0.200000
OPTIMIZER_RND_POS_INTERFACE_COST: 0.000000
         OPTIMIZER_ROW_COPY_COST: 0.088630
       OPTIMIZER_ROW_LOOKUP_COST: 0.641150
    OPTIMIZER_ROW_NEXT_FIND_COST: 0.049510
    OPTIMIZER_ROWID_COMPARE_COST: 0.004000
     @@OPTIMIZER_SCAN_SETUP_COST: 10.000000
          @@OPTIMIZER_WHERE_COST: 0.032000

Checking the calculated costs:

index_scan_cost= 10.240000 * 0.2 + 0.035600 + 0.498540 +
                 4 * (0.060210 + 0.066660) = 3.08962
where_cost= 0.032000 * 4 = 0.128000
total: 3.21762

key_read_cost= 10.240000 * 0.2 + 0.035600 + 0.498540 + 0.060210 = 2.64235
key_copy_cost= 0.066660 * 2 = 0.13332
where_cost= 0.032000 * 2 = 0.06400
total: 2.64235 + 0.13332 + 0.06400 = 2.83967
This needs to be done 2 times (2 rows in t1): 5.67934

The join cache only needs 1 refill. The calculation is done in
sql_select.cc:best_access_path():

scan_with_join_cache=
scan_time + cached_combinations * ROW_COPY_COST * JOIN_CACHE_COST +
row_combinations * (ROW_COPY_COST * JOIN_CACHE_COST + WHERE_COST) =
3.21762 + 2 * 0.088630 + 2*2 * (0.088630 * 1 + 0.032000) =
3.87740

Other observations:
OPTIMIZER_KEY_NEXT_FIND_COST + OPTIMIZER_KEY_COPY_COST + OPTIMIZER_WHERE_COST=
0.060210 + 0.066660 + 0.032000 = 0.158870
OPTIMIZER_KEY_LOOKUP_COST / 0.158870 = 3.138

This means that when using index-only reads (and DISK_READ_RATIO=0),
the optimizer will prefer to use up to 3 times more keys in range or ref
than doing key lookups!
If DISK_READ_RATIO is higher, the above ratio increases.
This is one of
the reasons why we set the default value for DISK_READ_RATIO quite low
(0.02 now).

(OPTIMIZER_ROW_COPY_COST + OPTIMIZER_ROW_NEXT_FIND_COST) /
(OPTIMIZER_KEY_COPY_COST + OPTIMIZER_KEY_NEXT_FIND_COST) =
(0.088630 + 0.049510) / (0.066660 + 0.060210) = 1.08883
This means that table scans and index scans have almost the same cost.


HEAP_TEMPTABLE_CREATE_COST
==========================

I added trackers in create_tmp_table() and open_tmp_table() and ran a
simple query that creates two materialized temporary tables with a unique
index 31 times. I got the following tracking information:

(gdb) p open_tracker
$1 = {counter = 31, cycles = 302422}
(gdb) p create_tracker
$2 = {counter = 31, cycles = 1479836}

Cycles per create = (302422 + 1479836)/31 = 57492

1000.0 * 57492/sys_timer_info.cycles.frequency = 0.0249 ms
HEAP_TEMPTABLE_CREATE_COST= 0.025 ms


What to do with wrong row estimates
===================================

MyRocks can have a very bad estimate of the number of rows, both for the
table as a whole and for big ranges. ANALYZE TABLE can fix this, but we
have to consider how to keep the row estimate correct when tables are
growing over time.

Suggested fix:
- If we can assume that the data file size reported by the engine is
  somewhat correct, we could estimate the number of rows as:
  analyze_number_of_rows * current_datafile_size / analyze_datafile_size


MySQL cost structures
=====================

MySQL 8.0 server costs are stored in the class Server_cost_constants,
defined in opt_costconstants.h

It contains the following slots, with the following default values:

m_row_evaluate_cost            0.1   Cost for evaluating the query condition
                                     on a row
m_key_compare_cost             0.05  Cost for comparing two keys
m_memory_temptable_create_cost 1.0   Cost for creating an internal temporary
                                     table in memory
m_memory_temptable_row_cost    0.1   Cost for retrieving or storing a row in
                                     an internal in-memory temporary table
m_disk_temptable_create_cost   20.0  Cost for creating an internal temporary
                                     table in a disk resident storage engine
m_disk_temptable_row_cost      0.5   Cost for retrieving or storing a row in
                                     an internal disk resident temporary table

Engine cost variables:
m_memory_block_read_cost 0.25  The cost of reading a block from a main
                               memory buffer pool
m_io_block_read_cost     1.0   The cost of reading a block from an
                               IO device (disk)

-------

Some cost functions:

scan_time()= data_file_length / IO_SIZE + 2;
read_time(index, ranges, rows)= rows2double(ranges + rows);
index_only_read_time()= records / keys_per_block;

table_scan_cost()= scan_time() * page_read_cost(1.0);

index_scan_cost()= index_only_read_time(index, rows) *
                   page_read_cost_index(index, 1.0);
read_cost()= read_time() * page_read_cost(1.0);

page_read_cost()= buffer_block_read_cost(pages_in_mem) +
                  io_block_read_cost(pages_on_disk);

io_block_read_cost()= blocks * m_io_block_read_cost;
buffer_block_read_cost()= blocks * m_memory_block_read_cost;

There are also:
table_in_memory_estimate()
index_in_memory_estimate()

If the storage engine does not provide estimates for the above, the
estimates are done based on table size (not depending on how many
rows are going to be accessed in the table).
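Finally, the MyISAM cost arithmetic from the TABLE_SCAN_SETUP_COST section can be replayed with the constants from the optimizer_costs table; this is a sketch (all values in microseconds, as in that section):

```python
# Replay of the MyISAM cost calculations from the TABLE_SCAN_SETUP_COST
# section (constants from information_schema.optimizer_costs, in usec).
DISK_READ_COST = 10.24
DISK_READ_RATIO = 0.2
INDEX_BLOCK_COPY_COST = 0.0356
KEY_LOOKUP_COST = 0.49854
KEY_NEXT_FIND_COST = 0.06021
KEY_COPY_COST = 0.06666
WHERE_COST = 0.032
ROW_COPY_COST = 0.08863
JOIN_CACHE_COST = 1.0  # taken as 1 in the text

# Index scan over t2 (4 rows).
index_scan = (DISK_READ_COST * DISK_READ_RATIO + INDEX_BLOCK_COPY_COST
              + KEY_LOOKUP_COST + 4 * (KEY_NEXT_FIND_COST + KEY_COPY_COST))
read_and_compare = index_scan + 4 * WHERE_COST

# Key read: 1 lookup, 2 matching rows, done once per row in t1 (2 rows).
key_read = (DISK_READ_COST * DISK_READ_RATIO + INDEX_BLOCK_COPY_COST
            + KEY_LOOKUP_COST + KEY_NEXT_FIND_COST)
ref_cost = 2 * (key_read + 2 * KEY_COPY_COST + 2 * WHERE_COST)

# Scan with join cache: 1 refill, 2 cached rows, 2*2 row combinations.
join_cache = (read_and_compare + 2 * ROW_COPY_COST * JOIN_CACHE_COST
              + 4 * (ROW_COPY_COST * JOIN_CACHE_COST + WHERE_COST))
```

With these constants the sketch reproduces the numbers in the text: index scan 3.08962, read-and-compare 3.21762, ref total 5.67934, and scan with join cache ~3.8774, which is why the join-cache plan is chosen when TABLE_SCAN_SETUP_COST=0.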