Mysql optimization: Difference between revisions
(→Basics) |
(→Basics) |
||
Line 127: | Line 127: | ||
* Recurse until no JOINs left to satisfy | * Recurse until no JOINs left to satisfy | ||
* Continue iterating over outer-table until next match, repeat | * Continue iterating over outer-table until next match, repeat | ||
This is referred to as [[Algorithms: Nested Loop Join]] | |||
Ex: | Ex: |
Revision as of 02:50, 9 September 2022
Optimization is mostly about indexes.
- when to create them
- which type to create
- manipulating your query so it uses the best index
- breaking up your query if tables are large
Documentation
MySQL-8.0: Explain Output Docs https://dev.mysql.com/doc/refman/8.0/en/explain-output.html
Basics
Queries use one index per table/join.
AdditionalWHERE
statements within a table (even on indexes) involve checking each item from that initial selection result.
You'll see this represented in thefiltered
percentage ofEXPLAIN
.
While these reduce the number of results, they might make the query more expensive than a broader one.
In very large tables, you may consider using:
1) multi-column-indexes
CREATE INDEX projects_on_team_and_status_and_created_at ON projects (team_id, status, created_at); /* On disk, the index concatenates the 3x values. That one aggregate-value is then indexed just like any other column (ex. BTREE). The order is significant, you want the least-specific on the left, and the most-specific on the right (cardinality). ex. row team_id=1, status=wip, created_at=12am index: '1wip12am'2) broad cursor-based outer-query on an index, and a specific inner-query
# useful if multi-column-index is too slow # outer-loop -- broadly iterates over batches of projects # uses 'id' index here to quickly fetch results # (The bigger the table, the less specific this can be) def iterate_over_projects_in_batches(): iteration = 0 while True: outer_query = ''' SELECT id FROM projects WHERE id > ? ORDER BY id LIMIT 1000 ''' yield cursor.execute(outer_query, iteration * 1000) # inner-loop -- uses full-result scan of small number of items # doing this query from the top would be slow. def select_specific_projects_from_batch(project_ids): inner_query = ''' SELECT * FROM projects WHERE id IN (?) AND status = "active" AND created_by "foo@user.com" AND created_at > "1970-01-01" '''' for project in cursor.execute(inner_query, project_ids): do_thing(project) # general flow for project_batch in iterate_over_projects_in_batches(): select_specific_projects_from_batch(project_batch)
Queries with computed values cannot be cached
SinceCURRENT_DATE
is computed, this query cannot be cached.
pre-compute if possible.SELECT * from foo_table WHERE created_at < CURRENT_DATE # <-- cannot use index on 'created_at', function result
Indexes cannot be used inside functions or expressions
A query that refers to a column, but uses arithmetic/concatenation cannot use an index.
You should pre-compute values if possible.SELECT * FROM foo_table WHERE id = 1 + 1 # <-- cannot use 'id' index here, used in arithetic SELECT * FROM foo_table WHERE created_at > date("1970-01-01T12:00:00Z"); # <-- cannot use index on 'created_at', used in function
Each JOIN is a query
Joined tables are evaluated as separate queries.
Subqueries are also evaluated as JOINs.
More specifically, they evaluate in nested-loops for every match.
- The outer-most table/index is scanned
- For each match, scan inner table/index looking for match
- Recurse until no JOINs left to satisfy
- Continue iterating over outer-table until next match, repeat
This is referred to as Algorithms: Nested Loop Join
Ex:
SELECT * FROM outer_table INNER JOIN inner_table ON inner_table.outer_table_id = outer_table.id AND inner_table.name = "foo"in pseudocode, executes like this:
for outer_row in outer_table: if outer_row.id == id: for inner_row in inner_table: if inner_row.outer_table_id = id and inner_row.name = "foo" # yield this row
JOIN vs multi-query cost
JOIN Pros:- faster/simpler in most normal casesMulti-Query Pros:
storage_engine: - table-level-lock storage engines only lock one table at a time - tables can use different storage-engines without penalty - tables can be hosted on different physical servers without penalty table_size: - Gigantic tables may be faster with simpler queries on an index
IN is faster than multiple OR conditions
In MySQL, a binary-tree is create from the values in theIN (...)
statement,
It's time-complexity isO(log n)
vs the OR'sO(n)
.
GROUP BY using index
Tools
Explain
Typing the word
EXPLAIN
before a query, you can get a breakdown of the cost of a query.
TypeEXPLAIN FORMAT=json
before a query for even more detailed results.Primary
rows
(smaller is better)
The main indicator of a bad query.
The number of rows that will need to be searched to return a result.
Querying on an index can reduce this.filtered
(smaller is better)
The percentage ofrows
we expect the query narrows the search down to.(filtered/100) * rows
gives you an estimate of the number of rows that will be searched.
This is impacted byWHERE
conditions for columns not included in the chosen index.key
the index the optimizer chose for this query possible_keys
indexes that could have been queried on. extra
Extra information about the query. Using temporary
indicates the use of a temporary table, which is slow.Misc
ref
?? key_len
not totally certain, related to the bitwidth of the index I think. Joins
JOINs and subqueries are searched per-row in nested loops.
The order in which these nested loops are evaluated matches the order of theEXPLAIN
rows.
Strategies
Query on an Index
Create an index, and verify that it gets used.
If this is not efficient enough, try creating a multi-column index.
Limit Records
Only one index chosen per queried/joined table.
After that filters the results, each additional condition performs a full-scan over those results.
- On medium sized tables, you can make the index more specific.
- On large sized tables, you'll need to iterate over batches of rows, using a more widely scoped index (otherwise this initial search will take a long time).
Clustered Indexes
Normally, mysql's BTree stores information in the tree nodes.
A Btree with a depth of 3 will store row info at nodex at each of the 3 levels.A Clustered index is similar, except the data is exclusively stored in the leaf nodes.
These leaf nodes are kept in closer parts of memory, which makes queries like ranges and iteration cheaper.