Latest revision as of 20:13, 24 September 2022

Optimization is mostly about indexes.

when to create them
which type to create
manipulating your query so it uses the best index
breaking up your query if tables are large

Documentation

MySQL-8.0: Explain Output Docs https://dev.mysql.com/doc/refman/8.0/en/explain-output.html

MySQL-5.7: Join Types (eq, eq_ref, ..) https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types

Tutorials

SO: eq_ref vs ref https://stackoverflow.com/questions/4508055/what-does-eq-ref-and-ref-types-mean-in-mysql-explain

understanding explain https://www.taogenjia.com/2020/06/08/mysql-explain/

Basics

WARNING:

here there be dragons. It is difficult to validate that these are correct, this is an evolving set of rules.

How WHERE works

The WHERE algorithm, in order of priority will try these options

column is chosen for index (unless an index is already chosen)

column can be use a covering index Extra: Using index

all rows must be inspected Extra: Using where
Queries use one index per table/join.
Additional WHERE statements within a table (even on indexes) involve checking each item from that initial selection result.
You'll see this represented in the filtered percentage of EXPLAIN.
While these reduce the number of results, they might make the query more expensive than a broader one.

In very large tables, you may consider using:

1) multi-column-indexes
CREATE INDEX projects_on_team_and_status_and_created_at
ON projects (team_id, status, created_at);

/*
On disk, the index concatenates the 3x values.
That one aggregate-value is then indexed just like any other column (ex. BTREE).
The order is significant, you want the least-specific on the left, and the most-specific on the right (cardinality).

ex.
  row    team_id=1, status=wip, created_at=12am
  index: '1wip12am'
2) broad cursor-based outer-query on an index, and a specific inner-query
# useful if multi-column-index is too slow

# outer-loop -- broadly iterates over batches of projects
#               uses 'id' index here to quickly fetch results
#               (The bigger the table, the less specific this can be)
def iterate_over_projects_in_batches():
    iteration = 0
    while True:
        outer_query = '''
            SELECT id
            FROM projects
            WHERE id > ?
            ORDER BY id
            LIMIT 1000
            '''
        yield cursor.execute(outer_query, iteration * 1000)


# inner-loop -- uses full-result scan of small number of items
#               doing this query from the top would be slow.
def select_specific_projects_from_batch(project_ids):
    inner_query = '''
        SELECT *
        FROM projects
        WHERE id IN (?)
        AND status = "active"
        AND created_by "foo@user.com"
        AND created_at > "1970-01-01"
        ''''
     for project in cursor.execute(inner_query, project_ids):
         do_thing(project)

# general flow
for project_batch in iterate_over_projects_in_batches():
    select_specific_projects_from_batch(project_batch)
Queries with computed values cannot be cached (still true?)
NOTE It is possible this is no longer accurate. I believe this referred to the query-cache, which is deprecated.
Since CURRENT_DATE is computed, this query cannot be cached.
pre-compute if possible.
SELECT * from foo_table
WHERE created_at < CURRENT_DATE  # <-- cannot use index on 'created_at', function result
Indexes cannot be used inside functions or expressions
A query that refers to a column, but uses arithmetic/concatenation cannot use an index.
You should pre-compute values if possible.
SELECT * FROM foo_table
WHERE id = 1 + 1  # <-- cannot use 'id' index here, used in arithetic

SELECT * FROM foo_table
WHERE created_at > date("1970-01-01T12:00:00Z");  # <-- cannot use index on 'created_at', used in function
Each JOIN is a query
Joined tables are evaluated as separate queries.
Subqueries are also evaluated as JOINs.

More specifically, they evaluate in nested-loops for every match.

The outer-most table/index is scanned

For each match, scan inner table/index looking for match

Recurse until no JOINs left to satisfy

Continue iterating over outer-table until next match, repeat

This is referred to as Algorithms: Nested Loop Join
Ex:
SELECT *
FROM outer_table
INNER JOIN inner_table
ON inner_table.outer_id = outer_table.id
AND outer_table.id = 10
in pseudocode, executes like this:
for outer_row in filter(lambda row: row.id == 10, outer_rows):
    for inner_row in filter(lambda row: row.outer_id == outer_row.id, inner_rows):
        combined = outer_row.merge(inner_row)
        yield(combined)
JOIN types can heavily impact query speed

Using EXPLAIN's type, you can see how each joined table is joined to the table in the loop above it.
Most frequently, these are the join types I have encountered.
Simple:

const
WHERE conditions fully satisfy a UNIQUE index
(ex. primary key, all columns from a unique multicolumn-index)

eq_ref
WHERE/ON condition matches an index to another table's column
(ex. WHERE tbl1.t2_id = tbl2.id)

ref
A joined table ON condition matches a non-unique index on a constant in current table. It may return multiple rows.
All matched rows from each table must be compared against each other.
(ex. WHERE addresses.country = "CA")

range
WHERE matching on a range of index values.
(ex. =, >, <, BETWEEN, IN, LIKE).

index
full table scan, that is sorted in advance by the index.

ALL
full table scan.

Complex:

index_merge

When multiple RANGE conditions on indexes on the current table can be merged into one.
The chosen algorithm is listed in EXPLAIN's Extra column, and can be:

Using intersect

Using union

Using sort_union

See details here

Misc:

ref_or_null
Same as ref, but requires a second scan of the index for NULL entries.

unique_subquery

index_subquery

test
JOIN vs multi-query cost
JOIN Pros:
- faster/simpler in most normal cases
Multi-Query Pros:
storage_engine:
  - table-level-lock storage engines only lock one table at a time
  - tables can use different storage-engines without penalty
  - tables can be hosted on different physical servers without penalty
table_size:
  - Gigantic tables may be faster with simpler queries on an index
IN is faster than multiple OR conditions

In MySQL, a binary-tree is create from the values in the IN (...) statement,
It's time-complexity is O(log n) vs the OR's O(n).

NULL columns are special

I am not confident I understand this.
I understood eq_or_ref to mean a second table scan on a join.
But I've seen dramatic speedups +20s to ~0.004s with the same query when with an index used on a different column, with WHERE col IS NOT NULL
Ask in IRC maybe? I'd like to understand this better.

GROUP BY using index

When GROUP BY can use an index is heavily version specific

GROUP BY columns within the same table to allow the use of an index.

Tools

Explain

Typing the word EXPLAIN before a query, you can get a breakdown of the cost of a query.
Type EXPLAIN FORMAT=json before a query for even more detailed results.
Primary

rows (smaller is better)
The main indicator of a bad query.
Estimated number of rows that must be examined to return a result.
Querying on an index reduces this.

filtered (smaller is better)
The percentage of rows we expect the query narrows the search down to.
(filtered/100) * rows gives you an estimate of the number of rows that will be searched.
This is impacted by WHERE conditions for columns not used by the chosen index.

possible_keys valid indexes for this query.

key index chosen by optimizer for this query

type describes how the table is joined to results from it's parent's loop.
(see join type docs, and some details below)

Misc

extra Extra information about the query.
Using indexmeans an index contains all of the information required by query, so it does not need to refer back to the clustered index.
Using wheremeans index-returned rows will be filtered in a loop
Using filesort means the rows will need to be sorted (not using the index)
Using temporary indicates the use of a temporary table, which is slow.

ref ??

key_len indexes generally search a row prefix. it is possible that a query does not need to search all characters in the index. this indicates how many bytes of index are used. (you can convert this back to characters checked, ex. if it is UTF-8 (with a max of 4bytes per character) you could divide by 4 to get the number of characters into the index that will need to be searched.

Joins
JOINs and subqueries are searched per-row in nested loops.
The order in which these nested loops are evaluated matches the order of the EXPLAIN rows.

Explain format=json

Explain's JSON format exposes additional useful details.
Primary

columns used from a multicolumn-index.

Strategies

Query on an Index

Create an index, and verify that it gets used.
If this is not efficient enough, try creating a multi-column index.
Beware of additional restrictions on multicolumn index validity, see mysql indexes for details.

Limit Records

Only one index chosen per queried/joined table.
After that filters the results, each additional condition performs a full-scan over those results.

On medium sized tables, you can make the index more specific.

On large sized tables, you'll need to iterate over batches of rows, using a more widely scoped index (otherwise this initial search will take a long time).

Join Decomposition

Instead of one query with N joins, use N queries.

Clustered Indexes

TODO:

how to create a clustered index?

Normally, mysql's BTree stores information in the tree nodes.
A Btree with a depth of 3 will store row info at nodex at each of the 3 levels.
A Clustered index is similar, except the data is exclusively stored in the leaf nodes.
These leaf nodes are kept in closer parts of memory, which makes queries like ranges and iteration cheaper.

@@ Line 29: / Line 29: @@
 = Basics =
 <blockquote>
+{{ WARNING |
+here there be dragons. It is difficult to validate that these are correct, this is an evolving set of rules.
+}}
 {{ expand
@@ Line 284: / Line 287: @@
 {| class="wikitable"
 |-
-| <code>rows</code> || (smaller is better)<br>The main indicator of a bad query.<br> Estimated number of rows that must be searched to return a result.<br>Querying on an index reduces this.
+| <code>rows</code> || (smaller is better)<br>The main indicator of a bad query.<br> Estimated number of rows that must be examined to return a result.<br>Querying on an index reduces this.
 |-
 | <code>filtered</code> || (smaller is better)<br>The percentage of <code>rows</code> we expect the query narrows the search down to.<br><code>(filtered/100) * rows</code> gives you an estimate of the number of rows that will be searched.<br>This is impacted by <code>WHERE</code> conditions for columns not used by the chosen index.
@@ Line 299: / Line 302: @@
 {| class="wikitable"
 |-
-| <code>extra</code> || Extra information about the query. <code>Using temporary</code> indicates the use of a temporary table, which is slow.
+| <code>extra</code> || Extra information about the query.<br><code>Using index</code>means an index contains all of the information required by query, so it does not need to refer back to the clustered index.<br><code>Using where</code>means index-returned rows will be filtered in a loop<br><code>Using filesort</code> means the rows will need to be sorted (not using the index)<br><code>Using temporary</code> indicates the use of a temporary table, which is slow.
 |-
 | <code>ref</code> || ??
 |-
-| <code>key_len</code> || not totally certain, related to the bitwidth of the index I think.
+| <code>key_len</code> || indexes generally search a row prefix. it is possible that a query does not need to search all characters in the index. this indicates how many bytes of index are used. (you can convert this back to characters checked, ex. if it is  UTF-8 (with a max of 4bytes per character) you could divide by 4 to get the number of characters into the index that will need to be searched.
 |-
 |}
@@ Line 342: / Line 345: @@
 * On large sized tables, you'll need to iterate over batches of rows, using a more widely scoped index (otherwise this initial search will take a long time).
 </blockquote><!-- Limit Records -->
+== Join Decomposition ==
+<blockquote>
+Instead of one query with N joins, use N queries.
+</blockquote><!-- Join Decomposition -->
 == Clustered Indexes ==

Anonymous

Search

Mysql optimization: Difference between revisions

Namespaces

More

Page actions

Latest revision as of 20:13, 24 September 2022

Contents

Documentation

Tutorials

Basics

Tools

Explain

Explain format=json

Strategies

Query on an Index

Limit Records

Join Decomposition

Clustered Indexes

Navigation

Navigation

Programs

QuickRef

Operating Systems

wiki pages

Wiki tools

Wiki tools

MySQL-8.0: Explain Output Docs	https://dev.mysql.com/doc/refman/8.0/en/explain-output.html
MySQL-5.7: Join Types (eq, eq_ref, ..)	https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types

SO: eq_ref vs ref	https://stackoverflow.com/questions/4508055/what-does-eq-ref-and-ref-types-mean-in-mysql-explain
understanding explain	https://www.taogenjia.com/2020/06/08/mysql-explain/

`rows`	(smaller is better) The main indicator of a bad query. Estimated number of rows that must be examined to return a result. Querying on an index reduces this.
`filtered`	(smaller is better) The percentage of `rows` we expect the query narrows the search down to. `(filtered/100) * rows` gives you an estimate of the number of rows that will be searched. This is impacted by `WHERE` conditions for columns not used by the chosen index.
`possible_keys`	valid indexes for this query.
`key`	index chosen by optimizer for this query
`type`	describes how the table is joined to results from it's parent's loop. (see join type docs, and some details below)

`extra`	Extra information about the query. `Using index`means an index contains all of the information required by query, so it does not need to refer back to the clustered index. `Using where`means index-returned rows will be filtered in a loop `Using filesort` means the rows will need to be sorted (not using the index) `Using temporary` indicates the use of a temporary table, which is slow.
`ref`	??
`key_len`	indexes generally search a row prefix. it is possible that a query does not need to search all characters in the index. this indicates how many bytes of index are used. (you can convert this back to characters checked, ex. if it is UTF-8 (with a max of 4bytes per character) you could divide by 4 to get the number of characters into the index that will need to be searched.

Anonymous

Search

Mysql optimization: Difference between revisions

Latest revision as of 20:13, 24 September 2022

Documentation

Tutorials

Basics

Tools

Explain

Explain format=json

Strategies

Query on an Index

Limit Records

Join Decomposition

Clustered Indexes

Navigation

Wiki tools

Page tools