MySQL optimization

From wikinotes
Revision as of 21:37, 4 September 2022 by Will (talk | contribs)

Optimization is mostly about indexes.

  • when to create them
  • which type to create
  • manipulating your query so it uses the best index
  • breaking up your query if tables are large

Documentation

MySQL-8.0: Explain Output Docs https://dev.mysql.com/doc/refman/8.0/en/explain-output.html

Basics

MySQL generally uses only one index per table in a query (one per joined table).


Additional WHERE conditions on the same table (even on indexed columns) are checked row-by-row against the rows the chosen index returns.
You'll see this reflected in the filtered percentage of EXPLAIN.
While these conditions reduce the number of results returned, they don't reduce the number of rows examined, so a very specific query can still be as expensive as a broad one.
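You can see this behaviour directly: even with separate indexes on two columns, the planner typically picks one index and filters the rest row-by-row. A minimal sketch using SQLite's EXPLAIN QUERY PLAN for illustration (table and index names are invented; MySQL's EXPLAIN surfaces the same idea through its key and filtered columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, team_id INT, status TEXT);
    CREATE INDEX idx_team ON projects (team_id);
    CREATE INDEX idx_status ON projects (status);
""")

# Both columns are indexed, but the planner picks a single index;
# the other condition is checked row-by-row on that index's results.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM projects WHERE team_id = 1 AND status = 'wip'"
).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SEARCH projects USING INDEX idx_team (team_id=?)"
```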

In very large tables, you may consider using:

1) multi-column-indexes

CREATE INDEX projects_on_team_and_status_and_created_at
ON projects (team_id, status, created_at);

/*
On disk, the index concatenates the three values.
That aggregate value is then indexed like any single column (e.g. in a BTREE).
Column order is significant: a query can only use a leftmost prefix of the
index, so put the columns you filter on with equality first, and range
columns (like created_at) last.

ex.
  row:   team_id=1, status=wip, created_at=12am
  index: '1wip12am'
*/

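The leftmost-prefix rule can be checked empirically. A sketch using SQLite for illustration (MySQL applies the same prefix rule to multi-column indexes; all names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        name TEXT,
        team_id INT,
        status TEXT,
        created_at TEXT
    );
    CREATE INDEX projects_on_team_and_status_and_created_at
        ON projects (team_id, status, created_at);
""")

def plan(sql):
    """Return the query-plan detail strings for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Filtering on a leftmost prefix (team_id) can use the composite index...
uses_index = plan("SELECT * FROM projects WHERE team_id = 1")

# ...but filtering only on a non-leftmost column (created_at) cannot,
# and falls back to a full table scan.
full_scan = plan("SELECT * FROM projects WHERE created_at > '1970-01-01'")

print(uses_index)  # e.g. SEARCH projects USING INDEX projects_on_team_...
print(full_scan)   # e.g. SCAN projects
```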
2) broad cursor-based outer-query on an index, and a specific inner-query

# useful if a multi-column index is too slow

# outer loop -- broadly iterates over batches of projects.
#               Uses the 'id' (primary key) index to fetch each batch quickly.
#               (The bigger the table, the less specific this can be.)
def iterate_over_projects_in_batches(batch_size=1000):
    outer_query = '''
        SELECT id
        FROM projects
        WHERE id > ?
        ORDER BY id
        LIMIT ?
        '''
    last_id = 0
    while True:
        cursor.execute(outer_query, (last_id, batch_size))
        batch = [row[0] for row in cursor.fetchall()]
        if not batch:
            break
        last_id = batch[-1]  # keyset pagination: resume after the last id seen
        yield batch


# inner loop -- full scan over a small number of rows;
#               running this query over the whole table would be slow.
def select_specific_projects_from_batch(project_ids):
    placeholders = ', '.join('?' * len(project_ids))
    inner_query = '''
        SELECT *
        FROM projects
        WHERE id IN ({})
        AND status = 'active'
        AND created_by = 'foo@user.com'
        AND created_at > '1970-01-01'
        '''.format(placeholders)
    cursor.execute(inner_query, project_ids)
    for project in cursor.fetchall():
        do_thing(project)

# general flow
for project_batch in iterate_over_projects_in_batches():
    select_specific_projects_from_batch(project_batch)


On joins


Tools

Explain

Prefix a query with EXPLAIN to get a breakdown of its estimated cost.
Prefix it with EXPLAIN FORMAT=JSON for even more detailed output.

Primary

rows (smaller is better)
    The main indicator of a bad query.
    The estimated number of rows that must be examined to return a result.
    Querying on an index can reduce this.
filtered (larger is better)
    The estimated percentage of examined rows that survive the query's conditions.
    (filtered / 100) * rows estimates the rows passed on (e.g. to the next join).
    Lowered by WHERE conditions on columns not covered by the chosen index;
    100 means the index alone narrowed the search.
key
    The index the optimizer chose for this query.
possible_keys
    The indexes that could have been used for this query.
extra
    Additional information about the query.
    "Using temporary" indicates the use of a temporary table, which is slow.
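The rows and filtered estimates combine by simple arithmetic; a small worked example (the numbers are made up):

```python
# Hypothetical EXPLAIN output for one table:
rows = 120_000    # rows MySQL expects to examine
filtered = 1.25   # percent expected to survive the WHERE conditions

# Estimated rows passed on (e.g. joined with the following table):
surviving = rows * filtered / 100
print(surviving)  # 1500.0
```

A low filtered percentage with a high rows count means most of the examined rows were wasted work.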

Misc

ref
    The columns or constants compared to the index named in key.
key_len
    The length, in bytes, of the key MySQL decided to use.
    For a multi-column index this shows how many of its columns are actually used.

Strategies

Query on an Index

Create an index, and verify that it gets used.
If this is not efficient enough, try creating a multi-column index.

Limit Records

Only one index is chosen per queried/joined table.
After that index narrows the results, each additional condition is checked against every row in that result set.

  • On medium-sized tables, you can make the index more specific.
  • On large tables, iterate over batches of rows using a more widely scoped index (otherwise the initial search itself will take a long time).