Whether you are interviewing or working day-to-day, you are likely to run into SQL optimization: it is a hot topic that everyone pays attention to.
Medium Link: 8 Essential SQL Optimization Tips You Should Know | by Wesley Wei | Aug, 2024 | Programmer’s Career
Author: Wesley Wei – Medium
Preface
If one day you are responsible for an online interface that turns out to have performance issues, you will probably reach for SQL optimization first, because changing a SQL statement costs far less than changing code.
Then, how can we optimize the SQL statement?
This article shares 8 Essential SQL Optimization Tips, hoping to help you.
1 Avoid Using SELECT *
Many times, when writing a SQL statement, we like to use SELECT * to retrieve all columns at once, simply for convenience.
Counterexample:
SELECT * FROM user WHERE id = 1;
In actual business scenarios, we may only need one or two of those columns. Retrieving data we never use wastes database resources such as memory and CPU.
Moreover, the excess data increases the time spent transmitting results over network I/O.
Additionally, SELECT * can never be satisfied by a covering index: even when an index matches the WHERE clause, the server must go back to the table row for every hit, which can significantly reduce the performance of the SQL statement.
Example:
SELECT name, age FROM user WHERE id = 1;
When querying the database, only retrieve the columns that are actually needed. There is no need to retrieve unnecessary data.
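To make the difference concrete, here is a minimal sketch using Python's stdlib sqlite3 module (the `user` table and its columns are illustrative assumptions, not from the original article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, email TEXT)")
conn.execute("INSERT INTO user VALUES (1, 'Alice', 30, 'alice@example.com')")

# SELECT * drags every column across the wire...
wide = conn.execute("SELECT * FROM user WHERE id = 1").fetchone()
# ...while naming columns returns only what the caller actually needs.
narrow = conn.execute("SELECT name, age FROM user WHERE id = 1").fetchone()

print(len(wide), len(narrow))  # 4 2
```

The narrower result set means less memory on the server, less data on the network, and a better chance of the query being served entirely from an index.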
2 Use UNION ALL Instead of UNION
We all know that the UNION keyword in a SQL statement returns deduplicated results, whereas UNION ALL returns all rows, including duplicate records.
Counterexample:
(SELECT * FROM user WHERE id = 1) UNION (SELECT * FROM user WHERE id = 2);
The process of removing duplicates requires traversing, sorting, and comparing, which is more time-consuming and CPU-intensive.
Therefore, whenever UNION ALL is sufficient, it is best not to use UNION.
Example:
(SELECT * FROM user WHERE id = 1) UNION ALL (SELECT * FROM user WHERE id = 2);
Unless the business genuinely cannot tolerate duplicate rows, we should use UNION ALL.
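A small sqlite3 sketch makes the deduplication cost visible (the `user` table is an illustrative assumption); UNION must sort and compare rows to drop duplicates, while UNION ALL simply concatenates the two result sets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# UNION deduplicates, so two identical branches collapse to one row.
union_rows = conn.execute(
    "SELECT name FROM user WHERE id = 1 UNION SELECT name FROM user WHERE id = 1"
).fetchall()
# UNION ALL keeps both copies and skips the dedup work entirely.
union_all_rows = conn.execute(
    "SELECT name FROM user WHERE id = 1 UNION ALL SELECT name FROM user WHERE id = 1"
).fetchall()

print(len(union_rows), len(union_all_rows))  # 1 2
```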
3 Driving a Large Table with a Small Table
Driving a large table with a small table means using the data set from the small table to drive the data set from the large table.
For example, suppose we have two tables: order and user. The order table has 10,000 rows of data, while the user table has only 100 rows.
If we want to query all the orders placed by valid users, we can use the in keyword:
select * from order where user_id in (select id from user where status = 1);
Or we can use the exists keyword:
select * from order where exists (select 1 from user where order.user_id = user.id and status = 1);
In this scenario, using the in keyword is more suitable.
Why?
Because when the SQL statement contains the in keyword, the subquery inside in executes first, and then the outer query runs. If the data set inside in is small, the query will be faster.
On the other hand, when the SQL statement contains the exists keyword, the left side of exists (the main query) executes first, and each of its rows is then matched against the right side; rows with no match are filtered out.
In this requirement, the order table has 10,000 rows while the user table has only 100. The order table is large and the user table is small. With the order table on the left side, using the in keyword will be more efficient.
In summary:
- in is suitable for driving a large table with a small table.
- exists is suitable for driving a small table with a large table.
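Both forms can be tried side by side in a sqlite3 sketch (table contents are illustrative assumptions; `order` must be quoted because it is a reserved word):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, status INTEGER)")
conn.execute('CREATE TABLE "order" (id INTEGER PRIMARY KEY, user_id INTEGER)')
conn.executemany("INSERT INTO user VALUES (?, ?)", [(100, 1), (101, 0)])
conn.executemany('INSERT INTO "order" VALUES (?, ?)', [(1, 100), (2, 101), (3, 100)])

# in: the user subquery (small table) runs first and drives the order scan.
in_rows = conn.execute(
    'SELECT id FROM "order" WHERE user_id IN (SELECT id FROM user WHERE status = 1)'
).fetchall()

# exists: the outer order query (large table) runs first, probing user per row.
exists_rows = conn.execute(
    'SELECT id FROM "order" o WHERE EXISTS '
    "(SELECT 1 FROM user u WHERE o.user_id = u.id AND u.status = 1)"
).fetchall()

print(sorted(in_rows), sorted(exists_rows))
```

Both return the same orders (those of the valid user 100); which one is faster depends on which table is smaller, as described above.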
4 Batch Operations
If you have a batch of data that has been processed by business logic and needs to be inserted into the database, how do you handle it?
Anti-pattern:
for (Order order : list) { orderMapper.insert(order); }
Inserting data one by one in a loop.
insert into order(id, code, user_id) values (123, '001', 100);
This operation needs to request the database multiple times to complete the batch insertion.
However, every remote database request carries some overhead; if our code needs many round trips to complete a single business function, that overhead adds up quickly.
So how do we optimize?
Best practice:
orderMapper.insertBatch(list);
Provide a batch insertion method.
insert into order(id, code, user_id) values (123, '001', 100), (124, '002', 100), (125, '003', 101);
This way only needs to request the database once, and the SQL performance will be improved. The more data there is, the bigger the improvement will be.
However, note that it is not recommended to insert too much data at one time; if a single batch is too large, the database will respond slowly. Control the batch size, keeping each batch to no more than about 500 rows, and split anything larger into multiple batches.
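The chunked batch insert can be sketched with sqlite3 and `executemany` (the `order` table and the 500-row cap follow the text; the sample data is an assumption):

```python
import sqlite3

BATCH_SIZE = 500  # cap each multi-row INSERT at ~500 rows, per the advice above

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "order" (id INTEGER PRIMARY KEY, code TEXT, user_id INTEGER)')

orders = [(i, f"{i:03d}", 100 + i % 10) for i in range(1, 1201)]

# One multi-row INSERT per chunk instead of one round trip per row.
for start in range(0, len(orders), BATCH_SIZE):
    chunk = orders[start:start + BATCH_SIZE]
    conn.executemany('INSERT INTO "order" (id, code, user_id) VALUES (?, ?, ?)', chunk)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM "order"').fetchone()[0]
print(count)  # 1200
```

Here 1,200 rows land in three round trips (500 + 500 + 200) instead of 1,200.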
5 Using Limit
Sometimes, we only need the first row of a result set, such as querying a user's first order to see when they made their first purchase.
Anti-pattern:
select id, create_date from order where user_id = 123 order by create_date asc;
Query all orders for the specified user, sort them by creation date, and then get the first element in code to obtain the first order’s data.
List<Order> list = orderMapper.getOrderList();
Order order = list.get(0);
Although this approach does not have functional issues, it is very inefficient and wastes resources.
Best practice:
select id, create_date from order where user_id = 123 order by create_date asc limit 1;
Use limit 1 to return only the first row that meets the condition.
Additionally, when deleting or modifying data, it is also a good idea to append limit to the SQL statement to prevent accidental deletion or modification of unrelated rows.
For example:
update order set status = 0, edit_time = now(3) where id >= 100 and id < 200 limit 100;
This way, even if there is an error and the ID is incorrect, it will not affect too many records.
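The SELECT half of this tip can be sketched with sqlite3 (sample data is an assumption; note that `UPDATE ... LIMIT` is a MySQL extension and is not shown here because standard SQLite builds do not support it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "order" (id INTEGER PRIMARY KEY, user_id INTEGER, create_date TEXT)')
conn.executemany(
    'INSERT INTO "order" VALUES (?, ?, ?)',
    [(1, 123, "2024-03-01"), (2, 123, "2024-01-15"), (3, 123, "2024-02-10")],
)

# limit 1 lets the engine stop as soon as the first match is produced,
# instead of materializing the whole sorted result set in the client.
first = conn.execute(
    'SELECT id, create_date FROM "order" WHERE user_id = 123 '
    "ORDER BY create_date ASC LIMIT 1"
).fetchone()
print(first)  # (2, '2024-01-15')
```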
6 Too Many Values in IN
When we build a batch query interface, we usually use the IN keyword to filter data, for example querying user information for a given list of IDs.
The SQL statement is as follows:
select id, name from category where id in (1, 2, 3…100000000);
If we don’t limit anything, this query may take a very long time and even cause the interface to timeout.
What can we do?
We can use limit in SQL to cap the amount of data:
select id, name from category where id in (1, 2, 3…100) limit 500;
However, we often need to add limits in our business code. The pseudo-code is as follows:
public List<Category> getCategory(List<Long> ids) {
    if (ids.size() > 500) {
        throw new IllegalArgumentException("ids must not exceed 500");
    }
    return categoryMapper.getCategoryList(ids);
}
Another approach is to divide the IDs into batches and use multiple threads to query the data. Each batch only queries 500 records, and then combines the query results to return.
However, this is only a stopgap and is not suitable when there are far too many IDs. Even if we can query the data quickly, a very large response still costs a lot in network transmission, so the interface's performance will not really improve.
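The batching step described above (split the IDs into chunks of at most 500, query each chunk, then merge the results) can be sketched as a small helper; the 500 cap comes from the text, everything else is illustrative:

```python
def chunk_ids(ids, batch_size=500):
    """Split a long ID list into batches of at most batch_size for separate queries."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

# 1,100 IDs become three IN-queries of 500, 500, and 100 IDs.
batches = chunk_ids(list(range(1, 1101)))
print([len(b) for b in batches])  # [500, 500, 100]
```

Each batch can then be dispatched (sequentially or on multiple threads, as the article suggests) and the partial results concatenated before returning.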
7 Incremental Querying
Sometimes, we need to query data through a remote interface and then synchronize it with another database.
Bad Example:
Directly fetching all the data and then synchronizing it may be convenient, but it brings a huge problem: when there is too much data, query performance will be very poor.
What can we do?
Good Example:
select * from user where id > #{lastId} and create_time >= #{lastCreateTime} limit 100;
Sort by ID and time in ascending order, and synchronize only one batch of 100 records at a time. After each batch completes, save the maximum ID and time from those 100 records as the starting point for the next batch.
This incremental querying approach improves the efficiency of each individual query.
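The loop can be sketched with sqlite3; for brevity this sketch keys the cursor on the ID alone (the article's version also tracks create_time), and the table contents are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, create_time TEXT)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?)", [(i, "2024-01-01") for i in range(1, 251)]
)

BATCH = 100
last_id, synced = 0, []
while True:
    rows = conn.execute(
        "SELECT id, create_time FROM user WHERE id > ? ORDER BY id ASC LIMIT ?",
        (last_id, BATCH),
    ).fetchall()
    if not rows:
        break
    synced.extend(rows)   # in real code: push this batch to the target database
    last_id = rows[-1][0]  # remember the cursor position for the next batch

print(len(synced))  # 250
```

250 rows are synchronized in three cheap queries (100 + 100 + 50) instead of one expensive full-table fetch.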
8 High-Efficiency Pagination
Sometimes, when we query data on a list page, we need to avoid returning too many records at once to prevent affecting interface performance. We usually do pagination processing on the query interface.
In MySQL, we often use the limit keyword for pagination:
select id, name, age from user limit 10,20;
If the data volume is small, paginating with the limit keyword is fine. However, when the data volume is very large, it causes performance problems.
For example, the pagination parameter has changed to:
select id, name, age from user limit 1000000,20;
MySQL will scan 1,000,020 records and then discard the first 1,000,000, returning only the last 20. This wastes a great deal of resources.
So, how can we paginate large amounts of data?
Optimize SQL:
select id, name, age from user where id > 1000000 limit 20;
First find the maximum ID of the previous page, then use the ID index to seek from there. Note, however, that this solution requires the IDs to be continuous and ordered.
We can also use between for pagination:
select id, name, age from user where id between 1000000 and 1000020;
Note that between should be used on a unique index for pagination; otherwise page sizes will be inconsistent.
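The two styles, offset pagination and ID-seek (keyset) pagination, can be compared in a sqlite3 sketch (table contents are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [(i, f"user{i}", 20 + i % 50) for i in range(1, 101)],
)

PAGE = 20
# Offset pagination: the engine must produce and then discard the skipped rows.
offset_page = conn.execute(
    "SELECT id FROM user ORDER BY id LIMIT ? OFFSET ?", (PAGE, 40)
).fetchall()
# Keyset pagination: seek directly past the last ID of the previous page.
keyset_page = conn.execute(
    "SELECT id FROM user WHERE id > ? ORDER BY id LIMIT ?", (40, PAGE)
).fetchall()

print(offset_page == keyset_page)  # True
```

Both queries return the same page (IDs 41 to 60), but the keyset version jumps straight to the starting ID via the primary-key index instead of walking and discarding the first 40 rows; at offset 1,000,000 that difference becomes dramatic.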
More Series Articles about Programming Skill:
https://wesley-wei.medium.com/list/programming-skills-81166d8d7555
And I’m Wesley, delighted to share knowledge from the world of programming.
Don’t forget to follow me for more informative content, or feel free to share this with others who may also find it beneficial. It would be a great help to me.
Give me some free applause, highlights, or replies; I pay attention to those reactions, and they determine whether I continue to post this type of article.
See you in the next article. 👋