Whether you are interviewing or working day-to-day, you are likely to run into SQL optimization: it is a hot topic that everyone pays attention to.
Medium Link: 8 Essential SQL Optimization Tips You Should Know | by Wesley Wei | Aug, 2024 | Programmer’s Career
Author: Wesley Wei – Medium
Preface
If one day you are responsible for an online interface that turns out to have performance issues, you will probably reach for SQL optimization first, because changing a SQL statement costs far less than changing code.
Then, how can we optimize the SQL statement?
This article shares 8 Essential SQL Optimization Tips, hoping to help you.
1 Avoid Using SELECT *
Many times, when writing a SQL statement, we like to use SELECT * to retrieve all columns at once, simply for convenience.
Counterexample:
SELECT * FROM user WHERE id = 1;
In actual business scenarios, we may only need one or two of those columns. Retrieving data we never use wastes database resources such as memory and CPU.
Moreover, the excess data increases the time spent transmitting results over network I/O.
Additionally, SELECT * can never be satisfied by a covering index: even when an index matches the WHERE clause, the server must go back to the table row for every hit, which can significantly reduce the performance of the SQL statement.
Example:
SELECT name, age FROM user WHERE id = 1;
When querying the database, only retrieve the columns that are actually needed. There is no need to retrieve unnecessary data.
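To make the difference concrete, here is a minimal sketch using Python's stdlib sqlite3 module (the `user` table and its columns are illustrative assumptions, not from the original article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, email TEXT)")
conn.execute("INSERT INTO user VALUES (1, 'Alice', 30, 'alice@example.com')")

# SELECT * drags every column across the wire...
wide = conn.execute("SELECT * FROM user WHERE id = 1").fetchone()
# ...while naming columns returns only what the caller actually needs.
narrow = conn.execute("SELECT name, age FROM user WHERE id = 1").fetchone()

print(len(wide), len(narrow))  # 4 2
```

The narrower result set means less memory on the server, less data on the network, and a better chance of the query being served entirely from an index.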
2 Use UNION ALL Instead of UNION
We all know that the UNION keyword in a SQL statement returns deduplicated results, whereas UNION ALL returns all rows, including duplicate records.
Counterexample:
(SELECT * FROM user WHERE id = 1) UNION (SELECT * FROM user WHERE id = 2);
The process of removing duplicates requires traversing, sorting, and comparing, which is more time-consuming and CPU-intensive.
Therefore, whenever UNION ALL is sufficient, it is best not to use UNION.
Example:
(SELECT * FROM user WHERE id = 1) UNION ALL (SELECT * FROM user WHERE id = 2);
Unless the business genuinely cannot tolerate duplicate rows, we should use UNION ALL.
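A small sqlite3 sketch makes the deduplication cost visible (the `user` table is an illustrative assumption); UNION must sort and compare rows to drop duplicates, while UNION ALL simply concatenates the two result sets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# UNION deduplicates, so two identical branches collapse to one row.
union_rows = conn.execute(
    "SELECT name FROM user WHERE id = 1 UNION SELECT name FROM user WHERE id = 1"
).fetchall()
# UNION ALL keeps both copies and skips the dedup work entirely.
union_all_rows = conn.execute(
    "SELECT name FROM user WHERE id = 1 UNION ALL SELECT name FROM user WHERE id = 1"
).fetchall()

print(len(union_rows), len(union_all_rows))  # 1 2
```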
3 Driving a Large Table with a Small Table
Driving a large table with a small table means using the data set from the small table to drive the data set from the large table.
For example, suppose we have two tables: order and user. The order table has 10,000 rows of data, while the user table has only 100 rows.
If we want to query all the orders placed by valid users, we can use the in keyword:
select * from order where user_id in (select id from user where status = 1);
Or we can use the exists keyword:
select * from order where exists (select 1 from user where order.user_id = user.id and status = 1);
In this scenario, using the in keyword is more suitable.
Why?
Because when the SQL statement contains the in keyword, the subquery inside in executes first, and then the outer query runs. If the data set inside in is small, the query will be faster.
On the other hand, when the SQL statement contains the exists keyword, the left side of exists (the main query) executes first, and each of its rows is then matched against the right side; rows with no match are filtered out.
In this requirement, the order table has 10,000 rows while the user table has only 100. The order table is large and the user table is small. With the order table on the left side, using the in keyword will be more efficient.
In summary:
- in is suitable for driving a large table with a small table.
- exists is suitable for driving a small table with a large table.
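Both forms can be tried side by side in a sqlite3 sketch (table contents are illustrative assumptions; `order` must be quoted because it is a reserved word):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, status INTEGER)")
conn.execute('CREATE TABLE "order" (id INTEGER PRIMARY KEY, user_id INTEGER)')
conn.executemany("INSERT INTO user VALUES (?, ?)", [(100, 1), (101, 0)])
conn.executemany('INSERT INTO "order" VALUES (?, ?)', [(1, 100), (2, 101), (3, 100)])

# in: the user subquery (small table) runs first and drives the order scan.
in_rows = conn.execute(
    'SELECT id FROM "order" WHERE user_id IN (SELECT id FROM user WHERE status = 1)'
).fetchall()

# exists: the outer order query (large table) runs first, probing user per row.
exists_rows = conn.execute(
    'SELECT id FROM "order" o WHERE EXISTS '
    "(SELECT 1 FROM user u WHERE o.user_id = u.id AND u.status = 1)"
).fetchall()

print(sorted(in_rows), sorted(exists_rows))
```

Both return the same orders (those of the valid user 100); which one is faster depends on which table is smaller, as described above.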
4 Batch Operations
If you have a batch of data that has been processed by business logic and needs to be inserted into the database, how do you handle it?
Anti-pattern:
for (Order order : list) { orderMapper.insert(order); }
Inserting data one by one in a loop.
insert into order(id, code, user_id) values (123, '001', 100);
This operation needs to request the database multiple times to complete the batch insertion.
However, every remote database request carries some overhead; if our code needs many round trips to complete a single business function, that overhead adds up quickly.
So how do we optimize?
Best practice:
orderMapper.insertBatch(list);
Provide a batch insertion method.
insert into order(id, code, user_id) values (123, '001', 100), (124, '002', 100), (125, '003', 101);
This way only needs to request the database once, and the SQL performance will be improved. The more data there is, the bigger the improvement will be.
However, note that it is not recommended to insert too much data at one time; if a single batch is too large, the database will respond slowly. Control the batch size, keeping each batch to no more than about 500 rows, and split anything larger into multiple batches.
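The chunked batch insert can be sketched with sqlite3 and `executemany` (the `order` table and the 500-row cap follow the text; the sample data is an assumption):

```python
import sqlite3

BATCH_SIZE = 500  # cap each multi-row INSERT at ~500 rows, per the advice above

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "order" (id INTEGER PRIMARY KEY, code TEXT, user_id INTEGER)')

orders = [(i, f"{i:03d}", 100 + i % 10) for i in range(1, 1201)]

# One multi-row INSERT per chunk instead of one round trip per row.
for start in range(0, len(orders), BATCH_SIZE):
    chunk = orders[start:start + BATCH_SIZE]
    conn.executemany('INSERT INTO "order" (id, code, user_id) VALUES (?, ?, ?)', chunk)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM "order"').fetchone()[0]
print(count)  # 1200
```

Here 1,200 rows land in three round trips (500 + 500 + 200) instead of 1,200.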
5 Using Limit
Sometimes, we only need the first row of a result set, such as querying a user's first order to see when they made their first purchase.
Anti-pattern:
select id, create_date from order where user_id = 123 order by create_date asc;
Query all orders for the specified user, sort them by creation date, and then get the first element in code to obtain the first order’s data.
List<Order> list = orderMapper.getOrderList();
Order order = list.get(0);
Although this approach does not have functional issues, it is very inefficient and wastes resources.
Best practice:
select id, create_date from order where user_id = 123 order by create_date asc limit 1;
Use limit 1 to return only the first row that meets the condition.
Additionally, when deleting or modifying data, it is also a good idea to append limit to the SQL statement to prevent accidental deletion or modification of unrelated rows.
For example:
update order set status = 0, edit_time = now(3) where id >= 100 and id < 200 limit 100;
This way, even if there is an error and the ID is incorrect, it will not affect too many records.
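The SELECT half of this tip can be sketched with sqlite3 (sample data is an assumption; note that `UPDATE ... LIMIT` is a MySQL extension and is not shown here because standard SQLite builds do not support it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "order" (id INTEGER PRIMARY KEY, user_id INTEGER, create_date TEXT)')
conn.executemany(
    'INSERT INTO "order" VALUES (?, ?, ?)',
    [(1, 123, "2024-03-01"), (2, 123, "2024-01-15"), (3, 123, "2024-02-10")],
)

# limit 1 lets the engine stop as soon as the first match is produced,
# instead of materializing the whole sorted result set in the client.
first = conn.execute(
    'SELECT id, create_date FROM "order" WHERE user_id = 123 '
    "ORDER BY create_date ASC LIMIT 1"
).fetchone()
print(first)  # (2, '2024-01-15')
```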
6 Too Many Values in IN
When we build a batch query interface, we usually use the IN keyword to filter data, for example querying user information for a given list of IDs.
The SQL statement is as follows:
select id, name from category where id in (1, 2, 3…100000000);
If we don’t limit anything, this query may take a very long time and even cause the interface to timeout.
What can we do?
We can use limit in SQL to cap the amount of data:
select id, name from category where id in (1, 2, 3…100) limit 500;
However, we often need to add limits in our business code. The pseudo-code is as follows:
public List<Category> getCategory(List<Long> ids) {
    if (ids.size() > 500) {
        throw new IllegalArgumentException("ids must not exceed 500");
    }
    return categoryMapper.getCategoryList(ids);
}
Another approach is to divide the IDs into batches and use multiple threads to query the data. Each batch only queries 500 records, and then combines the query results to return.
However, this is only a stopgap and is not suitable when there are far too many IDs. Even if we can query the data quickly, a very large response still costs a lot in network transmission, so the interface's performance will not really improve.
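The batching step described above (split the IDs into chunks of at most 500, query each chunk, then merge the results) can be sketched as a small helper; the 500 cap comes from the text, everything else is illustrative:

```python
def chunk_ids(ids, batch_size=500):
    """Split a long ID list into batches of at most batch_size for separate queries."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

# 1,100 IDs become three IN-queries of 500, 500, and 100 IDs.
batches = chunk_ids(list(range(1, 1101)))
print([len(b) for b in batches])  # [500, 500, 100]
```

Each batch can then be dispatched (sequentially or on multiple threads, as the article suggests) and the partial results concatenated before returning.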
7 Incremental Querying
Sometimes, we need to query data through a remote interface and then synchronize it with another database.
Bad Example:
Directly fetching all the data and then synchronizing it may be convenient, but it brings a huge problem: when there is too much data, query performance will be very poor.
What can we do?
Good Example:
select * from user where id > #{lastId} and create_time >= #{lastCreateTime} limit 100;
Sort by ID and time in ascending order, and synchronize only one batch of 100 records at a time. After each batch completes, save the maximum ID and time from those 100 records as the starting point for the next batch.
This incremental querying approach improves the efficiency of each individual query.
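The loop can be sketched with sqlite3; for brevity this sketch keys the cursor on the ID alone (the article's version also tracks create_time), and the table contents are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, create_time TEXT)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?)", [(i, "2024-01-01") for i in range(1, 251)]
)

BATCH = 100
last_id, synced = 0, []
while True:
    rows = conn.execute(
        "SELECT id, create_time FROM user WHERE id > ? ORDER BY id ASC LIMIT ?",
        (last_id, BATCH),
    ).fetchall()
    if not rows:
        break
    synced.extend(rows)   # in real code: push this batch to the target database
    last_id = rows[-1][0]  # remember the cursor position for the next batch

print(len(synced))  # 250
```

250 rows are synchronized in three cheap queries (100 + 100 + 50) instead of one expensive full-table fetch.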
8 High-Efficiency Pagination
Sometimes, when we query data on a list page, we need to avoid returning too many records at once to prevent affecting interface performance. We usually do pagination processing on the query interface.
In MySQL, we often use the limit keyword for pagination:
select id, name, age from user limit 10,20;
If the data volume is small, paginating with the limit keyword is fine. However, when the data volume is very large, it causes performance problems.
For example, the pagination parameter has changed to:
select id, name, age from user limit 1000000,20;
MySQL will scan 1,000,020 records and then discard the first 1,000,000, returning only the last 20. This wastes a great deal of resources.
So, how can we paginate large amounts of data?
Optimize SQL:
select id, name, age from user where id > 1000000 limit 20;
First find the maximum ID of the previous page, then use the ID index to seek from there. Note, however, that this solution requires the IDs to be continuous and ordered.
We can also use between for pagination:
select id, name, age from user where id between 1000000 and 1000020;
Note that between should be used on a unique index for pagination; otherwise page sizes will be inconsistent.
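The two styles, offset pagination and ID-seek (keyset) pagination, can be compared in a sqlite3 sketch (table contents are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [(i, f"user{i}", 20 + i % 50) for i in range(1, 101)],
)

PAGE = 20
# Offset pagination: the engine must produce and then discard the skipped rows.
offset_page = conn.execute(
    "SELECT id FROM user ORDER BY id LIMIT ? OFFSET ?", (PAGE, 40)
).fetchall()
# Keyset pagination: seek directly past the last ID of the previous page.
keyset_page = conn.execute(
    "SELECT id FROM user WHERE id > ? ORDER BY id LIMIT ?", (40, PAGE)
).fetchall()

print(offset_page == keyset_page)  # True
```

Both queries return the same page (IDs 41 to 60), but the keyset version jumps straight to the starting ID via the primary-key index instead of walking and discarding the first 40 rows; at offset 1,000,000 that difference becomes dramatic.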
More Series Articles about Programming Skill:
https://wesley-wei.medium.com/list/programming-skills-81166d8d7555
And I’m Wesley, delighted to share knowledge from the world of programming.
Don’t forget to follow me for more informative content, or feel free to share this with others who may also find it beneficial. It would be a great help to me.
Give me some free applause, highlights, or replies; I pay attention to those reactions, and they determine whether I continue to post this type of article.
See you in the next article. 👋