长河落日圆: Mysql中查询相关优化分析

无意中看到一个SQL的对比分析，关于独立子查询和相关子查询对最终结果集的耗时对比。主要的SQL集中在in (select ...)下面分析一下此类SQL优化的原理和方法；

子查询

当一个查询是另外一个查询的条件时，此时称之为子查询；

子查询又分为独立子查询和相关子查询，顾名思义，独立子查询表示该子查询可以独立完成，不依赖于外部变量；而相关子查询则依赖于外部数据传入；

慢查询日志

Mysql可以通过修改/etc/my.cnf来设定慢查询日志。修改my.cnf文件：

slow_query_log = 1
slow_query_log_file = /tmp/mysqlslow.log
long_query_time = 0.0001
log-queries-not-using-indexes
log-slow-admin-statements
log-slow-slave-statements

然后重启mysql

#>service mysqld restart

我们设定的超时时间是0.0001秒，当查询超过这个时间就会打印慢查询日志。

实例

查询一张表，查询出每月最后一天的数据并展示；

我们可能第一时间写出的sql类似于这样：

SELECT * FROM orders where date in (select max(date) from orders group by substring(orders.date, -10, 7));

注意，这个sql语句用到了in，按照我们自然的想法，我们会认为这个sql将会先执行in后面的子查询，再执行外部查询。但实际上是这样吗？
我们用explain解释一下这个sql，返回：

mysql> explain SELECT * FROM orders where date in (select max(date) from orders group by substring(orders.date, -10, 7));
+----+--------------------+----------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type        | table    | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+--------------------+----------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY            | orders   | ALL  | NULL          | NULL | NULL    | NULL |   93 | Using where                     |
|  2 | DEPENDENT SUBQUERY | orders   | ALL  | NULL          | NULL | NULL    | NULL |   93 | Using temporary; Using filesort |
+----+--------------------+----------+------+---------------+------+---------+------+------+---------------------------------+

我们看到后面的子查询是一个DEPENDENT SUBQUERY，也就是一个相关子查询。
原来，在mysql执行这个sql语句时，该语句被翻译为下面的语句执行：

SELECT * FROM orders where date exists (select max(date) from orders group by substring(orders.date, -10, 7) having `max(date)`=date);

解释之后是作为一个相关子查询来执行，假设外部查询的时间复杂度是O(M), 相关子查询的时间复杂度是O(N), 则此查询的时间复杂度不是O(M+N)而是O(M+M*N)
通过慢查询日志我们也可以看到该子查询的逻辑查询次数：

# Time: 140701 15:28:05
# User@Host: root[root] @ localhost []
# Query_time: 0.015520  Lock_time: 0.000162 Rows_sent: 4  Rows_examined: 9480
SET timestamp=1404199685;
SELECT * FROM orders where date in (select max(date) from orders group by substring(orders.date, -10, 7));

我们当前的表行数只有93行，但是这个sql执行时逻辑查询次数竟然达到了9480次。因此in的效率很低，最好尽量避免使用in语句。

优化

可以通过降低相关子查询表的条数来降低整体逻辑查询次数，例如这个sql可以优化为：

SELECT * FROM orders A where exists (select * from (select max(date) from orders group by substring(orders.date, -10, 7)) B where B.`max(date)`=A.date);

解释这个sql可以看到：

+----+--------------------+------------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type        | table      | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+--------------------+------------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY            | A          | ALL  | NULL          | NULL | NULL    | NULL |   93 | Using where                     |
|  2 | DEPENDENT SUBQUERY |  | ALL  | NULL          | NULL | NULL    | NULL |    4 | Using where                     |
|  3 | DERIVED            | youku_tj   | ALL  | NULL          | NULL | NULL    | NULL |   93 | Using temporary; Using filesort |
+----+--------------------+------------+------+---------------+------+---------+------+------+---------------------------------+

相关子查询的行数只有4行了。大大降低了逻辑查询次数，从慢查询日志也反映了这个改变：

# Time: 140701 15:27:27
# User@Host: root[root] @ localhost []
# Query_time: 0.000915  Lock_time: 0.000201 Rows_sent: 4  Rows_examined: 560
SET timestamp=1404199647;
SELECT * FROM orders A where exists (select * from (select max(date) from orders group by substring(orders.date, -10, 7)) B where B.`max(date)`=A.date);

因此当通过explain查看sql的执行效率时，要注意DEPENDENT SUBQUERY对效率的影响，从两个SQL的执行效率看，改进后的执行之间是0.00915s，改进之前是0.015520s 还是影响比较大的，尤其对行数很大的表影响尤其明显；

长河落日圆

2014年7月1日星期二

Mysql中查询相关优化分析

子查询

慢查询日志

实例

优化

没有评论:

发表评论