早上同事要我写个MySQL去除重复数据的SQL,想起来上次写过一篇MySQL去除重复数据的博客,使用导入导出加唯一索引实现的,但是那种方式对业务影响较大,所以重新写一个存储过程来删重复数据,这一写就写了一个上午,这种BUG确实是很令人沮丧和浪费时间的。
这里把流程简单的描述一下,删重复数据的逻辑很简单:
1.根据重复判断条件找出重复记录的最小主键(一般是ID列)。
2.在符合重复条件的记录中,把主键大于最小主键的记录全部删掉即可。
假设我有如下表,需要删除start_time和end_time都一样的重复记录。
那么存储过程如下:(需要注意的是mysql存储过程和函数的创建对于缩进的要求极为严格,同样的代码如果直接copy很可能因为缩进的问题出现语法错误,要注意这一点。)
DELIMITER //
DROP PROCEDURE IF EXISTS Del_Dup_FOR_TEST;
CREATE PROCEDURE Del_Dup_FOR_TEST()
BEGIN
DECLARE min_id INT;
DECLARE v_start_time,v_end_time DATETIME;
DECLARE v_count INT;
DECLARE done INT DEFAULT 0;
DECLARE my_cur CURSOR FOR SELECT start_time,end_time,min(id),count(1) AS count FROM leo.test GROUP BY start_time,end_time HAVING count>1;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
OPEN my_cur;
myloop: LOOP
FETCH my_cur INTO v_start_time,v_end_time,min_id,v_count;
IF done=1 THEN
LEAVE myloop;
END IF;
DELETE FROM leo.test WHERE start_time=v_start_time AND end_time=v_end_time AND id>min_id;
COMMIT;
END LOOP myloop;
CLOSE my_cur;
END;
//
DELIMITER ;
逻辑很清晰,就是根据重复判断条件依次删掉重复组中主键大于最小主键的记录们。
但是在编写过程中却遇到一个很恶心的BUG,我最初的内容是这么写的:
DELIMITER //
DROP PROCEDURE IF EXISTS Del_Dup_FOR_TEST;
CREATE PROCEDURE Del_Dup_FOR_TEST()
BEGIN
DECLARE min_id INT;
DECLARE start_time,end_time DATETIME;
DECLARE count INT;
DECLARE done INT DEFAULT 0;
DECLARE my_cur CURSOR FOR SELECT start_time,end_time,min(id),count(1) AS count FROM leo.test GROUP BY start_time,end_time HAVING count>1;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
OPEN my_cur;
myloop: LOOP
FETCH my_cur INTO start_time,end_time,min_id,count;
IF done=1 THEN
LEAVE myloop;
END IF;
DELETE FROM leo.test WHERE start_time=start_time AND end_time=end_time AND id>min_id;
COMMIT;
END LOOP myloop;
CLOSE my_cur;
END;
//
DELIMITER ;
不同的部分在于变量定义的名称,即:
FETCH INTO的变量名绝对不能是你定义CURSOR时SQL语句查出来的列名或者列别名,也就说你定义的变量名既不能是表中已经存在的列名,也不能是你定义游标时用过的别名(如本例中的count),只要一个条件不符合,FETCH INTO就把全部的变量赋NULL值,这点你可以尝试在FETCH INTO后加一句Select打印变量名验证。
在查询到这个BUG之前去官网页面特地看了一下是否是我的语法有错误:https://dev.mysql.com/doc/refman/5.5/en/cursors.html ,确信语法没问题,但倒数第二条评论显示可能是列名的隐藏BUG,最后一条评论反驳了BUG说法,但没有办法我还是根据BUG REPORT做了以上修改,然后功能就正常了。
关于此BUG的BUG报告页面详见MySQL BUG:#28227 和 BUG:#5967
那么再回头看一下官网文档下的最后一条评论,开始我认为最后一条反驳BUG的评论完全是扯淡,是哪个傻X说这不是个BUG的?后来仔细想了想,他俩都对,这确实也算个BUG,傻X的也是我。
贴一下页面下最后两条评论(截止2018.08.01):
Posted by Brent Roady on May 9, 2012
It should be noted that the local variable names used in FETCH [cursor] INTO must be different than the variable names used in the SELECT statement
defining the CURSOR. Otherwise the values will be NULL.
In this example,
DECLARE a VARCHAR(255);
DECLARE cur1 CURSOR FOR SELECT a FROM table1;
FETCH cur1 INTO a;
the value of a after the FETCH will be NULL.
This is also described here: http://bugs.mysql.com/bug.php?id=28227 Posted by Jérémi Lassausaie on February 3, 2015
Answer for Brent Roady :
I don't see any bug in the bahaviour described.
DECLARE a VARCHAR(255);
/* you declare a variable "a" without a specified default value, a=NULL */
DECLARE cur1 CURSOR FOR
SELECT a FROM table1;
/* You declare a cursor that selects "a" FROM a table */
OPEN cur1;
/* You execute your cursor query, a warning is raised because a is ambiguously defined but you don't see it */
FETCH cur1 INTO a;
/* you put your unique field in your unique row into a (basically you do "SET a=a;") so a is still NULL */
There is no bug report, just a misunderstanding.
Brent遇到的现象与我相同,并列出了BUG Report的链接。
Jeremi(猜测可能是个程序员)回答,这是一个显而易见的误解,当你声明了变量a(初始值为NULL),然后FETCH INTO a就相当于set a=a,在任何程序语言中这都是无解的。
因此在编写存储过程中为定义的变量加个前缀标识是很好的习惯,想起以前Oracle写存储过程确实都加v_前缀,SQL Server 都用@前缀,现在轮到mysql却忽略了,确实需要牢记下。