mysql-不太同步时对齐时间戳

我有以下三个系列的表中定义的3个进程A,B和C:

http://sqlfiddle.com/#!2/48f54

CREATE TABLE processA
(date_time datetime, valueA int);

INSERT INTO processA
                (date_time, valueA)
VALUES
('2013-1-8  22:10:00', 100),
('2013-1-8  22:15:00', 100),
('2013-1-8  22:30:00', 100),
('2013-1-8  22:35:00', 100),
('2013-1-8  22:40:00', 100),
('2013-1-8  22:45:00', 100),
('2013-1-8  22:50:00', 100),
('2013-1-8  23:05:00', 100),
('2013-1-8  23:10:00', 100),
('2013-1-8  23:20:00', 100),
('2013-1-8  23:25:00', 100),
('2013-1-8  23:35:00', 100),
('2013-1-8  23:40:00', 100),
('2013-1-9  00:05:00', 100),
('2013-1-9  00:10:00', 100);


CREATE TABLE processB
(date_time datetime, valueB decimal(4,2));

INSERT INTO processB
                (date_time, valueB)
VALUES
('2013-1-08  21:46:00', 3),
  ('2013-1-08  22:11:00', 4),
  ('2013-1-08  22:31:00', 5),
  ('2013-1-08  22:36:00', 6),
  ('2013-1-08  22:41:00', 7),
  ('2013-1-08  23:06:00', 8),
  ('2013-1-08  23:20:00', 2),
  ('2013-1-08  23:46:00', 3),
  ('2013-1-09  00:34:00', 9);


CREATE TABLE processC
(date_time datetime, status varchar(4));

INSERT INTO processC

VALUES
('2013-1-08 18:00:00', 'yes'),
('2013-1-08 19:00:00', 'yes'),
('2013-1-08 20:00:00', 'yes'),
('2013-1-08 21:00:00', 'yes'),
('2013-1-08 22:00:00', 'yes'),
('2013-1-08 23:00:00', 'no'),
('2013-1-08 00:00:00', 'no'),
('2013-1-08 01:00:00', 'no');

如您所见,每个过程的读数发生时间都不相同.

> ProcessA(如果发生),每隔5分钟进行一次
> ProcessB,读数发生在不可预测的时间,但通常在一小时内发生多次
> ProcessC将始终具有小时值(是或否).

首先,我想转换processB以便每隔5分钟读取一次数据,以便数据与processA对齐,然后使我能够以5分钟间隔标记对两个表进行简单连接.为了进行转换,应将每5分钟的数据设置为[-30,30)分钟窗口内最近的processB观测值.如果值是等距的,则取平均值.如果30分钟内没有可用的窗口,则将其设置为null.

一旦有了这些,我就可以使用ProcessC在%Y%m%d%H上进行简单的连接,使用如下所示的方法来获得一个最终表,其中所有数据都以5分钟的间隔对齐:

date_format(date_time, '%Y%m%d%H') = date_format(date_time, '%Y%m%d%H')

如果有人有任何指示/指导,我将不胜感激.我很感激.

样本输出:

'2013-1-8  22:10:00', 100, 4, yes    <--- closer to 22:11 than 21:46
'2013-1-8  22:15:00', 100, 4, yes    <--- closer to 22:11 than 21:31
'2013-1-8  22:30:00', 100, 5, yes    <--- closer to 22:31 than 22:11
'2013-1-8  22:35:00', 100, 6, yes    <--- closer to 22:36 than 22:31
'2013-1-8  22:40:00', 100, 7, yes    <--- closer to 22:41 than 22:36
'2013-1-8  22:45:00', 100, 7, yes    <--- closer to 22:41 than 23:06
'2013-1-8  22:50:00', 100, 7, yes    <--- closer to 22:41 than 23:06
'2013-1-8  23:05:00', 100, 8, yes    <--- closer to 23:06 than 23:06
'2013-1-8  23:10:00', 100, 8, no     <--- closer to 23:06 than 23:20
'2013-1-8  23:20:00', 100, 2, no     <--- closer to 23:20 than 23:10
'2013-1-8  23:25:00', 100, 2, no    <--- closer to 23:20 than 23:10
'2013-1-8  23:35:00', 100, 3, no    <--- closer to 23:46 than 23:20
'2013-1-9  00:05:00', 100, 3, no    <--- closer to 23:46 than 00:34
'2013-1-9  00:10:00', 100, 6, no    <--- takes the avg of 3 and 9

解决方法:

棘手的部分是从processB中检索与您所发现的processA的每一行相对应的适当行.

让我们逐步进行.

首先,我们需要能够加入processA和processB来检索候选时间戳对.让我们这样做:

               SELECT a.date_time a, 
                      TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) timediff
                 FROM processA a
                 JOIN processB b 
                   ON TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) >= -1800
                  AND TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) <   1800

这使我们满足了[-30,30)标准的a和b时间.这个结果有很多行;但是我们可以对其进行检查,以确保我们正确完成了范围比较. http://sqlfiddle.com/#!2/48f54/47/0

现在,我们需要生成时间窗口,以为您的一个或多个匹配b记录搜索每个记录.像这样

       SELECT a, 
              MIN(ABS(timediff)) windowsize
          FROM (
               SELECT a.date_time a, 
                      TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) timediff
                 FROM processA a
                 JOIN processB b 
                   ON TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) >= -1800
                  AND TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) <   1800
         ) d
        GROUP BY a  

这将产生两列:第一列是距a的时间戳,第二列是范围内最接近的b时间戳(或多个时间戳,如果要平均多个,则为时间戳)的时间范围.该结果集的记录没有任何行,而记录的b记录不足以考虑. http://sqlfiddle.com/#!2/48f54/46/0

最后,我们需要检索每条记录的b条记录值并取其平均值.这是这里.

SELECT processA.date_time date_time,
       processA.valueA valueA,
       AVG(processB.valueB) valueB
  FROM processA
  LEFT JOIN (
        SELECT a, 
               MIN(ABS(timediff)) windowsize
              FROM (
                   SELECT a.date_time a, 
                          TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) timediff
                     FROM processA a
                     JOIN processB b 
                       ON TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) >= -1800
                      AND TIMESTAMPDIFF(SECOND, a.date_time, b.date_time) <   1800
             ) d
          GROUP BY a
    ) j ON processA.date_time = j.a
   LEFT JOIN processB ON (    processB.date_time >= j.a - INTERVAL j.windowsize SECOND
                          AND processB.date_time <= j.a + INTERVAL j.windowsize SECOND
                          AND processB.date_time < j.a + INTERVAL 1800 SECOND)
  GROUP BY processA.date_time, processA.valueA

请注意,这里有几个开放范围(<运算符而不是< =运算符).这些是为了适应您的[-30,30)开阔范围.这是查询. http://sqlfiddle.com/#!2/48f54/45/0

这个最终查询将三个表连接在一起:processA,我们的虚拟表,它显示每个时间戳的搜索范围,以及processB.最后一个ON子句执行实际的范围搜索.开放范围使其变得更加复杂.

看看情况如何?从内到外构造查询是有帮助的.

不要忘记在processB.date_time上添加索引.

我很乐意将processC的联接留给该虚拟表.

上一篇:linux-UBIFS将文件同步到闪存需要多长时间


下一篇:阅读时PHP下载文件