Holding Out for a [MySQL] Hero

By barneyb on April 27, 2009

I've been fighting this god-forsaken query for a week, and can come up with absolutely no explaination for its behaviour.Â Next time I see you, I'll buy you a beer if you can help me understand what's going on.Â There are two tables involved, and here are their CREATE statements:

CREATE TABLE `sequence` ( -- contains rows from -50,000 to 50,000
  `num` int(11) NOT NULL,
  PRIMARY KEY  (`num`)
)
CREATE TABLE `count_datum` ( -- ~550K rows, ~22K rows with seriesId = 47
  `id` int(10) unsigned NOT NULL auto_increment,
  `seriesId` int(10) unsigned NOT NULL,
  `timestamp` datetime NOT NULL default '0000-00-00 00:00:00',
  `count` double NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `k_seriesId_timestamp` (`seriesId`,`timestamp`)
)

And then we have the query itself.Â Note that one line is commented out.

select ts.ts as timestamp,
  (
    select `count`
    from count_datum
    where seriesId = 47
      and timestamp <= ts.ts
--    and timestamp > ts.pts
    order by timestamp desc
    limit 1
  ) as `count`
from (
    select '2009-04-27 8:00:00' + interval 20 * num minute as ts,
      '2009-04-27 8:00:00' + interval 20 * (num - 1) minute as pts
    from sequence
    where num between 0 and 12
  ) ts

The `ts` subquery uses the sequence table to create a fixed-length list list of timestamps 20 minutes apart for four hours, starting at 8am on April 27th (in the `ts` column), along with the previous timestamp (in the `pts` column).Â Only the former is used in the query as it's written.Â It then uses the timestamps to pull the `count` column from the last record in the `count_datum` table recorded prior or equal to the timestamp.Â In plain English, it pull the last known value for every 20 minutes, regardless of how old that value was (might be a couple minutes, might be a couple hours).Â The query runs quite quickly as written, somewhere around 50ms on my server.

If you uncomment the commented out line, you'll get the same behaviour, except that it pulls the last known value at each timestamp, as long as it was recorded after the previous timestamp.Â To put that another way, if a given 20 minute window didn't have a value recorded with it you'll get a null back instead of a value from a previous window.Â This version of the query runs in around 7,000ms (seven seconds) on the same server. Â Yes, well over 100 times slower.

Here is some example data from count_datum, followed by the result from running each version of the query.Â The rows that I've bolded are the ones that actually get returned by the queries, the unbolded rows are ignored (because they're not the last record in any target window).

raw data in count_datum        | with carryover                 | without carryover
-------------------------------+--------------------------------+-------------------------------
timestamp                count | timestamp                count | timestamp                count
2009-04-27 08:00:00          1 | 2009-04-27 08:00:00          1 | 2009-04-27 08:00:00          1
2009-04-27 08:50:00          1 | 2009-04-27 08:20:00          1 | 2009-04-27 08:20:00       NULL
2009-04-27 08:55:00          2 | 2009-04-27 08:40:00          1 | 2009-04-27 08:40:00       NULL
2009-04-27 09:55:00          0 | 2009-04-27 09:00:00          2 | 2009-04-27 09:00:00          2
2009-04-27 10:42:00          1 | 2009-04-27 09:20:00          2 | 2009-04-27 09:20:00       NULL
2009-04-27 10:45:00          2 | 2009-04-27 09:40:00          2 | 2009-04-27 09:40:00       NULL
2009-04-27 10:50:00          8 | 2009-04-27 10:00:00          0 | 2009-04-27 10:00:00          0
2009-04-27 10:52:00          7 | 2009-04-27 10:20:00          0 | 2009-04-27 10:20:00       NULL
2009-04-27 10:57:00          6 | 2009-04-27 10:40:00          0 | 2009-04-27 10:40:00       NULL
2009-04-27 12:00:00          2 | 2009-04-27 11:00:00          6 | 2009-04-27 11:00:00          6
                               | 2009-04-27 11:20:00          6 | 2009-04-27 11:20:00       NULL
                               | 2009-04-27 11:40:00          6 | 2009-04-27 11:40:00       NULL
                               | 2009-04-27 12:00:00          2 | 2009-04-27 12:00:00          2

So the 64 thousand dollar one beer question is "why does it do that?"Â Running each version through EXPLAIN returns exactly the same information, including rowcounts and index usage.Â Converting the pair of timestamp comparisons to a BETWEEN … AND yields a slight improvement (down to just under 6,000ms), but nothing to write home about.Â Here's the EXPLAIN:

id  type                table       type    keys                    key                     key_len     ref     rows    extra
1   PRIMARY             <derived3>  ALL                                                                         13
3   DERIVED             sequence    range   PRIMARY                 PRIMARY                 4                   13      Using where; Using index
2   DEPENDENT SUBQUERY  count_datum ref     k_seriesId_timestamp    k_seriesId_timestamp    4           const   12224   Using where

I'm running on MySQL 5.0.45 on CentOS 5.Â Nearly identical behaviour on my old server which was CentOS 4 with an older MySQL (though I'm not sure which exactly).

select ts.ts as timestamp, max(`counted`.`count`) as `count` from ( select '2009-04-27 8:00:00' + interval 20 * num minute as ts, '2009-04-27 8:00:00' + interval 20 * (num - 1) minute as pts from sequence where num between 0 and 12 ) ts left join ( select `count`, `timestamp` from count_datum where seriesId = 47 order by timestamp desc ) as `counted` on `counted`.`timestamp` <= ts.ts and `counted`.`timestamp` > ts.pts group by ts.ts

select raw.timestamp, cd.`count` from ( select ts.ts as timestamp, max(d.timestamp) as datumTs from ( select '2008-04-27 8:00:00' + interval 20 * num minute as ts, '2008-04-27 8:00:00' + interval 20 * (num - 1) minute as pts from sequence where num between 0 and 12 ) ts left outer join ( select timestamp from count_datum where seriesId = 47 ) d on d.timestamp > ts.pts and d.timestamp <= ts.ts group by ts.ts ) raw left outer join count_datum cd on cd.seriesId = 47 and cd.timestamp = raw.datumTs

4 responses to “Holding Out for a [MySQL] Hero”

Joe Zack
April 28, 2009 at 8:50 am | Permalink

Watching this!

anthony

April 28, 2009 at 3:03 pm | Permalink

I'm not a mysql expert and I only briefly looked at the SQL, but couldn't you use a left join and a subselect instead of a subselect in your select statement? I think if it's in your select it runs the query with every row (maybe?).

Something like this? I didnt test this, so it might not even work.

barneyb
April 28, 2009 at 3:19 pm | Permalink

anthony,

That's certainly faster that the double comparison, but noticeably slower than the single comparison, and it doesn't do quite the same thing. It returns the maximum `count` value for each window in time, not the last `count` value. E.g., for the '2009-04-27 11:00:00' record your query would return 8, not 6.

A different approach, though, so I'll play with it some more and see if I can get it to do the right thing.

barneyb

April 28, 2009 at 5:11 pm | Permalink

anthony, here's a tweaked version of your query that has the correct behaviour. Slightly slower than your original because of the extra layer.

Holding Out for a [MySQL] Hero

4 responses to “Holding Out for a [MySQL] Hero”

Recent Posts

Categories

Archives