I've been fighting this god-forsaken query for a week, and can come up with absolutely no explaination for its behaviour. Next time I see you, I'll buy you a beer if you can help me understand what's going on. There are two tables involved, and here are their CREATE statements:
CREATE TABLE `sequence` ( -- contains rows from -50,000 to 50,000 `num` int(11) NOT NULL, PRIMARY KEY (`num`) ) CREATE TABLE `count_datum` ( -- ~550K rows, ~22K rows with seriesId = 47 `id` int(10) unsigned NOT NULL auto_increment, `seriesId` int(10) unsigned NOT NULL, `timestamp` datetime NOT NULL default '0000-00-00 00:00:00', `count` double NOT NULL default '0', PRIMARY KEY (`id`), KEY `k_seriesId_timestamp` (`seriesId`,`timestamp`) )
And then we have the query itself. Note that one line is commented out.
select ts.ts as timestamp, ( select `count` from count_datum where seriesId = 47 and timestamp <= ts.ts -- and timestamp > ts.pts order by timestamp desc limit 1 ) as `count` from ( select '2009-04-27 8:00:00' + interval 20 * num minute as ts, '2009-04-27 8:00:00' + interval 20 * (num - 1) minute as pts from sequence where num between 0 and 12 ) ts
The `ts` subquery uses the sequence table to create a fixed-length list list of timestamps 20 minutes apart for four hours, starting at 8am on April 27th (in the `ts` column), along with the previous timestamp (in the `pts` column). Only the former is used in the query as it's written. It then uses the timestamps to pull the `count` column from the last record in the `count_datum` table recorded prior or equal to the timestamp. In plain English, it pull the last known value for every 20 minutes, regardless of how old that value was (might be a couple minutes, might be a couple hours). The query runs quite quickly as written, somewhere around 50ms on my server.
If you uncomment the commented out line, you'll get the same behaviour, except that it pulls the last known value at each timestamp, as long as it was recorded after the previous timestamp. To put that another way, if a given 20 minute window didn't have a value recorded with it you'll get a null back instead of a value from a previous window. This version of the query runs in around 7,000ms (seven seconds) on the same server.  Yes, well over 100 times slower.
Here is some example data from count_datum, followed by the result from running each version of the query. The rows that I've bolded are the ones that actually get returned by the queries, the unbolded rows are ignored (because they're not the last record in any target window).
raw data in count_datum | with carryover | without carryover -------------------------------+--------------------------------+------------------------------- timestamp count | timestamp count | timestamp count 2009-04-27 08:00:00 1 | 2009-04-27 08:00:00 1 | 2009-04-27 08:00:00 1 2009-04-27 08:50:00 1 | 2009-04-27 08:20:00 1 | 2009-04-27 08:20:00 NULL 2009-04-27 08:55:00 2 | 2009-04-27 08:40:00 1 | 2009-04-27 08:40:00 NULL 2009-04-27 09:55:00 0 | 2009-04-27 09:00:00 2 | 2009-04-27 09:00:00 2 2009-04-27 10:42:00 1 | 2009-04-27 09:20:00 2 | 2009-04-27 09:20:00 NULL 2009-04-27 10:45:00 2 | 2009-04-27 09:40:00 2 | 2009-04-27 09:40:00 NULL 2009-04-27 10:50:00 8 | 2009-04-27 10:00:00 0 | 2009-04-27 10:00:00 0 2009-04-27 10:52:00 7 | 2009-04-27 10:20:00 0 | 2009-04-27 10:20:00 NULL 2009-04-27 10:57:00 6 | 2009-04-27 10:40:00 0 | 2009-04-27 10:40:00 NULL 2009-04-27 12:00:00 2 | 2009-04-27 11:00:00 6 | 2009-04-27 11:00:00 6 | 2009-04-27 11:20:00 6 | 2009-04-27 11:20:00 NULL | 2009-04-27 11:40:00 6 | 2009-04-27 11:40:00 NULL | 2009-04-27 12:00:00 2 | 2009-04-27 12:00:00 2
So the 64 thousand dollar one beer question is "why does it do that?" Running each version through EXPLAIN returns exactly the same information, including rowcounts and index usage. Converting the pair of timestamp comparisons to a BETWEEN … AND yields a slight improvement (down to just under 6,000ms), but nothing to write home about. Here's the EXPLAIN:
id type table type keys key key_len ref rows extra 1 PRIMARY <derived3> ALL 13 3 DERIVED sequence range PRIMARY PRIMARY 4 13 Using where; Using index 2 DEPENDENT SUBQUERY count_datum ref k_seriesId_timestamp k_seriesId_timestamp 4 const 12224 Using where
I'm running on MySQL 5.0.45 on CentOS 5. Nearly identical behaviour on my old server which was CentOS 4 with an older MySQL (though I'm not sure which exactly).
Watching this!
I'm not a mysql expert and I only briefly looked at the SQL, but couldn't you use a left join and a subselect instead of a subselect in your select statement? I think if it's in your select it runs the query with every row (maybe?).
Something like this? I didnt test this, so it might not even work.
anthony,
That's certainly faster that the double comparison, but noticeably slower than the single comparison, and it doesn't do quite the same thing. It returns the maximum `count` value for each window in time, not the last `count` value. E.g., for the '2009-04-27 11:00:00' record your query would return 8, not 6.
A different approach, though, so I'll play with it some more and see if I can get it to do the right thing.
anthony, here's a tweaked version of your query that has the correct behaviour. Slightly slower than your original because of the extra layer.