Holding Out for a [MySQL] Hero

I've been fighting this god-forsaken query for a week, and I can come up with absolutely no explanation for its behaviour.  Next time I see you, I'll buy you a beer if you can help me understand what's going on.  There are two tables involved; here are their CREATE statements:

CREATE TABLE `sequence` ( -- contains rows from -50,000 to 50,000
  `num` int(11) NOT NULL,
  PRIMARY KEY  (`num`)
);
CREATE TABLE `count_datum` ( -- ~550K rows, ~22K rows with seriesId = 47
  `id` int(10) unsigned NOT NULL auto_increment,
  `seriesId` int(10) unsigned NOT NULL,
  `timestamp` datetime NOT NULL default '0000-00-00 00:00:00',
  `count` double NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `k_seriesId_timestamp` (`seriesId`,`timestamp`)
);
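
For reference, here's one quick way to populate `sequence` with -50,000 through 50,000.  This isn't necessarily how I originally built it; the `digits` table is purely illustrative scaffolding:

CREATE TABLE digits (d int NOT NULL PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- five digit positions give 0..99,999; shifting down yields -50,000..49,999
INSERT INTO sequence (num)
SELECT a.d + 10*b.d + 100*c.d + 1000*e.d + 10000*f.d - 50000
FROM digits a, digits b, digits c, digits e, digits f;
INSERT INTO sequence (num) VALUES (50000); -- and the top of the range
DROP TABLE digits;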

And then we have the query itself.  Note that one line is commented out.

select ts.ts as timestamp,
  (
    select `count`
    from count_datum
    where seriesId = 47
      and timestamp <= ts.ts
--    and timestamp > ts.pts
    order by timestamp desc
    limit 1
  ) as `count`
from (
    select '2009-04-27 8:00:00' + interval 20 * num minute as ts,
      '2009-04-27 8:00:00' + interval 20 * (num - 1) minute as pts
    from sequence
    where num between 0 and 12
  ) ts

The `ts` subquery uses the sequence table to create a fixed-length list of timestamps 20 minutes apart for four hours, starting at 8am on April 27th (in the `ts` column), along with the previous timestamp (in the `pts` column).  Only the former is used in the query as written.  The outer query then uses each timestamp to pull the `count` column from the last record in `count_datum` recorded at or before that timestamp.  In plain English, it pulls the last known value as of every 20-minute mark, regardless of how old that value is (might be a couple minutes, might be a couple hours).  The query runs quite quickly as written, somewhere around 50ms on my server.

If you uncomment the commented-out line, you get the same behaviour, except that it only pulls the last known value at each timestamp if it was recorded after the previous timestamp.  To put that another way, if a given 20-minute window didn't have a value recorded within it, you get a NULL back instead of a value carried over from a previous window.  This version of the query runs in around 7,000ms (seven seconds) on the same server.  Yes, well over 100 times slower.

Here is some example data from count_datum, followed by the result from running each version of the query.  The rows I've marked with an asterisk are the ones that actually get returned by the queries; the unmarked rows are ignored (because they're not the last record in any target window).

raw data in count_datum        | with carryover                 | without carryover
-------------------------------+--------------------------------+-------------------------------
timestamp                count | timestamp                count | timestamp                count
2009-04-27 08:00:00         1* | 2009-04-27 08:00:00          1 | 2009-04-27 08:00:00          1
2009-04-27 08:50:00          1 | 2009-04-27 08:20:00          1 | 2009-04-27 08:20:00       NULL
2009-04-27 08:55:00         2* | 2009-04-27 08:40:00          1 | 2009-04-27 08:40:00       NULL
2009-04-27 09:55:00         0* | 2009-04-27 09:00:00          2 | 2009-04-27 09:00:00          2
2009-04-27 10:42:00          1 | 2009-04-27 09:20:00          2 | 2009-04-27 09:20:00       NULL
2009-04-27 10:45:00          2 | 2009-04-27 09:40:00          2 | 2009-04-27 09:40:00       NULL
2009-04-27 10:50:00          8 | 2009-04-27 10:00:00          0 | 2009-04-27 10:00:00          0
2009-04-27 10:52:00          7 | 2009-04-27 10:20:00          0 | 2009-04-27 10:20:00       NULL
2009-04-27 10:57:00         6* | 2009-04-27 10:40:00          0 | 2009-04-27 10:40:00       NULL
2009-04-27 12:00:00         2* | 2009-04-27 11:00:00          6 | 2009-04-27 11:00:00          6
                               | 2009-04-27 11:20:00          6 | 2009-04-27 11:20:00       NULL
                               | 2009-04-27 11:40:00          6 | 2009-04-27 11:40:00       NULL
                               | 2009-04-27 12:00:00          2 | 2009-04-27 12:00:00          2

So the sixty-four-thousand-dollar, one-beer question is: why does it do that?  Running each version through EXPLAIN returns exactly the same information, including row counts and index usage.  Converting the pair of timestamp comparisons to a BETWEEN … AND yields a slight improvement (down to just under 6,000ms), but nothing to write home about; that variant is sketched below, after the EXPLAIN.  Here's the EXPLAIN:

id  select_type         table       type    possible_keys           key                     key_len     ref     rows    extra
1   PRIMARY             <derived3>  ALL                                                                         13
3   DERIVED             sequence    range   PRIMARY                 PRIMARY                 4                   13      Using where; Using index
2   DEPENDENT SUBQUERY  count_datum ref     k_seriesId_timestamp    k_seriesId_timestamp    4           const   12224   Using where
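
For reference, the BETWEEN variant of that inner subquery looked roughly like this (reconstructing it here; note that BETWEEN is inclusive at both ends, so the lower boundary is slightly looser than `timestamp > ts.pts`):

    select `count`
    from count_datum
    where seriesId = 47
      and timestamp between ts.pts and ts.ts
    order by timestamp desc
    limit 1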

I'm running MySQL 5.0.45 on CentOS 5.  Behaviour was nearly identical on my old server, which was CentOS 4 with an older MySQL (though I'm not sure exactly which).

Don't Forget to Bounce Sendmail

When you change the hostname of your computer, you need to do two things: add a local alias to Sendmail's config and restart sendmail.  I'm pretty sure I've forgotten step two every time I've set up a new box, and my new server was no exception.  Fortunately, I finally remembered that that was the problem this morning, and now it's forwarding local mail like a champ.
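
For my own future reference, the two steps boil down to something like this on a stock CentOS sendmail install (the hostname is a placeholder; your file layout may differ):

# tell sendmail the new name is local (this is the Cw class file)
echo "newname.example.com" >> /etc/mail/local-host-names

# ...and don't forget to bounce it so the change actually takes effect
/sbin/service sendmail restart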

My Anal Retentiveness

I hate when stuff breaks.  I'll admit to punting on occasion, but I really do try to avoid that whenever possible.  While doing the FeedBurner stuff I went looking for site references, and found one on ColdFusion Bloggers that was quite interesting.  I'm one of the aggregated feeds, as you'd probably imagine, and have been for several years.  The interesting thing is that none of my feed information has been updated since I was originally added, and it still works.

In that time, I've switched blogging engines from MovableType (two versions) to a custom CF/JSP solution (one version) to WordPress (numerous versions).  I've also done four server migrations, including a stint on a really low-end shared PHP host.  The tagline and URLs that ColdFusion Bloggers has aren't current anymore, since they all changed to their current values with the initial move to WordPress, but both the "people" URL and the feed URL are still valid, and so I keep getting aggregated.

I also still support all of the URL formats for ranking PotD, including those that were implemented in PHP on that same shared host while the actual app was running on a workstation at my house with no inbound connectivity (PHP would store the rankings locally and CF would batch-download them periodically).  While writing this post, I scanned my access logs to see if it mattered, and actually found a couple hits to that format, even though I haven't sent out a link in that format in at least a couple years.  The only one I don't fully support uses a DynDNS domain that pointed at my home network – it works, but it gives an SSL certificate mismatch warning, and I don't think it's worth the cost to get a whole new cert just for that.

I don't really have a point.  Just nice to see a concrete example of my desire to keep things compatible actually paying off.

Now Running FeedBurner

I set up FeedBurner for my main blog XML feed this afternoon.  Pretty painless, and it appears that everything is all hunky-dory.  However, I figured I'd mention it in case there are problems, so you'll know why.

Here's the Apache config I used (wrapped for readability, with a backslash continuing the RewriteRule onto a second line):

RewriteCond  %{HTTP_USER_AGENT}  !FeedBurner  [NC]
RewriteRule  ^/barneyblog/(feed(/atom|/rss)?/?|wp-rss2?.php)$ \
             http://feeds2.feedburner.com/barneyblog/  [P,L]

Why use mod_rewrite instead of one of several WordPress plugins, you ask?

First, the one linked to by FeedBurner itself (and which it hosts) returns a 404.  If it had worked, I probably would have just left it, but it didn't.  I didn't bother to go searching, since mod_rewrite can do the dirty work.

Second, I didn't really want to do the redirect that a plugin would have used.  If you go hit my feed at http://www.barneyb.com/barneyblog/feed/, you'll notice that it appears to be self-hosted, and yet has FeedBurner content.  That URL has always been my feed URL, and it will continue to be so.  It's all about that magic "P" in the RewriteRule.  Flip it to an "R" and you have the "normal" behaviour of redirecting over to FeedBurner, at the expense of no longer controlling the URL.  With the redirect, anyone who isn't careful (which is anyone who doesn't use autodiscovery, and probably a lot of apps that do use it) will inadvertently get a non-canonical URL for the feed, and get it in the pants if it ever changes (regardless of whether the canonical URL remains constant).  With proxying, I've protected myself from that.  And yes, I know the "L" is redundant; you already know I'm weird.
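
For contrast, here's roughly what the redirect flavor would look like; it's the same pattern with the flag swapped, and not config I actually ran:

RewriteCond  %{HTTP_USER_AGENT}  !FeedBurner  [NC]
# [R] sends clients a redirect, so they end up holding the FeedBurner URL
RewriteRule  ^/barneyblog/(feed(/atom|/rss)?/?|wp-rss2?.php)$ \
             http://feeds2.feedburner.com/barneyblog/  [R,L]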

GoodBye Misbehaving Spiders

Alright, all you misbehaving spiders that don't flush your DNS cache (I'm looking at you, Yahoo! Slurp and Adobe Feeds): I'm killing my old box today.  So while you're currently being transparently proxied through to the new one because you won't respect my TTLs in DNS, that's gonna stop.  Here's the list of user agents that are still hitting my old box even though DNS (with one-hour TTLs) was updated 36+ hours ago:

Adobe Feeds
AideRSS/2.0 (aiderss.com)
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Baiduspider+(+http://www.baidu.jp/spider/)
ColdFusion
Gigabot/3.0 (http://www.gigablast.com/spider.html)
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; GTB6; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2; MS-RTC EA 2; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; MSOffice 12)
Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/customer_webtxt_02.jsp)
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
R6_CommentReader(www.radian6.com/crawler)
R6_FeedFetcher(www.radian6.com/crawler)
SkyGrid/1.0 (+http://skygrid.com/partners)
Sosoblogspider+(+http://help.soso.com/soso-blog-spider.htm)
Sosospider+(+http://help.soso.com/webspider.htm)
SurveyBot/2.3 (Whois Source)
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)

Unindex Search Plugin

Even though I use Google Custom Search for my blog, I get a lot of WordPress's internal search results pages indexed by robots.  Didn't want that to happen anymore, so I wrote a simple plugin to address the issue.  Maybe someone will find it useful:

<?php
/*
Plugin Name: Unindex Search
Plugin URI: http://www.barneyb.com/
Description: Adds a NOINDEX meta tag for search result pages
Author: Barney Boisvert
Version: 0.1
*/
add_action('wp_head', 'unindexsearch_head');
function unindexsearch_head() {
  if (is_search()) {
    ?><meta name="robots" content="NOINDEX" />
    <?php
  }
}
?>

It simply checks whether the current page is a search results page and, if so, writes out the META tag.

I Love Apache

So after I got my new server online, I wanted to deal with my oversight on DNS TTLs.  Not surprisingly, mod_rewrite saved the day again.  First, since my old server is subject to the same TTLs, I added a record to its /etc/hosts file to point all of my hostnames at the new IP.  Then I changed my Apache config to run this rule set for all hosts:

RewriteEngine   On
RewriteCond     %{HTTP_HOST}   (.+)
RewriteRule     ^/(.*)$        http://%1/$1    [P,L]

Done.  That simply proxies any request with a Host header through to the exact same URL, except that this time the DNS lookup is done by the old server, which has the /etc/hosts entries to help it find the right place, instead of by the browser with its stale DNS cache.  Works like a champ.
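
For completeness, the /etc/hosts entry on the old box is something along these lines (the IP shown is a documentation placeholder, not my real one):

# force the old server's own lookups for my domains to the new box
203.0.113.10    www.barneyb.com barneyb.com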

Back Up

Back alive again.  Pretty smooth, considering I'd done zero prep and didn't even have ColdFusion or PHP installed 90 minutes before killing my Apache.  I'm sure there will be all kinds of random little issues, but hopefully they're minimal, since I by and large just rsynced everything across the wire.  I did manage to do a little cleanup on the blog stuff that I've been needing to do for two years, so that's nice.  Still on my hacked-together, Ant-based WordPress, though.

I'm Going Down

And not the good kind. In an hour or so I'm shutting this server down and migrating to another. Routing issues continue, and no resolution. Hopefully I'll be offline for 15-20 minutes, but who knows. I'm going totally blind because I really don't have any other options. See you on the flip side…

The Saga Continues

So my routing issues to my old box are still in force.  Sporadic, inconsistent, the whole mess.  Their NetOps team seems to think it's something on my box, but I don't see how it could be when certain routes always work while the offending route fails only sometimes.  I've got a single NIC, a single network cable, a single IP, and an iptables configuration that is devoid of any rand() calls.

So I'm thinking my grand plan, which was to carefully migrate my apps from the old server to the new one, one at a time, and clean up some of the cruft that has built up over the last couple years, might go flying out the window.  Now I'm thinking I'm going to turn off all the apps on the old box, do an rsync of everything, repoint DNS, and fight through anything that doesn't work as expected.  Hardly elegant, but it'll at least get me to a box that is consistently accessible from the internet.
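
The brute-force version of that plan is nothing fancy; roughly something like this, with the paths and hostname purely illustrative:

# turn off the apps on the old box first, then mirror everything across
rsync -az --delete /var/www/ newbox.example.com:/var/www/
rsync -az --delete /etc/httpd/ newbox.example.com:/etc/httpd/
# then repoint DNS and fight through whatever breaks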

It also would alleviate the need to deal with my WordPress quandary.  After a weekend of reflection, the right solution seems to be patching WordPress so that it's well behaved (or at least better behaved), and then supplying the patch back to Automattic and hoping for the best.  I'm a competent developer, and while I'd like to just use it out of the box, the benefit of open source is exactly this option, and as an open source proponent, I feel at least minimally obligated to leverage that option on principle alone.

That said, I don't know if I can deal with an only partially accessible box for much longer.  I'm going to the symphony tonight, so I'm not going to be doing any admin work until at least tomorrow, but if the routing issues aren't resolved by then, I think I'm just going to punt.  I hate punting.