Shoot the Engineers

About a week ago, Marc Funaro wrote an interesting blog post about CFML and OO.  The prevailing opinion (via Twitter, blogs, etc) is that Marc is incorrect/inaccurate/inexperienced/whatever, and I disagree completely.  He hit the nail on the head.

HTTP is a stateless, request-response environment.  Nearly all web applications interface with a SQL database, which is also a predominantly stateless request-response environment.  Both are orthogonal to the core OO principle of interacting stateful objects.  That model is far closer to the FP (functional programming) paradigm, though, particularly on the SQL side, it still doesn't match completely.

To use OO in a SQL-backed web app, you hide the mismatches with ORM, an object-based Front Controller implementation, and session facades.  As Marc points out, Java works pretty well in this paradigm for two main reasons:  Java is crazy fast, and Java developers have invested ridiculous amounts of effort in tooling to support this model.  CF has neither of these advantages.  I'm not belittling the effort poured into various frameworks (Fusebox, Model-Glue, ColdSpring, Transfer, etc.); I'm just saying that they are significantly behind what is available to Java developers.

Unlike Marc, I happen to think that a Front Controller framework is essential, but I don't use an OO one for exactly the reasons he outlines.  I built FB3lite for just this purpose: 70 lines of straightforward procedural code that help me enormously with certain common tasks.  I often masquerade my apps as standalone pages with mod_rewrite (converting /viewUser.html into /index.cfm?do=viewUser), but that's a cheat.

I also use CFCs and ColdSpring for my business tier, but no object (domain) model for me.  The CFCs are really just glorified function libraries that I can use ColdSpring's AOP engine to wrap transactions around without having to manage them explicitly in my code.  In order to get the AOP I have to use CFCs, and I like the namespacing they provide (so I can have a 'doThing' method in multiple namespaces without conflict), but there is no real OO-ness there.

I know what you're saying.

Yes, I often preach the benefits of OO and encourage people to learn about it and use it.  But using a howitzer to hunt mice in your garage is not a clever idea.  If I'm writing Java (or Groovy), I'm going to use OO structures, but that's because of the programming environment.  I am a pragmatic person.  I like to learn about a wide array of tools and then use as few of them as possible, knowing that there are other options available if I need them.

Yes, I built CFGroovy with Hibernate support so I could use ORM in my CFML apps via Groovy objects.  It provides the best of both worlds: the speed and tooling of Java in a CFML environment.  That approach works quite well, but if I don't need the complexity, I'm not going to do it.

My First cf.objective()

I know I'm late to the "cf.objective() recap" party, but I've been both crazy busy and rather tired, so I haven't got to it until now.

First, I'd never been to Minneapolis before, and from the little I saw, it's a pretty nice place.  Obviously I missed the "buried under snow" part, and that definitely puts a damper on it as a potential home, but I liked it.  Very walkable, clean, and aside from the second-story causeways between the buildings, a nice aesthetic overall.  The hotel was in a great spot, with a pretty varied selection of dining and drinking establishments within easy walking distance.

Before I got there, I hadn't quite internalized how small a 200-person conference actually is.  "Social" is a skill I didn't inherit from my father, unfortunately, but with the number of people I knew already, I didn't feel nearly as isolated as I often do at CFUNITED (which is five times the size).

The sessions were pretty good, overall.  I didn't get to go to several that I would have liked to because of scheduling, but c'est la vie.  Here's a rundown of the notable ones I attended:

Adobe's keynote the first day was interesting, and I might be mixing it in with some other Adobe presentations, but it was quite fascinating to see the crowd's reaction to certain Centaur features.  CFFINALLY and CFCONTINUE?  Nothing.  It should be noted that I was the only one to applaud them last year at CFUNITED.  Remote diff of server configuration?  Huge applause.  WTF?!?  Script your production environments, people.  If they're ever out of sync, you're doing your job wrong.  ORM stuff got much applause, of course, and rightfully so.  Drag and drop, full-stack scaffolding also did.  Do people actually use that?  Great marketing/sales tool, no question, but for actual applications?!?!  But I digress…

Marc Escher's talk on unit testing was quite interesting, I thought.  I've tried numerous times, with numerous technologies, to really embrace unit testing and failed every time.  I actually had the best luck doing it with Flex, which just drips with irony.  I'm not predicting success next time I attempt it, but I'm confident I'll do better than last time.  In a similar vein, Sean Corfield's talk on cf.spec provided some nice pointers.  I'm not too sure about the "readable" spec document concept, but it's an interesting technique.  Until you can have exactly one spec document, I'm not sure of the utility, but I think that's really an editor/syntax problem, not a conceptual one.

Mark Mandel's intro to Transfer was quite interesting as well.  That I attended might surprise you if you're familiar with the various Hibernate projects I've worked on, but ORM is still voodoo in my mind.  Coming back to the basics and being "introduced" to ORM from the ground up is always interesting, because the subtleties in interpretation provide a great introspective of ORM as a whole.  The odds of me picking up Transfer and using it on a "real" project are pretty small, but I didn't go to learn about Transfer in particular, more about ORM in general.

Let me be clear on this: Transfer is amazing.  It does things with CFML that I would have sworn were impossible, and does them fast enough to be perfectly serviceable.  It's just not the tool that fits my style.  I've been a Hibernate user for many years, and that's a hard framework to supersede.  Honestly, I bet I'll never replace Hibernate with another ORM solution, but instead replace it with an alternate approach (an object database, for example).

As you might expect, I also went to Adobe's talks about the new ORM functionality coming in Centaur.  When I first was exposed to their Hibernate implementation, I was pretty skeptical.  There seemed to be a global misunderstanding of both the technology and the problem it was designed to solve, but that has turned around 180 degrees, and Centaur looks to have pretty robust ORM capabilities.  I've got a major bone to pick with how Adobe is marketing the functionality, but the actual implementation looks pretty sound.  It's hard to get a complete picture with the pre-release secrecy, but I'm a lot more excited about it than I was 6 months ago.

Even more exciting is what will hopefully be coming out of Railo/JBoss in the coming months.  There's been no formal talk of what that looks like yet, and it's probably a safe bet that it'll be similar to ColdFusion's implementation (for obvious reasons), but with Railo and Hibernate both under the JBoss umbrella, I think there's some cool stuff on the way.  Obviously any speculation is just that, but with Railo supplanting ColdFusion in a lot of places I use CFML, I'm understandably excited about it.

The last session was Adam Haskell's talk on mentoring and code review.  That is a misnomer of a title, if you ask me, because while he did talk about that stuff, the point was really about team dynamics.  Working on a team is hard.  Working in a "get things done" environment only makes it worse.  Fostering the team, particularly around helping junior developers move up in the world, takes time and effort, but it's worth it.  I think Adam did a good job of emphasising that any sort of formal process is less effective than an equivalent informal process.  Informal is inherently more personal, and with the typically sterile world of technology and computers, the "personal" stuff is really important.

Of course, the big draw of any conference (though the hardest to justify) is the meals/drinks/etc. that happen outside the actual conference.  (It seems like I just said the same thing two sentences in a row, completely by accident.)  With the conference as small as it was, the informal socialization was a lot tighter, I thought.  Far less spreading of groups, and so more churn within them.

I also really liked the way they did lunch, with actual table service, rather than a buffet.  With a more formal meal, you end up sitting and talking with fewer people, but for a longer period of time.  Aside from fostering more involved conversation, it also provides a nice break from the chaos to resettle for the afternoon.

Talking with other developers is always fun, and typically the source of the best tidbits of information.  I always like learning about stuff, even if it has no direct applicability, because it gets your mind thinking in ways it otherwise wouldn't.  And it seems to happen pretty often that 6 months down the road one of those random bits of information suddenly becomes fairly relevant.  Maybe not directly, but it at least opens my eyes to some potential approach I wouldn't have otherwise considered.

Great conference, overall.  I gotta hand it to Jared and his team.

FB3lite as a Custom Tag

Last night at work Koen uncovered an issue with using FB3lite as a custom tag.  Inside the tag it does "formUrl2Attributes" to merge the two scopes into the attributes scope.  What I'd done incorrectly was omit the "don't override" parameter to the structAppend calls, so the URL and FORM scopes would supersede any existing attributes for the invocation.

During normal execution this is irrelevant.  It's similarly irrelevant if you don't have a collision between FORM/URL parameters and custom tag attributes.  However, if you do have a collision, the FORM/URL parameter wins, which is clearly incorrect.

To fix this, I added the missing third parameter to structAppend, as well as reversing the lines (so FORM still overrides URL as it always has).  You can grab the source directly, or pull from Subversion.
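Here's a rough sketch of that merge order, in Python rather than CFML so it can run standalone.  The struct_append helper is a hypothetical stand-in for structAppend with its overwrite flag, and form_url_to_attributes is my name for the illustration, not FB3lite's actual source.  It also ignores that CFML struct keys are case-insensitive.

```python
def struct_append(target, source, overwrite=True):
    """Hypothetical stand-in for CFML's structAppend(target, source, overwrite)."""
    for key, value in source.items():
        if overwrite or key not in target:
            target[key] = value
    return target


def form_url_to_attributes(attributes, form, url):
    # Append FORM first, then URL, never overwriting existing keys:
    # explicit tag attributes win, and FORM still beats URL on a collision.
    struct_append(attributes, form, overwrite=False)
    struct_append(attributes, url, overwrite=False)
    return attributes


merged = form_url_to_attributes(
    attributes={"do": "viewUser"},
    form={"do": "deleteUser", "id": "7"},
    url={"id": "42", "page": "2"},
)
print(merged)
```

The tag's own "do" attribute survives, FORM's "id" beats URL's, and URL's "page" fills the remaining gap.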

Effective Photo Manipulation

Image manipulation is a common task for web applications, usually centered around creating and managing thumbnails (of photos, PDFs, videos, whatever).  Photo manipulation is a subset of image manipulation, and has a couple aspects that differentiate it from other types.

First and foremost, photo quality is of high importance.  Contrast this with creating a thumbnail of a PDF (especially a text-heavy one); there's no way you can be representative of the original's detail.  A photo's thumbnail, however, must be as representative as possible.  Unfortunately, photos are typically encoded as JPG files, which use a lossy compression algorithm, which means that every time you manipulate them in any way, you're reducing the quality.

Photos are also often subject to user-editing (cropping, annotating, etc.), and while this is commonly done with dedicated photo editing tools (installed on the user's computer), plenty of web apps do it too.  I'm not thinking Photoshop Web, here, I'm talking about your site's photo gallery software that lets you upload originals and rotate them so they're facing the right way.

Between these two things, you (the developer) can get stuck in a bind.  You need to be able to edit photos repeatedly, but at the same time maintain quality.  The trick is to only ever edit a photo once.  Seriously.

This is simpler than it sounds.  The trick is to start thinking about designing your app in terms of operations (rotating) instead of states (rotated).

When you get your hands on the original image, put it somewhere safe.  That's the only image you're ever going to deal with.  When your user comes and rotates it, write the specifics of that operation to the database, and then take your original and run all the stored operations on it to create the "current" view.  Then when they crop it, write the operation to the database, and then take your original and run all the stored operations on it to create the "current" view.  You can see where I'm going here.

Same thing goes for thumbnails.  If you need a 150×150 thumbnail of a photo, don't take the "current" view and resize it; take the original, run all the stored operations on it, and tack on a "resize to 150×150" at the end to create the thumbnail.
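To make the replay idea concrete, here's a minimal sketch in Python.  Everything in it is an assumption for illustration: the "image" is just a list of pixel rows so the example stays self-contained, the operation names and the render function are made up, and the stored operations live in a plain list instead of a database.  A real app would decode the original once, apply every operation in memory, and encode once at the end.

```python
def rotate_cw(pixels):
    """Rotate the pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


def crop(pixels, left, top, width, height):
    """Keep a width x height window starting at (left, top)."""
    return [row[left:left + width] for row in pixels[top:top + height]]


def resize(pixels, width, height):
    """Nearest-neighbour resize, good enough for a sketch."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


OPERATIONS = {"rotate_cw": rotate_cw, "crop": crop, "resize": resize}


def render(original, stored_ops, extra_ops=()):
    """Replay every stored operation against a copy of the untouched original."""
    image = [list(row) for row in original]
    for name, kwargs in list(stored_ops) + list(extra_ops):
        image = OPERATIONS[name](image, **kwargs)
    return image


original = [[1, 2, 3],
            [4, 5, 6]]

# Each user edit is recorded as an operation, never as a saved result.
ops = [("rotate_cw", {})]
ops.append(("crop", {"left": 0, "top": 0, "width": 2, "height": 2}))

current = render(original, ops)
thumbnail = render(original, ops, extra_ops=[("resize", {"width": 1, "height": 1})])
```

Note that the thumbnail is built from the original plus the full operation list, not from the "current" view.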

This works because you only run through the lossy compression once, so your generated views stay at a noticeably higher quality than if you save after each operation.  It might not be apparent in a 50×50 thumbnail, but with two operations (rotate then crop) on a full-size image, the difference in quality between doing them together versus encoding in between is quite apparent.

Of course, you can't do this sort of processing on demand, so you have to store the results of the transformations.  The key is to never read those in to do the next step, always start from the original and play back all the manipulations in sequence.  This mechanism fits very nicely with the mod_rewrite-based caching mechanism I wrote about a couple days ago.  When a new operation is saved, wipe out all the cached versions of the photo and let them rebuild as needed.

This isn't a particularly revolutionary idea, and I know many photo tools (e.g. Picasa) do things this way, but I've found that it's worth the extra effort.  Obviously its utility is predicated on having decent originals, but believe me, your users will notice.

Efficient Caching With mod_rewrite

Caching with mod_rewrite?  What?  I'll admit it's a slightly misleading title; the cache is actually a disk cache, but mod_rewrite is where the magic happens.  Bear with me for a moment…

Most content on the web is fairly static.  Some of it changes every few minutes, some changes every few hours, some changes a few times a month, and the vast majority of it changes approximately never.  However, a large percentage of it is generated dynamically, every request.  Maybe it's news articles, maybe it's thumbnails for images/pdfs/videos, maybe it's RSS feeds, but identical content is dynamically generated over and over again.  Huge waste of resources.

On the flip side, you can use pre-generation to build stuff ahead of time so you can serve everything statically.  However, that can be ridiculously expensive as well.  For example, my blog has several hundred (if not thousands) distinct feeds available on it.  The main one (listing posts), one per category (posts), one per author (posts), the main comment feed (listing comments), and one per post (comments).  Each of those is available in RSS 2.0, Atom 0.3, and RSS 0.92 formats.  Pregenerating those all the time is silly, because the vast majority of them will never be accessed, let alone frequently.

Ideally, we'd be able to generate these resources dynamically, on demand, but then keep the output around to serve back statically for subsequent requests.  This saves us the expense of pregenerating lots of stuff that will never be accessed, but gives us the speed of static access after the first request.

Duh, Barney, what's your point?

My point is that this is not only the obvious solution conceptually, it's also ridiculously trivial to implement.  It'll take longer to read this post than to set it up.  As such, there's no excuse for being resource constrained on non-user-specific resources, even though this seems to be a really common complaint.

Here's a more concrete example.  Say I host photo galleries, allowing people to upload their full-size images, and I provide several views of the galleries with appropriate thumbnails.  Those pages are littered with things like this:

<img src="/gen_tn.cfm?id=12345&width=100&height=100" />

This is great, because I can create arbitrarily sized thumbnails without having to go back and regenerate them for all existing photos.  That's handy when I create a new layout and realize I want 125×125 thumbnails instead of 100×100, and then want to use 250×250 for the 'featured' section.  But I'm generating the thumbnails dynamically every request, which is a waste.  And adding caching in gen_tn.cfm is the wrong answer.  : )

First, let's change the URLs in the pages to look like this:

<img src="/tn/p12345-100x100.jpg" />

Same information as before, just packaged differently.  Then I'll use the following RewriteRule to (internally) turn it back into the original request to gen_tn.cfm (effectively a no-op):

RewriteRule  ^/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg$  /gen_tn.cfm?id=$1&width=$2&height=$3  [PT,L]
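If the regex is hard to read, here's the same mapping as a small Python sketch.  It only mimics what the rule's pattern and substitution do to the path (the rewrite function name is mine); it says nothing about the [PT,L] flags or how Apache hands the request on.

```python
import re

# Same pattern as the RewriteRule: photo ID, then width x height.
TN_PATTERN = re.compile(r"^/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg$")


def rewrite(path):
    """Return the internal gen_tn.cfm URL, or the path unchanged if it doesn't match."""
    match = TN_PATTERN.match(path)
    if match is None:
        return path
    photo_id, width, height = match.groups()
    return f"/gen_tn.cfm?id={photo_id}&width={width}&height={height}"


print(rewrite("/tn/p12345-100x100.jpg"))
# /gen_tn.cfm?id=12345&width=100&height=100
```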

Lipstick on a pig, you might say, and you'd be almost right.  We now have normal-looking URLs for our thumbnails (lipstick), but they're still dynamically generated every request (on a pig).  This abstraction, however, is incredibly powerful.  Let's add a RewriteCond in front of that rule real quick:

RewriteCond  %{REQUEST_FILENAME}                     !-s
RewriteRule  ^/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg$  /gen_tn.cfm?id=$1&width=$2&height=$3  [PT,L]

That says to only do the RewriteRule if the requested file doesn't exist or is zero length ('-s' says a regular file with non-zero length, the '!' negates it).  Next step is to create the 'tn' directory in your web root and ensure it's writable by your application server.  You can probably see where I'm going with this…

The final step is to tweak gen_tn.cfm slightly.  Currently, it creates the thumbnail and serves it back to the client.  We need to change it so that before serving it back, it writes it to disk in that new 'tn' directory, using the appropriate filename.  Once that's done, send it to the client as usual.  The next time the thumbnail is requested, Apache will hit the RewriteRule, but the RewriteCond will not match (because the file exists and has length).  As such, it won't be rewritten to gen_tn.cfm, and will instead be served statically, directly from disk, bypassing the application server completely.
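Here's a sketch of both halves of that flow, in Python rather than CFML so it runs on its own.  The function names, the web_root parameter, and the fake thumbnail generator are all assumptions for illustration.  Writing to a temporary file and renaming it is my own precaution, not something the setup above requires; it keeps Apache from ever seeing a half-written file that already passes the '-s' test.

```python
import os
import tempfile


def needs_rewrite(web_root, url_path):
    """Mimic `RewriteCond %{REQUEST_FILENAME} !-s`: true unless a non-empty file exists."""
    filename = os.path.join(web_root, url_path.lstrip("/"))
    return not (os.path.isfile(filename) and os.path.getsize(filename) > 0)


def gen_tn(web_root, photo_id, width, height, make_thumbnail):
    """Generate the thumbnail, store it under tn/, then return the bytes to serve."""
    data = make_thumbnail(photo_id, width, height)
    tn_dir = os.path.join(web_root, "tn")
    os.makedirs(tn_dir, exist_ok=True)
    target = os.path.join(tn_dir, f"p{photo_id}-{width}x{height}.jpg")
    # Write to a temp file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=tn_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.replace(tmp_path, target)
    return data


# Demo with a throwaway web root and a fake generator that counts its calls.
web_root = tempfile.mkdtemp()
calls = []


def fake_thumbnail(photo_id, width, height):
    calls.append((photo_id, width, height))
    return b"fake-jpeg-bytes"


url = "/tn/p12345-100x100.jpg"
first_request_hits_app = needs_rewrite(web_root, url)
if first_request_hits_app:
    gen_tn(web_root, 12345, 100, 100, fake_thumbnail)
second_request_hits_app = needs_rewrite(web_root, url)
```

The first request goes through to the application; the second finds the file on disk and never gets rewritten.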

With those couple simple changes, you suddenly have a ridiculously effective caching mechanism in place.

What about changes to the source, though?  You realize one of your photos (#12345) was miscropped, so you fix it and upload a new version, but you want your thumbnails to be regenerated too.  Fortunately, flushing the cache is as simple as deleting all files in 'tn' that match '*p12345*.jpg'.

Same thing goes for deletions.  If you decide you just want to remove photo #12345 completely and want to remove the thumbnails too, run the same deletion of '*p12345*.jpg' from the 'tn' directory.  Or if you stop using 100 pixel thumbnails (like when I switched to 125×125 a few paragraphs ago), you can just delete '*100x100*.jpg'.

Because you're using the filenames as an index of sorts, it means you have to name your files carefully.  The filename needs to contain not only everything to uniquely specify the file (photo ID, width, and height in this example), but also everything that you might want to use for clearing the cache.  For example, if you need the ability to clear based on gallery ID you'd need to change the URL to '/tn/g123-p12345-125x125.jpg' or something.  In this case the gallery ID isn't needed for unique specification, only for flush selection.
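The flush itself can be sketched in a few lines of Python.  The flush helper and the temp directory are assumptions for illustration.  One thing worth noting: I've used '*p12345-*.jpg', with the trailing dash, because a bare '*p12345*.jpg' would also catch photo #123456.

```python
import glob
import os
import tempfile


def flush(tn_dir, pattern):
    """Delete every cached file in tn_dir matching the glob; return the names removed."""
    removed = []
    for path in glob.glob(os.path.join(glob.escape(tn_dir), pattern)):
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)


# Build a throwaway cache directory with a few fake thumbnails in it.
tn_dir = tempfile.mkdtemp()
for name in ("p12345-100x100.jpg", "p12345-125x125.jpg",
             "p123456-100x100.jpg", "p777-100x100.jpg"):
    with open(os.path.join(tn_dir, name), "wb") as handle:
        handle.write(b"x")

# Photo 12345 changed or was deleted: drop every size of it.
removed_for_photo = flush(tn_dir, "*p12345-*.jpg")

# The 100x100 size was retired: drop it for every remaining photo.
removed_for_size = flush(tn_dir, "*-100x100.jpg")
```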

The net of this is that you can hit that sweet spot: avoiding any extra work generating resources that aren't accessed, and never generating the same resource more than once.  Obviously the first request to a resource has to wait for generation, so this technique isn't suitable for all use cases, but it covers a huge swath of them.  It's especially well suited to situations where you have a large number of resources with relatively light usage across them, and/or you need the ability to change the derived resources' specifications (e.g. new thumbnail dimensions or new XML feed formats).

As you'd imagine, PotD (NSFW, OMM) uses this technique extensively for several classes of thumbnails as well as RSS feeds.  It also does some pre-generation where the first-request delay is unacceptable.  I also used this to great effect at my previous employer's for front-end caching of CMS-generated HTML pages.  We handled hundreds of millions of pages per day on a pair of single-P4 servers with 1GB of RAM each, with an average cache life of between two and four hours.

One significant gotcha is that you only get full-request caching with this technique.  I.e. you can't cache portions of a request's response, because it's either fully dynamic (the first request) or fully static (subsequent requests).  For example, most blogs have a "remember me" feature so you don't have to enter your information each time you want to comment.  In order to beat this, you need some sort of two-phase generation where the cache happens between the phases, and that means you have to have your application running "above" the cache.  Ajax can be used as the second phase, but that's a disaster waiting to happen, if you ask me.

Minor FB3lite Update (and a Weird CF Bug)

This evening while adding some reporting to PotD (NSFW, OMM) to help nail down some performance issues that I think are Apache's fault, I noticed a strange issue with FB3lite.  If you've used it, you know the core of the "framework" is the pair of do() and include() UDFs.  Both contain a CFINCLUDE tag, and a weird situation arises with scoping.

CFML has implicit scope traversal, so if you have an unscoped variable, it will automatically traverse across a bunch of scopes until it finds one with the right name.  Further, you get an implicit local scope within UDFs, and you get a magic pseudo-scope within query-based CFOUTPUT and CFLOOP tags.

What I noticed was that looping over a query with a column named "template" always displayed the currently executing template, instead of the value from the query.  No worries, I thought, since prefixing it with the query name solved the issue.  "template" isn't a common column name for me, so while it surprised me I'd never noticed or heard about this magic "template" variable before, I didn't think too much of it.

Then a while later I realized it was my own fault, because of the include() UDF in FB3lite.  The argument to the UDF is named "template", and CFML places the implicit local scope at the top of the heap, including above the magic query-loop scope.  That argument was shadowing my query variable.

I've fixed FB3lite to use a prefix on all its UDF arguments, so this problem cannot manifest itself anymore.  Well, I suppose it could, but you'd have to have a weird-ass variable name (e.g., "_fb3lite_template").  You can download the latest version, or pull it from Subversion.
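The shadowing is easy to model outside CFML.  This Python sketch uses ChainMap as a stand-in for scope traversal and only models the two scopes involved here (the UDF's local scope and the query-loop scope); CFML's real lookup chain has more scopes than this, and the values are made up.

```python
from collections import ChainMap


def resolve(name, *scopes):
    """Look up an unscoped name, checking scopes in order; the first match wins."""
    return ChainMap(*scopes)[name]


# The current row of a query loop, which has a column named "template".
query_row = {"template": "header.cfm", "hits": 12}

# Before the fix: include()'s argument is named "template", and the
# UDF's local scope is searched ahead of the query-loop scope.
local_scope = {"template": "/views/report.cfm"}
shadowed = resolve("template", local_scope, query_row)

# After the fix: the argument carries a prefix, so the column is visible again.
local_scope_fixed = {"_fb3lite_template": "/views/report.cfm"}
visible = resolve("template", local_scope_fixed, query_row)

print(shadowed)  # /views/report.cfm
print(visible)   # header.cfm
```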

CFGroovy And Script Output

One final CFGroovy update for cf.objective() 2009, and then heading to a plane.

In the first version of CFGroovy, output generated by your script was discarded.  This fit the original concept behind scriptlets well, but it made debugging kind of a pain, because you had to exit the script in order to dump out any state.  With the CFGroovy2 implementation, output from your script is passed off to the normal page output, just as you'd expect.  The demo (source or output) now showcases this using in-script emission of the list that it builds, rather than CFDUMPing it between scripts.

PotD For The World

Last night was sort of the release of Pic of the Day (not safe for work, or my mom) into the wild.  The project is a couple months shy of five years old, and while I've talked about it obliquely all over the place, I've never really publicized it directly.  I'd made the assumption that it was just a shared secret because it comes up in conversation over beers quite frequently, and I do actually refer to it by name occasionally, but I was quite wrong.  I brought my little MOO "business cards" to distribute, and I was amazed at the response.

For some background, PotD spiders the internet looking for pictures, downloads them, and then sends them to people, one pic per day.  Subscribers can rate each picture, right from their email, and the system learns what they want and tries to send them more of it.  It started as a joke; a friend and I thought it'd be funny to spam a third friend's inbox with dirty pictures, just for the hell of it.  Five years later, I have a fairly robust prioritization engine that I've had great fun building.

Obviously handing out business cards with dirty pictures on the back is going to be a conversation starter, but it was really interesting to talk about all the minutiae of the application.  I got a number of very interesting suggestions to add to my queue of things to think about.  The most common comment/question was about monetization.  I'm sure I could make a killing if I wanted to, but PotD is a hobby.  It's something I do for fun, in my free time.  The primary reason I do any sort of publicizing is because the data analysis only becomes relevant as the subscriber population grows, and that's really the fun part.  And since it's free (and the interface is largely asynchronous – via email), I don't have to worry about ensuring it's completely stable, error free, and available all the time.  That makes hacking a lot more fun because, let's face it, stability, error handling, and availability aren't typically the "fun" part of application development.

What was perhaps more interesting was the amount of stuff I've done that has PotD as its sole impetus (or at least primary impetus).  If you go look at my projects page, seven of the eleven projects were created purely for PotD, most notably TransactionAdvice, SchemaTool, FB3Lite, and Amazon S3 integration.  The other three are ComboBox, FlexChart, and jQuery Checkbox Range Selection.  CFGroovy had its impetus in PotD as well, though the Hibernate aspects quickly grew (out of proportion, in hindsight), and I've not actually used it in PotD beyond a couple trivial spots.  Beyond the actual projects, everything I do with SVG, Batik and Weka (a data mining package), plus a lot of Google Charts stuff and most of those damned query performance issues are all PotD.

Over the years, the application has lived on four different servers, including one that was accessible only via an asynchronous proxy written in PHP.  Yeah, really.  It's pure Adobe ColdFusion, starting on CF7, but now on CF8.0.1.  FB3Lite is the front controller, ColdSpring is used for all the DI/AOP needs, and the codebase is largely procedural even though the majority of it is packaged as CFCs.  Excluding third-party code, there are 129 CFM files (9,760 lines), 53 CFCs (15,794 lines), and 12 JS files (2,402 lines).  The database has about 650K data records, along with another 1.4M records that are "non-data", if you will (log tables, lookup tables, etc.).  None of these numbers are particularly sizable, and the application itself is far from the largest I've worked on, but I'd say it's the most complex because of how many different pieces there are, the variety of jobs they do, and the level of automation in the various data flows.

So welcome to the world, Pic of the Day.

CFYourFavoriteLanguage (Formerly CFGroovy)

CFGroovy grew some wings this afternoon.  It retains its core functionality of running Groovy code in a CFML environment, whether you have it installed on your classpath or it's transparently loaded from the local copy of the JAR.  However, it now supports any JSR 223 scripting language as well (assuming you're on a 1.6 or newer JVM).  Of the various choices, Groovy seems the best fit for CFML developers (hence the focus on this language), but I also tested Python (via Jython) and PHP (via Quercus).

Of the CFML engines, Railo 3.1 was the champ, running all three guest languages flawlessly.  ColdFusion 8.0.1 refused to run the Python example, not really sure why.  Open BlueDragon refused both Python and PHP.  All three run Groovy, of course, even with the conversion to use the JSR 223 interface (for consistency) instead of the "normal" GroovyClassLoader interface.

You can access any installed languages via the new 'lang' (or 'language') attribute; Groovy remains the default, of course.  Here's an example for PHP:

<g:script lang="php">
<?php
  $variables["myArray"][] = "it's some PHP";
?>
</g:script>

The empty brackets mean "create a new item at the end", so that line appends a string to the named array.

Latest mods are in Subversion, of course.

More CFGroovy2 Goodness

Last night at dinner I was talking with Mark Mandel and Luis Majano and realized I'd completely misunderstood the way JavaLoader worked based on my initial look see.  So for the price of 21 additional lines (nine of which are purely for misbehaving CFML runtimes), CFGroovy will transparently load an internal copy of Groovy if it can't find one on the classpath.

I've created a branch in Subversion to house the new version at https://ssl.barneyb.com/svn/barneyb/cfgroovy/branches/cfgroovy2/engine/.  It's organized the same way as the trunk, so there is a ../demo/ directory that contains a trivial demo application.  Here's the demo template, so you can get a feel for how easy CFGroovy is to use:

<cfimport prefix="g" taglib="engine" />

<cfset variables.myArray = listToArray("barney is tall,CFML is taggy,Groovy is AWESOME!") />

<cfoutput>
<h1>Inline Groovy</h1>

<p>This demo creates an array of strings, CFDUMPs it, uses some inline
Groovy (via <code>&lt;g:script&gt;</code>) to add a few more, and
then CFDUMPs it again.
</p>

<h1>Only Three</h1>
<cfdump var="#variables.myArray#" />

<g:script>
// better add emery
variables.myArray.add("emery")
// and some other stuff, using some other syntaxes
variables.myArray += "CF Runtime: " + server.coldfusion.productname + " " + server.coldfusion.productversion
variables.myArray << "User Agent: " + cgi.http_user_agent
</g:script>

<h1>There we go!</h1>
<cfdump var="#variables.myArray#" />
</cfoutput>