Where Are The CTRL-S, ALT-TAB, F5 Web Frameworks?

Ok, people, where are all the web frameworks that will give me a CTRL-S, ALT-TAB, F5 workflow?  As I've been shopping around, it seems everything requires more than that.  Some places you can script the additional steps into the refresh, but not always.  Am I the only person that doesn't want to have a million extra steps in the process every time I make a change?

Just to be clear, I'm not necessarily saying you have to avoid reinitializing your app or whatever; I just don't want to have to do it manually (nor figure out manually when I ought to do it).  Similarly, I don't want to have to rely on an IDE to do it for me; the framework should take care of all that.  I should be able to open up ANY source file in my app with an arbitrary text editor, change it, and then load the app in my browser and see my change.  I don't care if it takes 5 seconds because the app has to reload/rebuild itself to deal with the change; I just want it to happen without my intervention or any sort of special external tools.

Is this too much to ask?

Rebuilding Pic of the Day

I need some help, thoughts, recommendations as I undertake this, but first some background…

As I do every 15-18 months, I've decided that it's time to rebuild Pic of the Day.  I've never actually done it; the codebase is still the same one I started 5-6 years ago and have edited (often daily) since then.  But the amount of cruft is becoming more and more problematic, and while I could do a hard-core refactoring and trimming down of the app, I don't see a compelling benefit to doing it that way versus a ground-up rewrite, and I'm confident the latter will actually be quite a bit faster.

In the past I've created partial re-implementations with pure CFML, Spring/Hibernate, Grails, and CFML/Groovy hybrids.  In every case, one of the objectives was a gradual migration, where the two versions either shared a database, or did incremental data copies from old to new, so the app could be ported in stages.

I've decided I really don't want to do that.  Obviously I need to move data from old to new, but I'm happy with just doing the pic/recipient/rank tuples and the associated entities, and starting from scratch with the other bits (the spider state, the image pool, historical records, etc.).

My question for all of you is really about the technology stack.  As I mentioned above, I've tried several.  Time-to-market would be minimized with a CFML-centric solution, because that's what I have the best infrastructure and tooling for, but that's not a significant driver.  PotD is a hobby; it's how I entertain myself for hours every night after the kids are in bed.  I do have resource constraints on my server, particularly RAM, so that is a consideration, but other than that I'm pretty much open for anything.

If you were undertaking this project, what would you use and why?  If you don't supply the why, I'm deleting your comment.  :)

CFYourFavoriteJVMLanguage

Just in case you didn't come to my talk on leveraging Groovy in your CFML applications at CFUnited, I wanted to share this simple CFM page I demoed:

<cfimport prefix="g" taglib="cfgroovy2" />
<cfscript>
  myArray = [
    "barney"
  ];
</cfscript>
<!--- thanks Groovy --->
<g:script>
  variables.myArray.add("heather")
</g:script>
<!--- thanks Quercus --->
<g:script lang="php">
  <?php
    $variables["myArray"][] = "lindsay";
  ?>
</g:script>
<!--- thanks Jython (and damn your semantic whitespace!) --->
<g:script lang="python">
variables["myArray"].append("emery")
</g:script>
<cfdump var="#myArray#" label="myArray" />

Yes, that's building a single CFML array (named 'myArray') using CFSCRIPT, Groovy, PHP, and Python.

CFGroovy2 is not a way to plug Groovy code into your CFML; it's a way to plug an arbitrary JSR-223 scripting language into your CFML.  I happen to focus on Groovy because I feel that language best complements CFML/Java development, but any language that supports the JSR-223 scripting specification (for embedding scripting languages into the JVM) will happily run via the <g:script> tag.

CFGroovy2 and Ehcache

Until this point, CFGroovy2 has used a custom WeakHashMap/WeakReference caching mechanism for compiled scripts.  It works, and it ensures that the script cache won't run your JVM out of memory, but that's about its only selling point.

Today I plugged in Ehcache as an optional caching backend if it's available on your classpath.  When CFGroovy2 spins up its internal infrastructure (on first use, and stored in the server scope), it will see if Ehcache is available.  If it is, it will be used, otherwise the original mechanism will be.  Note that JavaLoader is not used to load Ehcache; you must have Ehcache on your classpath if you wish to use it.

To complement this new behaviour, I've also added a <g:manager> tag.  You can use it to clear the script cache, as well as retrieve information about the cache.  To clear the cache, you can do this:

<g:manager action="clearCache" />

If you want to see how big the cache is, or view performance metrics (Ehcache-only), you can do this:

<g:manager action="getCacheInfo"
  result="info" />

The 'info' variable is guaranteed a 'usingEhcache' value (boolean or 'unknown'), along with a 'cacheSize' value (integer).  With Ehcache, you'll also get hits, misses, and evictions (all integers).
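To make the shape of that result concrete, here is a toy Python sketch of a bounded cache that tracks the same kinds of figures. It is not CFGroovy2's code and not Ehcache's API: the class name, the `max_entries` parameter, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class ScriptCache:
    """Toy bounded cache that counts hits, misses, and evictions."""

    def __init__(self, max_entries=2):
        self.max_entries = max_entries  # hypothetical size limit
        self.entries = OrderedDict()
        self.hits = self.misses = self.evictions = 0

    def get(self, key, compile_fn):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = compile_fn(key)
        self.entries[key] = value
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # drop the least recently used
            self.evictions += 1
        return value

    def clear(self):
        self.entries.clear()

    def info(self):
        return {
            "cacheSize": len(self.entries),
            "hits": self.hits,
            "misses": self.misses,
            "evictions": self.evictions,
        }


cache = ScriptCache(max_entries=2)
for source in ["a + 1", "a + 1", "b * 2", "c - 3"]:
    cache.get(source, compile_fn=str.upper)  # str.upper stands in for compilation
info = cache.info()
print(info)
```

Four lookups against a two-entry cache give one hit, three misses, and one eviction, which is the sort of breakdown the Ehcache-backed 'info' result exposes.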

As always, the runtime data structures are stored in server.cfgroovy should you wish to introspect them directly.  However, they are not part of the public API, and are therefore not guaranteed to remain unchanged across releases.

Subversion access for the engine is here: https://ssl.barneyb.com/svn/barneyb/cfgroovy2/trunk/engine.  If you already have Groovy on your classpath or are planning to put it there (which is recommended), you only need 'script.cfm' (and optionally 'manager.cfm').

Custom Scopes for CFGroovy2

The middle of last week I committed an update to CFGroovy2 to allow an arbitrary number of scopes to be passed in as attributes, just like you have always been able to do with the 'variables' attribute.  If you update, the change is the addition of lines 47-51:

<cfloop list="#lCase(structKeyList(attributes))#" index="scope">
  <cfif listFind("language,lang,script,variables", scope) EQ 0>
    <cfset binding.put(scope, attributes[scope]) />
  </cfif>
</cfloop>

Basically it just loops over all the attributes, lowercases them (because Groovy is case sensitive, but CFML's structs aren't – this is required for consistent behaviour across CFML runtimes), ensures they're not one of the four well-known attributes to the <g:script> tag, and adds them to the global bindings.

This is especially useful for scriptlets within UDFs and CFC methods.  Without this change, the typical use case was like this:

<cffunction name="sayHello">
  <cfargument name="name" />
  <cfset var local = {
    greeting = variables.greeting,
    name = arguments.name
 } />
  <g:script variables="#local#">
    variables.result = variables.greeting + ", " + variables.name
  </g:script>
  <cfreturn local.result />
</cffunction>

There are a couple issues with this:

  1. The 'local' pseudo-scope is referenced as 'variables' inside the <g:script>, which compromises readability
  2. Since 'variables' is populated with the 'local' pseudo-scope, you can't access the CFC's 'variables' scope from within <g:script> (hence copying variables.greeting into local.greeting)
  3. Because there is only a single 'variables' attribute, you can't pass in multiple pseudo-scopes

With the new addition, all three problems are addressed, and the same example will look like this:

<cffunction name="sayHello">
  <cfargument name="name" />
  <cfset var local = {
    name = arguments.name
  } />
  <g:script local="#local#">
    local.result = variables.greeting + ", " + local.name
  </g:script>
  <cfreturn local.result />
</cffunction>

Notice that I no longer have to copy 'greeting' from the 'variables' scope into the 'local' pseudo-scope, and that within the <g:script> I still have the variables/local delineation, exactly as it exists in the CFML code.  I haven't shown passing multiple extra pseudo-scopes, but just tack on additional attributes.

I haven't tried this with CF9, but now that it provides direct access to the function-local scope (named 'local'), you should be able to pass that into <g:script> using this mechanism and have it behave natively.

Edit Distances Bug

This evening I found a bug in one of the optimizations that I made to the edit distance function.  I've corrected the code in the original post, and made a note of the change there as well.  Just wanted to mention it in a second post so anyone who read via RSS will be aware of it (since they won't necessarily go back and look at the original).

Edit Distances and Spiders

An edit or string distance is the "distance" between two strings in terms of editing operations.  For example, to get from "cat" to "dog" requires three operations (replace 'c' with 'd', replace 'a' with 'o', and finally replace 't' with 'g'), thus the edit or string distance between "cat" and "dog" is three.  Aside from replace, there are also the insert and delete operations, so the distance between "cowbell" and "crowbar" is four (insert 'r', replace 'e' with 'a', replace 'l' with 'r', delete 'l').  This particular sort of edit distance is called the Levenshtein distance.

Here is an implementation of a function in Groovy that does the computation (based on the pseudocode at http://en.wikipedia.org/wiki/Levenshtein_distance):

def editDistance(s, t) {
  int m = s.length()
  int n = t.length()
  int[][] d = new int[m + 1][n + 1]
  for (i in 0..m) {
    d[i][0] = i
  }
  for (j in 0..n) {
    d[0][j] = j
  }
  for (j in 1..n) {
    for (i in 1..m) {
      d[i][j] = (
        s[i - 1] == t[j - 1]
        ? d[i - 1][j - 1] // same character
        : Math.min(
            Math.min(
              d[i - 1][j] + 1, // delete
              d[i][j - 1] + 1 // insert
            ),
            d[i - 1][j - 1] + 1 // substitute
          )
      )
    }
  }
  d[m][n]
}
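
As a cross-check of the two worked examples above, here is a transcription of the same algorithm into Python. It is a sketch for experimentation rather than part of the original code, and the name `edit_distance` is illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance using the full (m + 1) x (n + 1) matrix."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete every character of s
    for j in range(n + 1):
        d[0][j] = j  # insert every character of t
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # same character
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j],      # delete
                    d[i][j - 1],      # insert
                    d[i - 1][j - 1],  # substitute
                )
    return d[m][n]


print(edit_distance("cat", "dog"))          # 3
print(edit_distance("cowbell", "crowbar"))  # 4
```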

That might not seem very useful, but consider the problem of grouping strings together.  This works especially well for URLs, which are hierarchical in nature, and therefore typically differ in only small ways from other similar URLs (at least as far as the site in question's internal concept of organization is concerned).  As a concrete example, you wouldn't expect "http://www.google.com/search" and "http://mail.google.com/a/barneyb.com" to be very similar pages, because their URLs are quite different.  However, you'd probably expect "http://mail.google.com/a/barneyb.com" and "http://mail.google.com/a/example.com" to be similar.

This came up as part of my never ending quest to optimize Pic of the Day's behaviour, specifically the spidering aspect.  Consider a photo gallery page on some arbitrary site.  The core of it is 10-20 links to pages that show full-size images, but that is surrounded by (and possibly interleaved with) navigation, links to related content, advertisements, etc.  So the task is to get those core links and none of the other garbage.  Also keep in mind that the algorithm has to work on arbitrary pages from arbitrary sites, so you can't rely on any sort of contextual markup.

Fortunately, this is a simple task with a string distance-based algorithm.  Consider this list of URLs (the 'href' values for all 'a' tags on some gallery page, sorted alphabetically, and trimmed for illustrative purposes):

http://www.hq69.com
http://www.hq69.com/cgi-bin/te/o.cgi?g=home
http://www.hq69.com/galleries/andi_sins_body_paint/index.php
http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_001.php
http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_002.php
http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_003.php
http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_004.php
http://www.hq69.com/galleries/juliya_studio_nude/index.php
http://www.hq69.com/galleries/khloe_loves_ibiza/index.php
http://www.hq69.com/galleries/lola_spray/index.php

You can quite easily see via visual scan that the URLs we want are lines 4-7 (the four beatrix_secrets pages), and while you might not realize it, you compared the relative difference between all the URLs and decided those four were sufficiently similar to be considered the "target" URLs.  The first part is an edit distance of some sort, and the latter part is based on some sort of relative threshold for acceptance.  More subtle is that lines 3, 8, 9, and 10 (the gallery index.php pages) are also quite similar to one another, but not nearly as similar as the first group.

Of course, using the string distance function to arrive at this relationship isn't a direct path.  One approach is to compare every string to each other, and then build a graph of similarity to deduce the clusters.  Unfortunately, this is prohibitively expensive for moderately sized sets.  It's also really complicated to implement.  ;)

Much easier is to build the clusters directly.  Create an (empty) collection of clusters, and then loop over the URLs.  For each URL iterate through the clusters until you find one it is sufficiently similar to and add it.  If you don't find a suitable cluster, create a new cluster with the URL as the sole member.  Here's some code that does just that:

clusters = []
threshold = 0.5
urls.each { url ->
  for (c in clusters) {
    if (1.0 * editDistance(url, c[0]) / Math.max(url.length(), c[0].length()) <= threshold) {
      c.add(url)
      return
    }
  }
  clusters.add([url])
}

You'll end up with an array of arrays, with each inner array being a cluster of similar URLs matching the clusters outlined above.  The 'threshold' variable determines how close the strings must be in order to be considered cluster members.  In this case I'm using a threshold of 0.5, which means that the edit distance must be no more than half the max length of the two strings.  I.e., at least half the characters have to match.  This is a ridiculously low threshold, but I needed it in order for the second cluster to materialize.  In practice you'd want a threshold of 0.05 to 0.1 I'd say, though I haven't spent much time tuning.
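
To see the effect of the threshold, here is a small Python sketch of the same first-fit clustering run against the ten URLs listed earlier. It is an illustration with made-up names (`edit_distance`, `cluster_urls`), it uses a two-row variant of the distance computation to stay short, and it is not the code Pic of the Day runs.

```python
def edit_distance(s, t):
    """Levenshtein distance, keeping only two rows of the matrix."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # delete
                            curr[j - 1] + 1,      # insert
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_urls(urls, threshold):
    """First-fit clustering: compare each URL to a cluster's first member."""
    clusters = []
    for url in urls:
        for cluster in clusters:
            head = cluster[0]
            if edit_distance(url, head) / max(len(url), len(head)) <= threshold:
                cluster.append(url)
                break
        else:  # no existing cluster was close enough
            clusters.append([url])
    return clusters


base = "http://www.hq69.com"
urls = [
    base,
    base + "/cgi-bin/te/o.cgi?g=home",
    base + "/galleries/andi_sins_body_paint/index.php",
    base + "/galleries/beatrix_secrets/beatrix_secrets_001.php",
    base + "/galleries/beatrix_secrets/beatrix_secrets_002.php",
    base + "/galleries/beatrix_secrets/beatrix_secrets_003.php",
    base + "/galleries/beatrix_secrets/beatrix_secrets_004.php",
    base + "/galleries/juliya_studio_nude/index.php",
    base + "/galleries/khloe_loves_ibiza/index.php",
    base + "/galleries/lola_spray/index.php",
]

largest = max(cluster_urls(urls, 0.1), key=len)
for url in largest:
    print(url)
```

At a threshold of 0.1 the largest cluster is the four beatrix_secrets pages, which differ from one another by a single character. Re-running with 0.5 shows how much more permissive the looser setting is.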

This algorithm is reasonably fast and greatly reduces the number of distance computations required in order to build the clusters.  However, it's still pretty slow.  Fortunately, there are a few heavy-handed optimizations to make.

First and simplest, URLs are typically most different at the right end (i.e. the protocol and domain are usually constant), and since identical substrings don't change the distance computation, stripping an identical prefix from the strings might greatly reduce the amount of checking required without impacting the accuracy.

Second, since the difference cannot be less than the difference in length between the two strings, we can check that against the threshold up front and avoid doing any part of the distance computation.

Third, we can push the threshold check partially into the editDistance() function so that it will abort as soon as a sufficient distance is found without having to check the rest of the strings.

Fourth and finally, keeping the clusters sorted by size (largest first) assures that we'll get the most matches with the fewest cluster seeks, which reduces the number of comparisons that need to be made.  For equal-sized clusters, putting the one with shorter URLs first will further increase the chance that the "difference in length" check (optimization two) will trigger, saving even more comparisons.
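
Three of these optimizations need no distance computation at all, so they are easy to sketch in isolation. The Python below is an illustration with made-up names (`strip_common_prefix`, `lengths_rule_out_match`, `cluster_sort_key`); it mirrors the ideas above rather than the exact Groovy that follows.

```python
def strip_common_prefix(s, t):
    """Optimization one: drop the leading characters the strings share."""
    i = 0
    while i < min(len(s), len(t)) and s[i] == t[i]:
        i += 1
    return s[i:], t[i:]


def lengths_rule_out_match(s, t, threshold):
    """Optimization two: the distance is at least the difference in length."""
    longest = max(len(s), len(t))
    return longest > 0 and abs(len(s) - len(t)) / longest > threshold


def cluster_sort_key(cluster):
    """Optimization four: biggest clusters first, then shortest first URL."""
    return (-len(cluster), len(cluster[0]))


a = "http://mail.google.com/a/barneyb.com"
b = "http://mail.google.com/a/example.com"
print(strip_common_prefix(a, b))  # ('barneyb.com', 'example.com')

clusters = [["aaaa"], ["bb", "cc"], ["a"]]
clusters.sort(key=cluster_sort_key)
print(clusters)  # [['bb', 'cc'], ['a'], ['aaaa']]
```

Note that the relative threshold should still be computed from the original string lengths, so take the maximum length before stripping the prefix.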

Here's the complete code with these optimizations in place (optimization two moved the threshold check into a separate method):

Update 2009/09/25: I found a bug in the short-circuiting evaluation mechanism (optimization three), and have corrected the code below.  Fixing this issue required doing the diagonal optimization I mentioned at the end of the post.  The change is the pair of 'min' and 'max' bounds computed at the top of the outer loop, which limit the building of the 'd' matrix to only the diagonal stripe that it is possible to traverse within the bounds of the provided threshold.

def editDistance(s, t, threshold = Integer.MAX_VALUE) {
  for (i in 0..<Math.min(s.length(), t.length())) {
    if (s[i] != t[i]) {
      s = s.substring(i)
      t = t.substring(i)
      break;
    }
  }
  int m = s.length()
  int n = t.length()
  int[][] d = new int[m + 1][n + 1]
  for (i in 0..((int) Math.min(m, threshold))) {
    d[i][0] = i
  }
  for (j in 0..((int) Math.min(n, threshold))) {
    d[0][j] = j
  }
  for (j in 1..n) {
    int min = Math.max(j / 2, j - threshold - 1)
    int max = Math.min(m, j + Math.min(j, threshold) + 1)
    for (i in min..max) {
      d[i][j] = (
        s[i - 1] == t[j - 1]
        ? d[i - 1][j - 1] // same character
        : Math.min(
            Math.min(
              d[i - 1][j] + 1, // delete
              d[i][j - 1] + 1 // insert
            ),
            d[i - 1][j - 1] + 1 // substitute
          )
      )
      if (d[i][j] > threshold) {
        return threshold * 2 // falsely inflate to avoid floating point issues
      }
    }
  }
  d[m][n]
}
def doStringsMatch(s, t, threshold) {
  if (s == t) {
    return true;
  } else if (s == "" || t == "") {
    return false;
  }
  def maxLen = Math.max(s.length(), t.length())
  if (Math.abs(s.length() - t.length()) / maxLen > threshold) {
    return false
  }
  1.0 * editDistance(s, t, threshold * maxLen) / maxLen <= threshold
}
clusters = []
threshold = 0.1
clusterComparator = { o1, o2 ->
  def n = o2.size().compareTo(o1.size())
  if (n != 0) {
    return n
  }
  o1[0].length().compareTo(o2[0].length())
} as Comparator
urls.each { url ->
  clusters.sort(clusterComparator)
  for (cluster in clusters) {
    if (doStringsMatch(url, cluster[0], threshold)) {
      cluster.add(url)
      return
    }
  }
  clusters.add([url])
}

Groovy Objects in CFML (a la Ben)

Ben Nadel posted an interesting article over on his blog titled Instantiating Groovy Classes In The ColdFusion Context where he demoed how to create a class factory in Groovy and invoke it from CFML to instantiate new instances of Groovy classes without actually reentering a Groovy context.  I wanted to expound on what he demoed a bit more.

First and foremost, these two snippets are identical in functionality:

<cfset variables.map = createObject("java", "java.util.HashMap") />
<g:script script="variables.map = new HashMap()" />

After either line executes, 'variables.map' will contain a HashMap instance that is a full-fledged citizen of the JVM.  The way it was created is irrelevant.  This goes for ANY Java object, whether it's a CFC, Groovy, core Java, etc.  That's why Ben's code works; once his factory is created, it can just do its thing.  The context that its methods are invoked in is irrelevant because the object itself doesn't change based on context.

I'm not sure I recommend a generic factory like this for real applications, though it's certainly good for experimentation.  For actual apps, I usually write one or more singletons (often DAOs) in Groovy that fulfill both a business role in the app as well as the VO creation role.  Then you have newPerson(name, hair, gender) rather than get("Person").init(name, hair, gender) with a proxy and a pile of reflection.

What makes this even more interesting is that because your objects are Groovy objects, they have the Groovy neatness available to them.  In particular, implicit getters.  If you write your persistence layer with Groovy (but not Hibernate), implicit getters can easily synthesize the transitive recall that your Hibernate-managed objects get for free.

At work, we do a lot of stuff on a JSR-170 Java Content Repository (with Magnolia CMS), which can't be fronted by Hibernate, but using Groovy I still get lazy loading across entity relationships.  My DAO (written in Groovy) has this method for retrieving a list of newsletters:

def getNewsletterList(int limit = 999999999) {
  def nodes = daoHelper.query("/*[MetaData/mgnl:template='newsletter']").reverse()
  if (nodes.size() > limit) {
    nodes = nodes.subList(0, limit)
  }
  nodes.collect { new Newsletter(daoHelper, it) }
}

And the Newsletter class has this implicit getter:

def _articles
def getArticles() {
  if (! _articles) {
    _articles = daoHelper.query("/*[jcr:uuid='$uuid']/*[MetaData/mgnl:template='newsletter_article']")
      .collect { new Article(daoHelper, it) }
  }
  _articles
}

When first invoked, it'll use daoHelper (which was passed to the Newsletter in the first snippet) to pull out a list of article nodes from the JCR and inflate them into Article instances.  That list of instances is cached in '_articles' so it doesn't have to be created multiple times.  This lets me grab a newsletter, and if needed, lazy load the newsletter's articles without paying the cost of loading the articles if I'm not going to need them.  But even better is the CFML that uses these objects.

Here's a snippet that renders a list of newsletters (retrieved via getNewsletterList) with a list of articles:

<ul class="issue-list">
<cfloop array="#newsletters#" index="newsletter">
  <li><h3 class="title"><a href="#newsletter.handle#">#newsletter.title#</a></h3>
    <h4>In This Issue:</h4>
    <ul>
      <cfloop array="#newsletter.articles#" index="article">
        <li><a href="#article.handle#">#article.title#</a></li>
      </cfloop>
      <li class="full"><a href="#newsletter.handle#">View Full Issue</a></li>
    </ul>
  </li>
</cfloop>
</ul>

What you'll notice is that I'm treating the 'newsletters' variable as if it were an array of structs, when it's actually an array of Groovy instances.  In particular, notice the way I loop over the articles.  No method call, it's just a property.  This is ridiculously powerful, because it lets you flip-flop between properties and getters/setters at will without affecting calling code.
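
The same pattern, a getter that loads on first use and is read like a plain property, can be sketched in Python with `property`. This is an analogy, not a port: `FakeDaoHelper` and its `query` method are made-up stand-ins so the example runs without a content repository.

```python
class FakeDaoHelper:
    """Hypothetical stand-in for the DAO helper; counts queries issued."""

    def __init__(self, articles_by_uuid):
        self.articles_by_uuid = articles_by_uuid
        self.query_count = 0

    def query(self, uuid):
        self.query_count += 1
        return list(self.articles_by_uuid.get(uuid, []))


class Newsletter:
    def __init__(self, dao_helper, uuid, title):
        self.dao_helper = dao_helper
        self.uuid = uuid
        self.title = title
        self._articles = None  # not loaded yet

    @property
    def articles(self):
        # Load on first access, then reuse the cached list.
        if self._articles is None:
            self._articles = self.dao_helper.query(self.uuid)
        return self._articles


dao = FakeDaoHelper({"n1": ["Intro to Groovy", "JCR tips"]})
newsletter = Newsletter(dao, "n1", "September issue")
queries_before = dao.query_count  # nothing has been loaded yet
first = newsletter.articles       # triggers the one and only query
second = newsletter.articles      # served from the cached list
print(queries_before, dao.query_count)  # 0 1
```

One difference from the Groovy getter: the sketch uses `None` as its "not loaded" marker, so an empty article list is cached too, whereas `if (! _articles)` treats an empty list as not yet loaded and queries again.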

This code doesn't illustrate it, but I can traverse much more deeply than this.  Say for each article I wanted to list the distinct set of categories that the article's author has posted at least one article in.  Easy:

cats = article.author.articles.sum({ it.categories }).unique()

So this will hit the implicit getAuthor() method of Article to lazy-load an author, then call getArticles() on Author to get all of their articles, then iterate over them, calling getCategories() on each one and adding all the collections together, and then finally pull unique values.  The result will be a list of Category objects which can be further traversed:

newslettersForFirstCategory = cats.first().articles.collect({ it.newsletter }).unique()

This sort of traversal might seem really hard to read, but it's easily picked up and becomes very intuitive.  You start at the left, and each step transforms your initial data into a new piece of data and passes it down the chain.

It is worth mentioning that just like Hibernate, this sort of lazy loading does result in the N+1 query problem.  Tune your database, optimize your SQL, and you probably won't have a problem.

This Is Why Groovy Rules

So I have a collection of newsletters, and each newsletter has a collection of articles.  Each article, in turn, has a collection of authors and a collection of categories.  Now what I need to do is get a list of unique handles for all the authors and categories for a given newsletter.  Here's the CFML version:

<cfset var article = "" />
<cfset var author = "" />
<cfset var category = "" />
<cfset var s = {} />
<cfloop array="#variables.newsletter.articles#" index="article">
  <cfloop array="#article.authors#" index="author">
    <cfset s[author.handle] = "" />
  </cfloop>
  <cfloop array="#article.categories#" index="category">
    <cfset s[category.handle] = "" />
  </cfloop>
</cfloop>
<cfset variables.handles = structKeyArray(s) />

And here's the Groovy version:

variables.handles = variables.newsletter.articles.sum({
  it.authors.collect { it.handle } + it.categories.collect { it.handle }
}).unique()

I rest my case.

The use of addition ('+' operator and 'sum' method) to express appending collections together is very elegant, and coupled with 'collect' to transform each item in a collection, you get really concise and direct code.  The CFML also suffers from the lack of a Set data type, so you have to fake it with the keys of a structure.
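
For comparison, the same collect-then-deduplicate idea can be written in Python, where a built-in set type does the deduplication. This is a sketch with made-up sample data; plain dictionaries stand in for the newsletter, article, author, and category objects.

```python
newsletter = {
    "articles": [
        {
            "authors": [{"handle": "alice"}],
            "categories": [{"handle": "groovy"}, {"handle": "cfml"}],
        },
        {
            "authors": [{"handle": "alice"}, {"handle": "bob"}],
            "categories": [{"handle": "cfml"}],
        },
    ]
}

# Gather author and category handles from every article; the set drops repeats.
handles = {
    item["handle"]
    for article in newsletter["articles"]
    for item in article["authors"] + article["categories"]
}
print(sorted(handles))  # ['alice', 'bob', 'cfml', 'groovy']
```

Unlike Groovy's unique(), a Python set does not preserve the order in which handles were first seen, hence the explicit sort before printing.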

LessCss for CFML Developers

LessCss is a nifty extension to the core CSS language to support variables, calculated values, inclusion-by-reference, nesting, and some other goodies.  Basically the idea is to simplify and reduce duplication in large stylesheets.  Here's a simple example of a variable and a computed value:

@color: #fdd;
#header {
  background-color: @color;
  color: @color / 2;
}

That translates into the following CSS:

#header {
  background-color: #ffdddd;
  color: #7f6e6e;
}
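
The arithmetic behind that computed value can be reproduced in a few lines of Python. This is a sketch of the calculation only, not of how LESS is implemented: it assumes the shorthand color is expanded to six digits and each channel is divided and truncated, which matches the output shown above.

```python
def expand_hex(color):
    """Expand '#fdd' to 'ffdddd'; pass six-digit colors through."""
    digits = color.lstrip("#")
    if len(digits) == 3:
        digits = "".join(ch * 2 for ch in digits)
    return digits


def divide_color(color, divisor):
    """Divide each RGB channel and truncate to a whole number."""
    digits = expand_hex(color)
    channels = [int(digits[i:i + 2], 16) for i in (0, 2, 4)]
    return "#" + "".join(f"{int(c / divisor):02x}" for c in channels)


print("#" + expand_hex("#fdd"))  # #ffdddd
print(divide_color("#fdd", 2))   # #7f6e6e
```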

Unfortunately, browsers don't understand LESS, they only understand CSS, so you have to compile it.  Even worse, LESS is implemented as a Ruby gem and/or Rails plugin, which makes it annoyingly complex to deal with.

Fortunately, JRuby and CFGroovy2 will happily smooth things over, and let you do the translation on demand.  For development, this maintains the CTRL-S/ALT-TAB/F5 workflow that we all know and love.  For production, the compilation would be better done during your release creation process, but for sites without a pile of traffic, doing it dynamically is probably no big deal.  You might also want a rewrite engine (like mod_rewrite) to make your URLs prettier or offload the existence checks, but I've not addressed that here.

First, your HTML needs to be changed to have LINK tags like this (this is why the rewrite solution is ideal):

Then you need lesscss.cfm, of course:

  src = new File(variables.thisDir + url.p)
  dest = new File(src.parentFile, src.name.replaceAll(/\.le?ss/, ".css"))
  variables.regenerate = ! dest.exists() || src.lastModified() >= dest.lastModified()
  variables.srcPath = src.canonicalPath
  variables.destPath = dest.canonicalPath

  require 'less'
  $variables["css"] = Less::Engine.new(File.new($variables["srcPath"])).to_css

Before that'll run, you'll need CFGroovy2 installed (and possibly update the CFIMPORT tag to point at the right location).  You'll also need Groovy, JRuby, the JRuby libs, and Less installed into your /WEB-INF/lib folder.  CFGroovy2 can bootstrap its own Groovy, but since you have to drop in JARs for the JRuby and Less pieces, it's easier to just do the same for Groovy.  Here are the JARs:

With that, restart your server, write yourself a Less file (or grab the sample at the top), and fire 'er up!
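
The staleness check that decides whether to recompile can also be sketched on its own. The Python below mirrors the Groovy lines in lesscss.cfm that compute 'dest' and 'regenerate'; the function names are illustrative, and the compile step itself is left out.

```python
import os
import re
import tempfile


def css_path_for(src_path):
    """Swap a .less (or .lss) extension for .css, as the regex above does."""
    directory, name = os.path.split(src_path)
    return os.path.join(directory, re.sub(r"\.le?ss", ".css", name))


def needs_regeneration(src_path, dest_path):
    """True when the CSS is missing or no newer than its source."""
    if not os.path.exists(dest_path):
        return True
    return os.path.getmtime(src_path) >= os.path.getmtime(dest_path)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "site.less")
    with open(src, "w") as f:
        f.write("@color: #fdd;\n")
    dest = css_path_for(src)
    missing = needs_regeneration(src, dest)  # True: no CSS yet
    with open(dest, "w") as f:
        f.write("/* compiled */\n")
    os.utime(src, (1000, 1000))              # pretend the source is old
    os.utime(dest, (2000, 2000))             # and the CSS is newer
    fresh = needs_regeneration(src, dest)    # False: CSS is up to date
    os.utime(src, (3000, 3000))              # the "CTRL-S" on the source
    stale = needs_regeneration(src, dest)    # True: source changed

print(missing, fresh, stale)  # True False True
```

Because the check runs on every request, saving the Less file is all it takes for the next page load to pick up the change.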