2008/01/11

On the Largeness of Code

Stevey's Blog Rants contained a nice rant about code size. This resonates with me in a big way. Stevey characterizes statically typed languages (meaning Java et al) as being overly verbose, and dynamic languages as being much more concise. Among the reasons for this, Stevey points at the static type system as adding a lot of overhead. But he didn't go into a lot of detail as to why this is so.

I think there are at least three reasons:
  1. Libraries for static languages tend to focus on data structures and algorithms over protocols
  2. Because of this emphasis on structures and algorithms, the code ends up shuffling a lot of data around, for no net benefit
  3. By their nature, static languages wall off an entire domain of programming techniques that can drastically reduce source code overhead
I did an informal study a while back. I looked at the implementations of doubly linked lists (dlists) written in Eiffel and in a popular C++ library. Taking into account all the inherited code, the comments and so forth, I looked at the total statement count to implement a dlist in the two languages. The first surprise was that they were about the same size. The second surprise: they were about 1,100 lines of code each.

Over one thousand lines of code for a dlist! Everyone who has taken a class in data structures has probably hand-coded a one-off dlist, but they didn't need a thousand lines of code to do it.
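For comparison, here's what that one-off hand-coded version looks like — a minimal, purely illustrative sketch in Python, with none of the iterators, copying semantics, or edge-case handling the library versions carry:

```python
# A bare-bones doubly linked list -- the kind of one-off a data
# structures class produces, not a library-grade implementation.

class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

class DList:
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node

    def prepend(self, value):
        node = Node(value)
        if self.head is None:
            self.head = self.tail = node
        else:
            node.next = self.head
            self.head.prev = node
            self.head = node

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next
```

Roughly thirty lines. The library versions do more, of course — but not forty times more.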

So what is the dynamic language equivalent of a dlist, and how big is it? Truthfully, I don't know. I've never seen a doubly linked list in a dynamic language. Perhaps my experience is too shallow, or perhaps it's because all these languages come with built-in lists that are good enough 99.9% of the time. The programmer doesn't have to worry about the underlying functionality, whether it's doubly linked, singly linked, blocks of arrays, or whatever. The lists just work, you can easily rearrange things on both ends, and performance isn't usually an issue. So in this competition to implement dlists we have:

Static languages: 1,100
Dynamic languages: zero

And in this case, as in golf, small numbers are better. So why should we be concerned about this? Well, I have a feeling that as goes the dlist, so goes the rest of the program. If it takes a thousand lines of code to implement a simple data structure, then every trivial little thing is going to take a lot of code, and before long, we're talking megalines.
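The "zero" is worth making concrete. In Python, for instance, the built-in list (or collections.deque from the standard library, when both ends matter) already does everything you'd want the thousand-line class for:

```python
from collections import deque

# The built-in list handles the common cases directly...
items = [2, 3]
items.insert(0, 1)   # add at the front
items.append(4)      # add at the back
items.pop(0)         # remove from the front

# ...and deque gives cheap O(1) operations at both ends
# when that actually matters.
d = deque([2, 3])
d.appendleft(1)
d.append(4)
print(list(d))  # [1, 2, 3, 4]
```

No class to write, no representation to choose — which is exactly the point.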

Another really interesting aspect of this is not just the bulk of code needed to implement a dlist. It's the fact that the class exists at all. Browse a popular C++ library (boost comes to mind) or, if you have a taste for obscure languages, the Eiffel library. There's a plethora of classes that define algorithms and data structures: linked lists, trees, hash tables, hashed sets, queues, deques, stacks and so on. In the current distribution of the Eiffel compiler, there are over 120 classes with the word "list" in the name.

Now browse the standard distribution for Perl or Python. There are broad categories for protocols, frameworks and interfaces, things like SOAP, xmlrpc, SMTP, Apache mods, test rigs, and so on. In the 5.8 Perl distribution on my machine, there are only 3 modules with "list" in the name. These libraries all exist for the statically typed languages too, but they're piled on top of all the algorithms and data structures mentioned above. This leads to more problems that I'll discuss below.

So where are all the data structure libraries for Perl and Python? They are out there, but they're not needed for most day to day coding. If you need a tree in Perl, you make do with a list of lists, since that's a natural enough way of representing a tree. You don't need a tree class implemented with thousands of lines.
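Here's what "make do with a list of lists" looks like in practice — a small sketch in Python, where each node is just a list whose first element is the label and whose remaining elements are the children:

```python
# A tree as nested lists: [label, child, child, ...].
tree = ["root",
        ["a", ["a1"], ["a2"]],
        ["b", ["b1"]]]

def walk(node, depth=0):
    """Yield (depth, label) pairs in preorder."""
    yield depth, node[0]
    for child in node[1:]:
        yield from walk(child, depth + 1)

for depth, label in walk(tree):
    print("  " * depth + label)
```

That's the whole tree "implementation": the data and an eight-line traversal.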

All of these classes that focus on algorithms and data structures are interesting -- to a software technogeek. And at one time when memory and processor cycles were at a premium, they were highly relevant (they still are in specific applications). But today for the majority of programmers and applications, I think they just get in the way and obfuscate the problem that's being tackled. After parsing an XML file, I just need an interface that lets me easily traverse it. I don't want a boatload of classes between me and the actual data.
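And that's what the dynamic-language standard libraries tend to hand you. A quick sketch using Python's standard xml.etree module — the parsed tree itself is the interface, with no container classes in between:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<catalog>
  <book id="1"><title>SICP</title></book>
  <book id="2"><title>On Lisp</title></book>
</catalog>
""")

# Traverse the parsed document directly.
titles = [book.find("title").text for book in doc.iter("book")]
print(titles)  # ['SICP', 'On Lisp']
```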

So I think this is the first reason for code bloat with statically typed languages: an overemphasis on algorithms and data structures. This is a permanently embedded feature of the territory. Certainly you couldn't take the average C++ list class and expect to get very far shoving arbitrary lists of lists into it to emulate a tree.

The second thing that adds to code bloat is that these classes, combined with design patterns, lead to a proliferation of intermediate state. They have to, because all those algorithms and structures need to track a lot of information to handle what they're doing. Contributing to this, the data used by higher-level libraries is often in the "wrong" container, say an arrayed list instead of a linked list, and more code is needed to mesh the different representations.

So you've got all these data structures that aren't needed in a dynamic language, and all this state to help manage them, and all this code to manage all this state. This also means that the client of these classes has to manage a lot of their own internal state about the objects they're working with (hence the need for things like type-specific iterators for C++ container classes, and so on).

The ironic thing is that all of these algo-structures are meant to provide some performance gain or convenience for manipulating data. A dlist is more efficient than an array for some operations; a tree is more efficient than a dlist for others. But in most applications, it doesn't matter. Other constraints, like disk I/O or network traffic or user input often dominate. Or the performance difference isn't really noticeable; computation is cheap these days.

Brief digression: at one company a few years ago, the performance of a data processing program had slowed to a crawl. Another rather arrogant programmer spent several months re-implementing the B-Tree algorithm at its core using a combined heap and splay tree. The result was a marginal performance improvement. Months later, I found a horribly inefficient sort algorithm at the bottom of the program, replaced it, and got a 90% increase in performance. The focus on data structures was entirely misplaced and significantly increased the complexity of the code.

So these are two reasons for statically typed language bloat. There's a third reason: there are just some things you can't do in a statically typed language that you can do in a dynamic language. I'm thinking in particular of metaprogramming and declarative programming. When I first started working with these notions I was very leery of them. It felt a bit like putting the crazies in charge of the insane asylum. I can barely control what's going on in my program now, and I'm going to have the program write code for itself? Crazy indeed.

Yet these are incredibly powerful concepts that can lead to a very concise solution to a problem. An example of this (and where I cut my teeth on metaprogramming) can be seen in the bOP library from Bivio. Intended for developing Web sites, bOP makes extensive use of metaprogramming, metadata and code generation. If you look at the Petshop example (you can view the source for each page online), you won't see much HTML; in fact, a lot of the pages look like large data declarations (lists of lists). The data is just the relevant attributes for a given page, perhaps with references to data objects for populating fields, controls and tables.

The whole thing is very sophisticated and very concise. It's also challenging to read if you're not a Perl aficionado. Yet the point remains that Bivio is able to implement large, real-world applications with very small teams, and very small software artifacts, especially when compared to the Java equivalent.
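I can't reproduce bOP here, but the flavor of the technique translates into a small Python sketch (all names hypothetical, nothing to do with the real bOP API): a declarative spec plus a few lines of code generation stands in for what would otherwise be a hand-written class per page.

```python
# Declarative spec: a "page" is just data -- field names and defaults.
# Hypothetical example in the spirit of declarative metaprogramming.
PAGE_SPEC = {
    "LoginPage":  [("username", ""), ("password", ""), ("remember_me", False)],
    "SearchPage": [("query", ""), ("max_results", 10)],
}

def make_page_class(name, fields):
    """Generate a class with an __init__ built from the field spec."""
    def __init__(self, **kwargs):
        for field, default in fields:
            setattr(self, field, kwargs.get(field, default))
    return type(name, (object,), {"__init__": __init__})

# Generate all the page classes at load time.
pages = {name: make_page_class(name, fields)
         for name, fields in PAGE_SPEC.items()}

login = pages["LoginPage"](username="steve")
print(login.username, login.remember_me)  # steve False
```

Adding a new page means adding a few lines of data, not a new class — which is where the "very small software artifacts" come from.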

A few years ago AOP was making a big splash in the Java world. AOP is really just a very constrained form of metaprogramming. After the initial splash, interest in it seems to have died down, I think in part due to anti-competitive practices by some of its proponents, but that's a separate rant. I think another reason it faded is precisely because it is limited. It was too easy to hit those limitations, and AOP didn't provide a way of overcoming them.