Beat back Dependency Creep

Sean Nolan

5 years ago

This is often the first problem new (and some old) engineers create for themselves. It sounds so smart — why build something that already exists? It’s already debugged. We get improvements for free. That’s not our core competency. All things that can be true. No sane person is going to build an OS or database by themselves. Certain core middleware can make sense too. We stand on the work of others and thank Gates and Torvalds for that.

But dependency is an insidious bastard, and in an honest accounting you often lose more than you win. Every guest in your home creates build/patch/deploy complexity; bloats your system with extra code you don’t need; opens you up to security risks; adds (often non-obvious) performance issues; and sets you up for hours of JAR/DLL hell figuring out which version of what is conflicting with who. And the shining promises of open source notwithstanding, you’re almost always reliant on external developers and their own timelines for updates.

OK then. If dependencies are usually bad but sometimes good, and you rarely have enough information on day one to know which is which, what’s the play? There is no perfect answer, but I optimize using three basic principles:

Reject new dependencies by default. Most of the time this is the right choice, so it saves you time and keeps the system maximally clean. Hearing “no” is also a good checkpoint moment for new developers that haven’t learned these lessons yet; they’ve grown up including random code without even thinking about it.
Identify areas that might justify the cost of dependency; re-evaluate as appropriate. You’ll be surprised how often that initial “no” is the last you ever hear. Simple implementations do the job and they just don’t come up again. But if you do find bugs, or need to extend functionality in these areas, keep your eyes open. At some point the calculus may shift and justify a new approach.
Only let your most experienced developers make the final call. You want these decisions made by folks that have made similar ones before and have had to live with them. There is simply no substitute for the school of hard knocks. More tactically, having a funnel avoids dependencies sneaking in under the radar.

My Friend CSV

CSV (“comma-separated values”) text files are an underappreciated workhorse of our economy. They can be written and read by almost any app on any platform, they’re tight and efficient, they compress well and you can make sense of them with a text editor. If you can name a tech organization that doesn’t have CSV somewhere in its processes, a crisp $5 bill shall be yours.

This simplicity and ubiquity make CSV a great vehicle for thinking about dependency judgments. There are dozens if not hundreds of CSV readers and writers available for Java alone, most notably the “Apache Commons CSV” project which is frankly pretty sweet.

It’s worth a brief interruption to comment on Apache Commons more generally. It is chock full of great stuff, and if I’m going to use external code at all, it’s always my first stop. They work hard to simplify their own dependency trees and keep interfaces stable, because they understood from day one the very issues I’m raising here. A blanket “ok” to Commons projects isn’t nuts; just don’t imagine it comes with zero cost. Way too many Java libraries require or embed specific versions of Commons JARs … and if you’re forced to use one of those, it’s game on for JAR hell just the same.

V1 – Short and Sweet

It starts something like this. Given the name of a Muppet, return the year they were introduced. I’ve created a CSV file with this data (at least for some key Muppets), sourced from Wikipedia: Muppets.csv.

Honestly, this ask doesn’t even deserve code; we can do an acceptable job with a simple-if-ugly shell pipeline (e.g., tail -n +2 Muppets.csv | awk -F , '$2=="Gonzo" {print $1}'). But if we were to open up an editor, our first stab is probably going to look something like this: Csv1.java (unless I specify otherwise, all of these code samples can be directly compiled with javac and run with java).
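Since the linked file is not reproduced inline, here is a minimal sketch of what a first stab like that might look like. It is not the original Csv1.java: the class and method names are made up for illustration, and the "year,name" column order with a single header row is an assumption inferred from the awk one-liner above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical stand-in for Csv1.java; names and file layout are assumptions.
class Csv1Sketch {

    // Assumed layout: one header row, then "year,name" rows.
    // Returns the year for the named Muppet, or null if there is no match.
    static String findYear(List<String> lines, String name) {
        int firstDataRow = Math.min(1, lines.size()); // skip the header if present
        for (String line : lines.subList(firstDataRow, lines.size())) {
            String[] fields = line.split(",");
            if (fields.length >= 2 && fields[1].equals(name)) {
                return fields[0];
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Csv1Sketch Gonzo Muppets.csv
        List<String> lines = Files.readAllLines(Paths.get(args[1]));
        System.out.println(findYear(lines, args[0]));
    }
}
```

Keeping the lookup separate from the file read is a small design choice that makes the logic testable without touching disk.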

No rocket science here, just some IO sugar and that great string workhorse “split”. But even at this insanely simple level, there are two sneaky but significant differences between our shell and Java versions.

The obvious one is performance: “readAllLines” does exactly that, creating a list with every line in the reference file. This kind of thing gnaws at good engineers, but it’s probably just fine. Complexity is the enemy, and a list of strings is pretty easy to work with. Do a little envelope math: how big is your file honestly going to be? If we’re talking about a megabyte or so, or even a fair bit more, these days you probably shouldn’t sweat it. We’ll come back to this one later.

The more interesting difference is that the shell version doesn’t work. OK, sometimes it doesn’t work. Specifically it doesn’t work using files created on Windows. On most platforms, the character that marks end of line in text files is \n or “line feed”. On Windows, there is a second character (\r or “carriage return”) before the line feed. Annoying, but not crazy — on old teletype printers the commands for moving the printer back to the start of the current line (“carriage return”) and down to the next line (“line feed”) were separate.

Anyways, the Unix-style tool awk (by default) only considers \n as a line separator. This means that if you feed it a Windows text file (even running on Windows), there will be a stray \r at the end of every line. In our script, that means the line we think should match “Gonzo” really will only match “Gonzo\r”. Whoops. Easy to fix, but easy to forget, and a good reminder that inherited or unintended behavior is lurking everywhere.
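The same trap can be reproduced in Java, which makes it easier to see why the Java version gets away with it. The sketch below is illustrative only (the class name and sample data are made up): splitting on \n alone leaves the stray \r attached to the last field, while BufferedReader.readLine treats \n, \r and \r\n all as line terminators and strips them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class; the sample text stands in for a Windows-created file.
class LineEndingSketch {

    // Splits on '\n' only, roughly what awk does by default.
    static String lastFieldSplitOnLf(String text) {
        String firstLine = text.split("\n")[0];
        String[] fields = firstLine.split(",");
        return fields[fields.length - 1];
    }

    // readLine() strips \n, \r, or \r\n, so no stray carriage return survives.
    static String lastFieldReadLine(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String[] fields = reader.readLine().split(",");
            return fields[fields.length - 1];
        }
    }

    public static void main(String[] args) throws IOException {
        String windowsText = "1970,Gonzo\r\n1955,Kermit the Frog\r\n";
        System.out.println("Gonzo".equals(lastFieldSplitOnLf(windowsText))); // false
        System.out.println("Gonzo".equals(lastFieldReadLine(windowsText)));  // true
    }
}
```

Files.readAllLines recognizes the same set of terminators, which is the inherited behavior that quietly saves a version built on it.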

V2 – Onesy-twosies

Your simple tool is either going to be used a few times and forgotten, or it’s going to slowly worm its way into regular use as part of some business process. The latter is great! But as those processes evolve, requests to fix bugs (aka “under-specified features”) will come your way. A few common ones:

New columns are added to the file, or existing ones are re-ordered.
The file contains extra whitespace that should be ignored.
Somebody wants to use it with tab separators instead of commas.
Somebody else wants to look at different fields.

A good engineer faced with these will ask: is this the right time to take a new approach? Should I replace my code with a dependency that has already dealt with these issues? In the list above, nothing really merits that consideration — simple, clear tweaks do the trick without major surgery or regression risk.

To demonstrate this, let’s go ahead and fix these up. We’ll do it in two steps so that I can make my point with a clean diff. First, a version that works the same as the first, just isolated into its own class and with a cute lambda flow: Csv2a.java. By the way, you’re not the first to say I should be more targeted with my Exception handling; get over it.

Here’s an updated version: Csv2b.java. We do a little work up front to pick a separator and set up a map of header names to positions, slip in a call to trim leading and trailing space, and we’re good to go. The changes don’t require refactoring the original, don’t change behavior for existing users, and can be tested in a targeted and easily-automated way.
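For readers without the linked file handy, the shape of those changes can be sketched roughly as follows. This is an illustration under assumptions, not Csv2b.java itself: the names process and lookup, the callback signature, and the choice to pass the separator as an argument are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.regex.Pattern;

// Hypothetical sketch in the spirit of Csv2b.java; not the original source.
class Csv2Sketch {

    // Calls handler once per data row with (header name -> position, trimmed fields).
    static void process(List<String> lines, char separator,
                        BiConsumer<Map<String, Integer>, String[]> handler) {
        if (lines.isEmpty()) return;
        // Quote the separator so characters like '|' are not treated as regex syntax.
        String sepRegex = Pattern.quote(String.valueOf(separator));

        // Up-front work: map each header name to its column position.
        Map<String, Integer> columns = new HashMap<>();
        String[] headers = lines.get(0).split(sepRegex, -1);
        for (int i = 0; i < headers.length; i++) {
            columns.put(headers[i].trim(), i);
        }

        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(sepRegex, -1);
            for (int i = 0; i < fields.length; i++) {
                fields[i] = fields[i].trim(); // ignore leading and trailing whitespace
            }
            handler.accept(columns, fields);
        }
    }

    // Returns valueColumn from the first row whose keyColumn equals key, or null.
    static String lookup(List<String> lines, char separator,
                         String keyColumn, String key, String valueColumn) {
        String[] found = new String[1];
        process(lines, separator, (columns, fields) -> {
            Integer k = columns.get(keyColumn);
            Integer v = columns.get(valueColumn);
            if (found[0] == null && k != null && v != null
                    && k < fields.length && v < fields.length
                    && fields[k].equals(key)) {
                found[0] = fields[v];
            }
        });
        return found[0];
    }
}
```

With something like this, lookup(lines, '\t', "Name", "Gonzo", "Year") keeps working when columns are added or re-ordered, because positions come from the header row instead of being hard-coded.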

V3 – This is getting real

Time marches forward, and eventually you might find yourself facing an ask that tips the scales away from do-it-yourself. For a CSV parser, two likely game changers could be:

Fields contain escaped separator characters.
The reference file gets really big (that’s a lot of Muppets!).

These can only be addressed in our code with some non-trivial refactoring. Our old buddy “split” will need to be replaced because it doesn’t handle escapes. And loading the whole file into memory now seems problematic, so we’ll need a streaming solution. The good news is that the lambda-based usage pattern we started with should still hold up. So before we jump too quickly, let’s go down the rabbit hole and see where we end up: Csv3.java.

It turns out that the streaming piece was actually easier than it first appeared. If we were processing aggregates or sorting over the input, things would have been different, but as it was we were able to just “drop in” a new loop using a BufferedReader and some cleanup. You’ll notice my very explicit cleanup pattern; it’s not the most concise, but it’s a habit that has virtually eliminated resource leaks in my code.
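A streaming loop of that kind might look like the sketch below. It is a guess at the general shape rather than the code in Csv3.java: the class name, the processLines signature and the exact form of the explicit try/finally cleanup are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical sketch of a streaming read with explicit cleanup.
class StreamingSketch {

    // Hands each line to the handler as it is read, so only one line is
    // held in memory at a time. Returns the number of lines processed.
    static int processLines(Reader source, Consumer<String> handler) throws IOException {
        BufferedReader reader = null;
        int count = 0;
        try {
            reader = new BufferedReader(source);
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        } finally {
            // Explicit cleanup: runs on success and on failure alike.
            // Closing the BufferedReader also closes the wrapped source.
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    // Nothing useful to do if close fails; the read result stands.
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StreamingSketch Muppets.csv
        int n = processLines(new FileReader(args[0]), System.out::println);
        System.out.println(n + " lines");
    }
}
```

A try-with-resources block would be shorter; the longhand form is used here only because the text describes a deliberately explicit pattern.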

Much more interesting is the code that parses out the fields within each line. There is no rehabilitating our “split” implementation, so we find ourselves walking and parsing strings. We’re not quite at the point where the only sane implementation is a state machine, but we’re definitely playing in that zone. A few things make it easier to grapple with the complexity:

This is really the only place (sorry Charles) where Hungarian Notation still saves the day. ich, cch, Max vs Mac — I promise that these conventions will save your bacon again and again.
Invariants and termination conditions matter. Make sure you know exactly where you stand at the start of each call and each loop iteration.
I am not a big commenter, but in methods like this, they really help. Colorizing editors especially help our brains pick out patterns and related details. Normally it’s better to rely on well-named subroutines for this, but that is harder to do with string parsing code, which requires a lot of state.
Walk the code on paper with real edge cases: lines that start or end with separators, lots of quoting, and so on. Automation and code coverage are great but, at least for me, there is no substitute for line-by-line tracing.
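To make those points concrete, here is a minimal sketch of a walk-the-string field parser. It is not the parser from Csv3.java. It assumes the common double-quote convention (a field may be wrapped in double quotes, and a doubled quote inside one stands for a literal quote), and the names parseLine, ich and cch are chosen just to show the index-versus-count naming habit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a quote-aware splitter for a single line.
class FieldParserSketch {

    // Splits one line into fields, honoring double-quoted sections.
    // Lenient by design: an unterminated quote simply runs to end of line.
    static List<String> parseLine(String line, char sep) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int cch = line.length(); // count of chars in the line
        int ich = 0;             // index of the next unread char
        boolean inQuotes = false;

        // Invariant at the top of each iteration: 'field' holds the text of the
        // current field read so far, and chars before ich are fully consumed.
        while (ich < cch) {
            char ch = line.charAt(ich);
            if (inQuotes) {
                if (ch != '"') {
                    field.append(ch);            // separators are literal in quotes
                } else if (ich + 1 < cch && line.charAt(ich + 1) == '"') {
                    field.append('"');           // doubled quote = one literal quote
                    ich++;                       // consume the second quote too
                } else {
                    inQuotes = false;            // closing quote
                }
            } else if (ch == '"') {
                inQuotes = true;                 // opening quote
            } else if (ch == sep) {
                fields.add(field.toString());    // end of the current field
                field.setLength(0);
            } else {
                field.append(ch);
            }
            ich++;
        }
        // Termination: whatever is pending is the last field (possibly empty).
        fields.add(field.toString());
        return fields;
    }
}
```

Tracing a line such as ,x, by hand gives three fields (empty, x, empty), which is exactly the kind of edge case worth walking on paper before trusting the loop.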

Other design-related issues pop up in this version as well. Where should we be using lists vs. arrays, and what should we be caching vs generating on-demand? How much of the file structure do we expose and in what ways (e.g., is my new “peekHeaders” callback appropriate)? Even the test-focused “echoMain” entrypoint needs to be questioned — is there a security risk hiding in that code?

Whew! All of this is just to highlight that there really is nothing like writing the code to make the abstract tangible. If you have the time, I recommend writing your own versions. Even with these simple examples, the calculus starts to just feel “off” when you’re doing it wrong — becoming a solid engineer requires cultivating and listening to the voices in your head.

So where to go from here? We’ve written the features and they seem to work, so it may make sense to just stay the course a little longer. But you know, it turns out that we didn’t really implement CSV escaping properly — quoted fields in a CSV file are allowed to contain newline characters, so now even our brand new readLine approach has to go!
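The mismatch is easy to demonstrate. In the sketch below (illustrative only, with a made-up "Note" column and hypothetical helper names), a file holding one header and one data record comes back from a readLine loop as three physical lines, and the middle one stops inside an open quote.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of physical lines versus logical CSV records.
class QuotedNewlineSketch {

    // Splits text the way a readLine loop sees it: one entry per physical line.
    static List<String> physicalLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // An odd number of double quotes means the line stops inside a quoted field.
    // Doubled quotes count twice, so they do not disturb the parity.
    static boolean endsInsideQuotes(String line) {
        int quotes = 0;
        for (int ich = 0; ich < line.length(); ich++) {
            if (line.charAt(ich) == '"') quotes++;
        }
        return quotes % 2 == 1;
    }

    public static void main(String[] args) throws IOException {
        String text = "Year,Name,Note\n1970,Gonzo,\"first line\nsecond line\"\n";
        List<String> lines = physicalLines(text);
        System.out.println(lines.size());                   // 3 physical lines, 2 records
        System.out.println(endsInsideQuotes(lines.get(1))); // true
    }
}
```

Any line-at-a-time parser has to either glue such lines back together (losing track of which newline style was inside the field) or move to reading characters, which is where the complexity starts to pile up.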

It’s probably time to finally cry uncle and leave this to the professionals.

V4 – Life with Dependency

Your first choice in this new world is (duh) which library to pick. Don’t rush this! A few considerations:

What open-source licensing models are you OK with? It’s easy to forget about this, but copyleft and other approaches matter to real businesses. If you’ve ever been involved in an IPO or acquisition, you’ve seen this first-hand.
For commercial libraries, what is the cost structure? Transaction-based fees can get big quick. Can you freely use the library on developer and test machines? Etc.
How well-supported is it? Is there an active development team? How often are patches released? Too often and too rarely both can be warning signs.
Does it fit with your usage patterns? Just as important as API structure is impact on your build, deployment, configuration and monitoring systems.
Will it meet your specific performance requirements (and have you proven them)?
And of course, what secondary dependencies do you have to buy into as well?

The obvious choice for CSV parsing in Java is Apache Commons CSV. And true to form, they’ve done a great job managing secondary dependencies — there are none.

The next important choice is all about abstraction. Do you minimize disruption by sliding in a new implementation under your existing API, or take on a bigger change? Unless you’re forced to by the nature of the dependency, stay strong and keep it consistent! Especially when it comes to infrastructure code, “change one thing at a time” is a rule that will rarely do you wrong.

For our purposes here I’m going to download and reference the library directly. In real life you’re probably using a build system like Maven or SBT that automatically pulls artifacts. This is great, but beware — auto-resolving the dependency tree can pull in secondary libraries you never even considered. In any case, our naked build and execution commands for Csv4.java are a little more complicated:

javac -cp commons-csv-1.8.jar:. Csv4.java
java -cp commons-csv-1.8.jar:. Csv4 "Rizzo the Rat" Muppets.csv

Dependency refactoring always raises interesting issues. For example, Commons CSV returns the list of header names as a List<String> but our method returns an array. In order to keep the API intact I’ve updated the method to do the array transform. This looks innocent, but a caller that references this method in every call to handleLine could easily encounter performance issues. A smarter implementation would sidestep this by caching the transformed version.
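One way that caching could go is sketched below. This is a stand-alone illustration, not code from Csv4.java: the class is hypothetical, the header list is passed in directly rather than fetched from the library, and the conversions counter exists only to make the effect visible.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper that converts the header list to an array only once.
class HeaderCacheSketch {

    private final List<String> headerList; // what the library hands back
    private String[] cachedHeaders;        // built lazily on first request
    int conversions = 0;                   // demo-only counter

    HeaderCacheSketch(List<String> headerList) {
        this.headerList = headerList;
    }

    // Returns the same array on every call after the first.
    String[] getHeaders() {
        if (cachedHeaders == null) {
            cachedHeaders = headerList.toArray(new String[0]);
            conversions++;
        }
        return cachedHeaders;
    }

    public static void main(String[] args) {
        HeaderCacheSketch headers = new HeaderCacheSketch(Arrays.asList("Year", "Name"));
        for (int i = 0; i < 1000; i++) {
            headers.getHeaders(); // e.g. once per handled line
        }
        System.out.println(headers.conversions); // 1
    }
}
```

The trade-off is that every caller now shares one array, so a caller that modifies it changes what everyone else sees; returning a copy would be safer but brings the per-call cost right back.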

In a real-world environment, it can be tricky to maintain your abstractions over time. For example, users may want to parse a file that contains comment lines. Commons CSV supports this feature, but our API does not. We are presented with three options:

1. Make this functionality the default for all users.
2. Extend the API with a configuration option (e.g., as an argument to process).
3. Expose the underlying objects directly (e.g., the CSVFormat object).

#1 is by far the best case if it works: add a call to withCommentMarker('#') on our CSVFormat and we’re done. The worst kind of API extension is the one that nobody cares about but everyone has to think about; keep it simple if you can!

#2 is your best choice when usage varies. But it can be hard to stick to, what with #3 batting its eyelashes at you from across the room. So easy, and you can even pretend you’re “future proofing” by making other as-yet unleveraged capabilities available. Why do the busywork of just passing configuration through your API?

Here’s the thing. If you are really, really sure you’ll never want to change up this dependency, then just go all in. A leaky abstraction — weaker than its underlying dependency while still tightly bound to it — is the absolute worst. You will hate it and your users will hate it. Better to just toss your API and go native.

Just don’t come running to me when you find some critical flaw and have to unwind dozens of random uses in parts of the codebase you barely understand. Because that will happen for sure. Over the long run, I guarantee you’ll be happier bucking up and eating the “busywork” to keep your abstraction clean.
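For what that busywork can look like, here is a minimal sketch of the second option: an options type that belongs to our API and gets translated internally. Everything in it is hypothetical (the Options class, its fields and the process signature), and the body uses plain string handling so the example runs on its own; in a version backed by the library, process would be the one place that turns these options into the library's own configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: configuration passes through our own type, so callers
// never touch the underlying library's classes.
class OptionsSketch {

    static final class Options {
        char separator = ',';
        Character commentMarker = null; // null means comment lines are not recognized

        Options separator(char c) { separator = c; return this; }
        Options commentMarker(char c) { commentMarker = c; return this; }
    }

    // Stand-in implementation; only this method would change if the
    // parsing engine underneath were swapped out.
    static List<String[]> process(List<String> lines, Options opts) {
        String sepRegex = Pattern.quote(String.valueOf(opts.separator));
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            boolean isComment = opts.commentMarker != null
                    && !line.isEmpty()
                    && line.charAt(0) == opts.commentMarker.charValue();
            if (isComment) continue; // skip comment lines when enabled
            rows.add(line.split(sepRegex, -1));
        }
        return rows;
    }
}
```

A call such as process(lines, new Options().commentMarker('#')) reads the same whether the work underneath is done by hand-rolled code or by a library, which is the whole point of keeping the abstraction clean.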

Was this all worth it?

We’ve now run through four generations of our CSV parser. That’s a lot of work. Would we have been better off taking the dependency from the start?

We picked CSV as a contrived example and in truth, for this requirement the answer is probably yes. Commons CSV is solid, low-dependency, well-maintained, performant code; in comparison and with hindsight our baby implementation was doomed all along.

But what matters here is the process — not the endgame. It is easily worth the cost of refactoring a few cases to keep the codebase crisp for the rest; simplicity almost always wins the day. And when you do decide to pick up a dependency, that assessment is infinitely more informed having been through a first implementation. Requirements defined in the abstract tend to be useless.

All of this is about optimizing in the face of unknowns and changing needs. Your code has to do the job today, and it needs to be adaptable to handle future jobs with the lowest possible cost and risk. The best engineers are always thinking about and fine-tuning this balance. Done well, it’s like magic.

Code from this post: Csv1.java, Csv2a.java, Csv2b.java, Csv3.java, Csv4.java