When you’re learning the game of software development, you probably never think of software risk. You write a few lines of code, then you celebrate seeing “Hello World” on the screen. Rinse, repeat as you build ever more impressive stuff, teaching yourself the craft.
But then you take a job doing this stuff, and that changes in a hurry. Sure, it’s mostly organizational leadership that tosses around the term “software risk” in meetings and looks to quantify it. But anyone with skin in the game feels the weight of risk acutely. What if the user types in something we don’t expect? What if the release doesn’t go well? Will we get frantic calls at 3 AM about this?
When commerce depends on the software that you write, the possibilities for problems seem endless. Risk is everywhere.
What is Risk?
Before we look at how to reduce risk, let’s take a crack at meaningfully defining it. In the world of commerce or investment, risk refers to the possibility of an investment losing value. And while that’s somewhat applicable to the commercial world in which software resides, we need to generalize a bit. Let’s take this definition from the dictionary to cover our bases:
Someone or something that creates or suggests a hazard.
Truly, this is what we mean when we talk about software risk. Sure, hazardous production behavior of the software might result in financial damages to your company. If you view the software as a commodity (which I don’t really recommend), you could even say that the presence of defects reduces its value. But the investment definition fits awkwardly. Really, we’re talking about hazard.
Software is really complicated. We build this stuff, reason about it the best we can, test it out, then send it to production like an ocean liner on its maiden voyage. And when it gets to that metaphorical ocean, a lot of unexpected things can go dramatically wrong. We talk about this as risk.
Defining Low Impact
Of course, we’re not helpless. As software development shops, we can do a lot to mitigate this risk, and we frequently do. Here are some common suggestions that you’ll hear:
- Get a good, robust, automated test suite in place.
- Ramp up exploratory QA and load/stress testing efforts.
- Break the code apart into more isolated, decoupled microservices.
- Switch methodology and “go agile.” (Or go the other direction and plan harder.)
- Hire experts to come in for an assessment.
Everything there can absolutely help. But in many small-to-medium-sized shops, these things tend to be pipe dreams: “Oh, sure, we’ll just tell our customers to wait a while and put in a bunch of tests, go agile, and refactor the whole application. And when we’re done with that, we’ll get to work on building that morale-boosting waterslide from the third floor to the first floor, too.”
So instead of the traditional approaches that demand significant investment and an overhaul of how you work, let’s look at some relatively low-impact things you can do. These still require time and behavior changes, but you can do all of them with minimal cash outlay and minimal disruption to your day-to-day work.
1. Stop Programming by Coincidence
Understanding this is easy enough. Ever have an experience like the following?
You’ve got some nagging, intermittent bug that you’ve been chasing for hours. You’ve finally got it isolated to a few lines of code, and so you throw in an intermediate debug print statement to see the value of some variable…except when you throw that debugging statement in, the bug actually goes away.
Confused, you run the code over and over to be sure. You take the debug statement out and the bug comes back. Put the debug statement back and it’s gone. You look at the clock, see that it’s 7 PM and realize that you’re starving. “Whatever,” you mutter. “Don’t ask — just be thankful.” You commit the code, mark the bug fixed, and call it a day.
If you do this, stop doing it. Don’t deliver anything to your codebase without knowing exactly why it works. Every time you do that, you’re increasing the deficit between your code’s behavior and your team’s understanding, and you’re compounding the risk of problems in production.
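Programming by coincidence isn’t limited to mystery debug statements. Here’s a hypothetical sketch of code that “works” only because of an implementation detail: in CPython, small integers (-5 through 256) are cached, so an identity check that should be an equality check happens to pass in a quick manual test.

```python
# Hypothetical example of code that works by coincidence.
# `is` compares object identity, not value. It appears to work here
# only because CPython caches ints from -5 to 256 -- an implementation
# detail, not a language guarantee.

def same_value(a, b):
    """Buggy check: should use ==, not is."""
    return a is b

# A quick test with small numbers looks fine...
print(same_value(int("100"), int("100")))   # True -- but only by luck

# ...and silently breaks outside the cached range.
print(same_value(int("1000"), int("1000")))  # False in CPython
```

A developer who tested only with small numbers and shipped it would be delivering exactly the kind of not-understood behavior this tip warns about.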
2. Enlist Automated Reviews
Stopping the “programming by coincidence” ensures that you’re not knowingly delivering code you don’t understand. But it doesn’t come close to eliminating the issue of unknowingly delivering things you don’t understand. This goes by the mundane name of “introducing bugs.”
We as software developers have many strategies for bug prevention. These include things like the aforementioned automated tests, as well as peer code reviews. But if you don’t have time for either one of those, you can get a lot of mileage out of static analyzers and automated code review tools.
Run these regularly, and you can have tons of valuable feedback on your code in seconds, correcting issues cheaply now instead of expensively later.
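To make this concrete, here’s the kind of mistake a static analyzer catches in seconds: a mutable default argument, which pylint flags as `dangerous-default-value` (W0102). The function names here are hypothetical.

```python
# A bug a static analyzer flags immediately: the default list is
# created once, at function definition time, and shared across calls.

def add_tag(tag, tags=[]):        # pylint: dangerous-default-value
    tags.append(tag)
    return tags

first = add_tag("urgent")
second = add_tag("billing")       # surprise: "urgent" carries over
print(second)                     # ['urgent', 'billing']

# The conventional fix the tool points you toward:
def add_tag_fixed(tag, tags=None):
    if tags is None:
        tags = []                 # fresh list on every call
    tags.append(tag)
    return tags
```

A human reviewer might miss this on a busy day; the tool never does, and it costs you nothing but the time to run it.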
3. Beef Up Your Logging
Another low-impact step you can take with your code is to beef up your logging a good bit. If you’re not logging at all, start. If you’re not doing much, do more.
Generally speaking, logging code is low risk and conceptually simple to add. This is especially true if you’re using a logging framework, which I’d highly recommend. It’s all pretty straightforward, and it should have minimal impact on your application’s logic and your reasoning about it.
It won’t affect your development and code much, but judicious and informative log entries can have a huge impact for you once your software is in production. This lets you play archaeologist and track down issues much more efficiently. It also lets you gather a lot more general information about your application’s behavior.
This wealth of information can help get you off the endless treadmill of fixing one defect after another.
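As a minimal sketch of what “judicious and informative” can look like, here’s Python’s standard `logging` module recording context (IDs, amounts, stack traces) rather than bare print statements. The `process_order` function and its details are hypothetical placeholders.

```python
# A minimal logging sketch using Python's standard logging framework.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("orders")

def process_order(order_id, amount):
    # Log enough context to reconstruct what happened later.
    log.info("processing order %s for $%.2f", order_id, amount)
    try:
        if amount <= 0:
            raise ValueError("amount must be positive")
        # ... real work would happen here ...
        log.info("order %s completed", order_id)
        return True
    except ValueError:
        # log.exception records the full stack trace for later archaeology.
        log.exception("order %s failed", order_id)
        return False

process_order("A-1001", 49.95)
process_order("A-1002", -5.00)
```

Notice that the code’s actual logic is untouched; the log calls are additive, which is what keeps this step low risk.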
4. Get in the Monitoring Game
You can further reduce your software risk by implementing monitoring on top of your beefed-up logging. Unlike the strategies discussed so far, this one involves tooling and process purely in production.
Now that you’ve got logging functionality spitting out a lot of good, useful, and well-formatted information, you can take extra advantage of it. Sure, the log helps you when you have an error that you’re looking to track down. But you can monitor your logs and look for problematic situations even before anyone reports them to you.
Seeing an unusual number of unauthorized login attempts? Seeing spikes in memory usage? These things are often precursors to much bigger problems that result in seriously upset users. If diagnosing these issues more quickly reduces risk, how much more risk do you eliminate by preventing them altogether? Monitoring your software can help you do just that.
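In its simplest form, monitoring is just a program watching your logs for you. Here’s a toy sketch: scan recent log lines and flag an unusual number of failed logins. The threshold, the log format, and the alert action are all hypothetical; a real setup would use a monitoring tool rather than hand-rolled code.

```python
# A toy log-monitoring sketch: count failed logins in a window of
# recent log lines and raise an alert past a (hypothetical) threshold.

FAILED_LOGIN_THRESHOLD = 3

def count_failed_logins(log_lines):
    return sum(1 for line in log_lines if "login failed" in line)

def check_logs(log_lines):
    failures = count_failed_logins(log_lines)
    if failures >= FAILED_LOGIN_THRESHOLD:
        # A real system would page someone or post to a team channel here.
        return f"ALERT: {failures} failed logins in window"
    return "ok"

recent = [
    "2024-01-01 00:00:01 INFO login ok user=alice",
    "2024-01-01 00:00:02 WARN login failed user=bob",
    "2024-01-01 00:00:03 WARN login failed user=bob",
    "2024-01-01 00:00:04 WARN login failed user=bob",
]
print(check_logs(recent))  # ALERT: 3 failed logins in window
```

The point isn’t this particular script; it’s that well-formatted logs make this kind of automated watching cheap to bolt on.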
5. Root-Cause Analysis of Issues
I’ll close with a final and more process-oriented suggestion. Make sure to do root-cause analysis on issues that come up.
So you’ve detected a spike in hardware usage, prompting you to provision more resources before any users are affected. Great! Crisis averted. But how do you know it won’t happen again?
If this sounds familiar, it’s because I’m book-ending this with item 1 from my list: the “programming by coincidence” suggestion. Both strategies involve not letting yourself be satisfied with a problem going away until you understand it. With root-cause analysis, the same principle applies to issues you’ve discovered in production and already remediated. It can be tempting to call it a day, but spend a little extra time tracing the issue back to its root cause and deciding what to do about that root cause.
Mitigating software risk is fairly easy when you have a lot of time and other resources on your side. But all too frequently, you don’t. So rely on simple hacks, some discipline, and your own ingenuity to spare yourself headaches.