Reading time: 5 – 8 minutes
Engineering something to be scalable, highly available, and easily manageable has been the focus of much of my time recently. Last time I talked about spiderweb architecture, because it has attributes of scalability and high availability, yet comes with a hidden cost. Complexity.
Here is a fictional set of questions, and my responses for the application architecture.
Q: Why does complexity matter?
JAW: Because when your system is complex, there is less certainty. Logical branches in the possible state of a system mean more work for engineers to create a mental model, and decide what action to take. Complexity means there are more points of unique failure.
Q: But my team is really, really smart; my engineers can handle clever and complex mental models!
JAW: That wasn’t a question, but I do have a response. Given a team at any moment in time, there is a finite amount of complexity that the team can deal with. Complexity can be in the application’s logic, dealing with delivering business value. Or, it can be in non functional requirements. If the NFR’s can be met with lower complexity, this will translate directly to more business value. A team will grow in their ability to manage complexity as they understand more and more of it, and team size can increase. Although those productivity increases can be used for business value, or complex architectures. And often, NFR’s can be met while still achieving simplicity.
Q: So how do I deal with a large, complex application which needs an emergency fix on one of the small components?
JAW: Yes, I know the scenario. You want to make a small change into production, but it sounds less risky to only push one part. Here’s my recipe for success: make every deployment identical, and automated. (Ideally push into production from continuous builds, with automated testing.) In the event of an emergency push into production, alter the code from your version control tag, and deploy that as you would every other push. My colleague Paul Hammant call non-standard, risky pushes “white knuckle three-in-the-morning deployments.”
Don’t make the e-fix a one-off, non-standard production push. Have the entire system simple, and repeatable. With repeatability and automated repetition comes security. Very flexible (read: complex), extensible (read: rarely tested day to day) hooks can be built into a system, in order to make it possible to push just one small component into production. However in reality unused code becomes stale, and when a production emergency happens, people will be so scared to try these hooks. Or if they do, there is a greater risk of a misconfiguration, and failure. Which will necessitate a fix of the failed fix which tried to fix the original tiny defect. More complexity. Blowing the original availability requirements out of the water.
Q: So, what is simplicity?
JAW: My definition says: Simplicity is the preference of fewer combinatorial states a system can be in. Choose defaults over
I recently read a quote from High Scalability, which I think gives a good definition of what simplicity is (emphasis added):
“Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It’s true that nobody really knows what simplicity is, but if you aren’t afraid to make changes then that’s a good sign simplicity is happening.“
[Caveat: some complexity makes sense, it's just too much in the wrong places increases risk. And there is a threshold everyone needs to find: how much risk, how much flexibility, and how much energy to devote to reducing the risk while keeping high flexibility.]
A preconditon of modern manufacturing, the concept of interchangeable parts that can help simplify the lower layers of an application stack, isn’t always embraced as a virtue. A common behavior of small teams on a tight budget is to tightly fit the building blocks of their system to the task at hand. It’s not uncommon to use different hardware configurations for the webservers, load balancers (more bandwidth), batch jobs (more memory), databases (more of everything), development machines (cheaper hardware), and so on. If more batch machines are suddenly needed, they’ll probably have to be purchased new, which takes time. Keeping lots of extra hardware on site for a large number of machine configurations becomes very expensive very quickly. This is fine for a small system with fixed needs, but the needs of a growing system will change unpredictably. When a system is changing, the more heavily interchangeable the parts are, the more quickly the team can respond to failures or new demands.
In the hardware example above, if the configurations had been standardized into two types (say Small and Large), then it would be possible to muster spare hardware and re-provision as demand evolved over time. This approach saves time and allows flexibility, and there are other advantages: standardized systems are easy to deploy in batches, because they do not need assigned roles ahead of time. They are easier to service and replace. Their strengths and weaknesses can be studied in detail.
All well and good for hardware, but in a hosted environment this sort of thing is abstracted away anyway, so it would seem to be a non-issue. Or is it? Again using the example above, replace “hardware” with “OS image” and many of the same issues arise: an environment where different components depend on different software stacks creates additional maintenance and deployment headaches and opportunities for error. The same could be said for programming languages, software libraries, network topologies, monitoring setups, and even access privileges.
The reason that interchangeable parts become a key scaling issue is that a complex, highly heterogeneous environment saps a team’s productivity (and/or a system’s reliability) to an ever-greater degree as the system grows. (Especially if the team is also growing, and new developers are introducing new favorite tools.) The problems start small, and grow quietly. Therefore, a great long-term investment is to take a step back and ask, “what parts can we standardize? Where are there differences between systems which we can eliminate? Are the specialized outliers truly justified?” A growth environment is a good opportunity to standardize on a few components for future expansion, and gradually deprecate the exceptions.