Archive for the ‘architecture’ Category
Reading time: 4 – 6 minutes
Visualize your team’s code to get people enthusiastic about, and more responsible for, unit testing. See how commits contribute to or degrade testability. See massive, untested checkins from some developers, and encourage them to shift toward frequent, well-tested commits.
Several years ago I worked with Miško Hevery, who came up with an unreleased prototype to visualize commits on a bubble chart. The prototype didn’t make it to open source, and Miško’s been so busy I don’t know what happened to it. I was talking to him recently, and we think there are lots of good ideas in it that other people may want to use. We’d like to share them, in the hope that they interest someone in the open source community.
The prototype was a standard Flex bubble chart: a rich Flash widget with filtering, sorting, and other UI controls.
- Each bubble is a commit
- The X axis is a timeline
- The size of bubbles is the size of each checkin
- The colors of bubbles are unique per developer
- The Y axis represents the ratio of (test code / total code) changed in that checkin.
Size is measured in lines of code, an imperfect measure, but one that proved useful for internal use.
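As a rough illustration (not Miško’s actual code, which was never released), the two chart dimensions could be computed from `git log --numstat` output. This Python sketch is an assumption of how it might work; the helper names and the "path contains test" heuristic are hypothetical:

```python
def commit_test_ratio(numstat_lines, is_test=lambda path: "test" in path.lower()):
    """Given `git log --numstat` lines for one commit ("added\tdeleted\tpath"),
    return (total lines changed, ratio of test code to total code)."""
    total = tests = 0
    for line in numstat_lines:
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        added, deleted, path = parts
        if added == "-":  # binary files report "-" instead of a line count
            continue
        changed = int(added) + int(deleted)
        total += changed
        if is_test(path):
            tests += changed
    ratio = tests / total if total else 0.0
    return total, ratio

# One hypothetical commit: 130 lines of production change, 65 lines of test change.
sample = ["120\t10\tsrc/main/java/Cart.java",
          "60\t5\tsrc/test/java/CartTest.java"]
size, ratio = commit_test_ratio(sample)  # size -> bubble size, ratio -> Y axis
```

The commit date would supply the X axis and the author the bubble color, completing the chart’s encoding.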
We found that showing this to a few people generally generated a lot of interest. All the developers wanted to see their bubbles. People would remember certain commits they had made, and relive, or dispute with each other, why certain ones had no test code committed.
Best of all, though, when used regularly, it can encourage better behavior, and expose bad behavior.
Developers often leave patterns. When reading code, I’ve often thought “Ahh, this code looks like so-and-so wrote it.” I look it up, and sure enough I’ve recognized their style. When you have many people checking in code, you often have subtly different styles, and some add technical debt to the overall codebase. This tool is a hard-to-dispute visual aid for encouraging better style (developer testing).
Many retrospectives I lead start with a timeline onto which people place sticky notes of positive, negative, and puzzling events. This bubble chart can be projected on the wall, and milestones can be annotated. It is even more effective if you also annotate merges, refactorings, and when stories were completed.
If you add filtering per user story and/or cyclomatic complexity, this can be a client-friendly way of justifying to business people a big refactoring, why they need to pay down tech debt, or why a story was so costly.
While you must be careful with a tool like this, you can give it to managers for a mission-control-style view of what is going on. But beware of letting anyone fall into the trap of thinking people with more commits are doing more work. Frequent commits are probably a good sign, but with a tool like this, developers must continue to communicate so there are no misunderstandings or erroneous conclusions made by non-technical managers.
Some developers may claim a big code change with no test coverage was a refactoring; however, even refactorings change tests in passing (Eclipse/IntelliJ takes care of the updates even if the developer is not applying scrutiny). Thus the real cause of a large commit with no tests deserves investigation.
- Filter by user
- Filter by size or ratio to highlight the most suspicious commits
- Filter by path in the commit tree
- Show each bubble as a pie chart of different file types, and each of their respective tests
- Display trend line per team or per area of code
- Use complexity or test coverage metrics instead of lines of code
- Add merge and refactoring commit visualizations
- Color-code commits by story, and add sorting and filters to identify stories with high suspected tech debt
- Tie in bug fixes and trace them back to the original commits’ bubbles
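The size/ratio filter above could be quite simple. A hedged Python sketch, assuming each commit record carries the size and test ratio computed for the chart (the field names and thresholds are illustrative, not from the original tool):

```python
def suspicious(commits, min_size=200, max_ratio=0.05):
    """Flag large commits with little or no test code -- the first candidates
    to investigate or discuss in a retrospective."""
    return [c for c in commits
            if c["size"] >= min_size and c["ratio"] <= max_ratio]

commits = [
    {"author": "alice", "size": 450, "ratio": 0.0},   # big, untested
    {"author": "bob",   "size": 80,  "ratio": 0.0},   # small, probably fine
    {"author": "carol", "size": 300, "ratio": 0.4},   # big, but well tested
]
flagged = suspicious(commits)  # only alice's 450-line, zero-test commit
```

The same predicate could drive highlighting in the chart rather than filtering, so the suspicious bubbles stay in visual context.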
We hope some of you have created similar visualizations, or can add ideas about where you would use something like this. Thanks also to Paul Hammant for inspiration and suggestions on this article.
Reading time: 2 – 2 minutes
Following up on my previous post on Session State, there are a few conceptual ways to think about caches that I want to cover.
Cached items can be placed in session. That may be easiest, but it will soon expose limitations. For instance, session may be serialized as one big blob. If so, you won’t be able to have multiple concurrent threads populating the same cache. Imagine that when a user logs in, you want a background asynchronous thread to look up and populate data.
The key point with a cache is that the cached entries need to be independent. A session may be okay to store as a serialized object graph, but cache keys could be primed by multiple concurrent threads, so a single graph would require locking or risk overwriting parts of session. If cache entries are deep in a graph, you’re almost certain to have collisions and overwritten data.
IMHO, the most important thing is: cache keys should be deterministic. For instance, if I want to cache all of a user’s future trips (and that is a slow call), I want to be able to look in the cache without first looking up the key to the cache in some other store. I want to say “Hey, given user 12345, look in a known place in the cache (such as “U12345:futureTrips”) to see if some other thread already populated the cache value.” This does mean you need to account for more uniquely addressable cache locations in your application logic, but the extra accounting is well worth the flexibility it gives you. For instance, it allows background threads to populate that cache item without overwriting anything else.
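A minimal sketch of the idea in Python, with a plain dict standing in for a real cache such as memcached or Redis (the key scheme mirrors the “U12345:futureTrips” example; the function names are hypothetical):

```python
def future_trips_key(user_id):
    # Deterministic: any thread can compute the key directly from the user id,
    # without first consulting some other store.
    return f"U{user_id}:futureTrips"

cache = {}  # stand-in for memcached/Redis

def get_future_trips(user_id, loader):
    key = future_trips_key(user_id)
    if key not in cache:              # a background thread may have primed this
        cache[key] = loader(user_id)  # the slow call happens only on a miss
    return cache[key]

# A login-time background thread and a request-handling thread both compute
# the same key, so whichever runs first populates it for the other.
trips = get_future_trips(12345, lambda uid: ["NYC->LHR"])
```

Because each user’s trips live under their own key, populating user 12345’s entry can never clobber user 67890’s, which is exactly the independence argued for above.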
Reading time: 3 – 4 minutes
Web apps have numerous choices for storing stateful data in between requests. For example, when a user goes through a multi-step wizard, he or she might make choices on page one that need to be propagated to page three. That data could be stored in http session. (It also could get stored into some backend persistence and skip session altogether).
So where does session get stored? There are basically four choices here.
- Store state on one app server, i.e. for java in HttpSession. Subsequent requests need to be pinned to that particular app server via your load balancer. If you hit another app server, you will not have that data in session.
- Store state in session, and then replicate that session to all or many other app servers. This means you don’t need a load balancer to anchor a user to one app server: requests can hit any app server (full replication), or the load balancer can pin users to a particular cluster (whose members replicate sessions among themselves, giving higher availability). Replication can be smart so only the deltas of binary data are multicast to the other servers.
- Store the state on the client side, via cookies, hidden form fields, query strings, or client-side storage in Flash or HTML5. Rails has an option to store it in cookies automatically. Consider encrypting the session data. However, some of these options can involve a lot of data going back and forth on every request, especially if it’s in a cookie and images/scripts are served from the same domain.
- Store no state on app servers; instead, write everything in between requests down to backend persistence. You may not need the concept of HTTP session at all; use IDs to look up those entities. Persistence could be a relational database, distributed/replicated key/value storage, etc. Your session data is serialized as one big object graph, or as multiple specific entries.
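For the client-side option, a common pattern (and roughly what Rails’ cookie store does) is to sign the serialized state so the client cannot tamper with it; encrypt it as well if the contents must stay secret. A sketch in Python, with a hypothetical server-side secret:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"rotate-me"  # hypothetical server-side signing key

def encode_session(data: dict) -> str:
    """Serialize wizard state into a signed, cookie-safe string."""
    payload = base64.urlsafe_b64encode(json.dumps(data).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def decode_session(cookie: str) -> dict:
    """Verify the signature before trusting anything the client sent back."""
    payload, sig = cookie.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered session cookie")
    return json.loads(base64.urlsafe_b64decode(payload))

# Page-one choices ride along to page three inside the cookie itself,
# so any app server can handle the next request.
cookie = encode_session({"step": 3, "dest": "LHR"})
```

Note that signing alone only prevents tampering; the payload here is still readable by anyone holding the cookie.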
Which you choose depends on many factors, but a few guidelines help:
- Try to keep what is in session small.
- If possible, keep session state on the client side.
- Prefer key/value storage replicated among a small cluster over replicating all session state among all app servers.
- Genuinely consider sticky sessions, which home users to a particular app server. Many benefits abound, including what Paul Hammant talks about w.r.t. Servlet spec 7.7.2 in “Appengine’s Blind Spot.”
- If you serialize an object graph, recognize that a deployment will probably mean existing sessions can no longer be deserialized by the newly deployed app. Avoid this by using your load balancer to swing new traffic to the new deployment, monitoring errors, and then letting the old sessions expire before switching all users over. Bleed them over.
- Session is not for caching. It may be tempting to store data in session for caching purposes, but soon you will need a cache.
- Store in session what is absolutely necessary for that user, but not more. See caches, above.
Reading time: 5 – 8 minutes
Engineering something to be scalable, highly available, and easily manageable has been the focus of much of my time recently. Last time I talked about spiderweb architecture, because it has attributes of scalability and high availability, yet comes with a hidden cost. Complexity.
Here is a fictional set of questions, and my responses for the application architecture.
Q: Why does complexity matter?
JAW: Because when your system is complex, there is less certainty. Logical branches in the possible states of a system mean more work for engineers to build a mental model and decide what action to take. Complexity means there are more unique points of failure.
Q: But my team is really, really smart; my engineers can handle clever and complex mental models!
JAW: That wasn’t a question, but I do have a response. Given a team at any moment in time, there is a finite amount of complexity that the team can deal with. Complexity can be in the application’s logic, dealing with delivering business value. Or it can be in non-functional requirements. If the NFRs can be met with lower complexity, that translates directly into more business value. A team will grow in its ability to manage complexity as it understands more and more of it, and team size can increase, although those productivity gains can be spent on business value or on complex architectures. And often, NFRs can be met while still achieving simplicity.
Q: So how do I deal with a large, complex application which needs an emergency fix on one of the small components?
JAW: Yes, I know the scenario. You want to make a small change in production, and it sounds less risky to push only one part. Here’s my recipe for success: make every deployment identical, and automated. (Ideally, push into production from continuous builds, with automated testing.) In the event of an emergency push into production, alter the code from your version control tag, and deploy that as you would every other push. My colleague Paul Hammant calls non-standard, risky pushes “white knuckle three-in-the-morning deployments.”
Don’t make the e-fix a one-off, non-standard production push. Keep the entire system simple and repeatable. With repeatability and automated repetition comes security. Very flexible (read: complex), extensible (read: rarely exercised day to day) hooks can be built into a system to make it possible to push just one small component into production. In reality, however, unused code becomes stale, and when a production emergency happens, people will be scared to try these hooks. Or if they do, there is a greater risk of a misconfiguration and failure, which will then necessitate a fix of the failed fix that tried to fix the original tiny defect. More complexity. Blowing the original availability requirements out of the water.
Q: So, what is simplicity?
JAW: My definition says: Simplicity is the preference for fewer combinatorial states a system can be in. Choose defaults over configuration wherever possible.
I recently read a quote from High Scalability, which I think gives a good definition of what simplicity is (emphasis added):
“Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It’s true that nobody really knows what simplicity is, but if you aren’t afraid to make changes then that’s a good sign simplicity is happening.”
[Caveat: some complexity makes sense; it’s just that too much in the wrong places increases risk. And there is a threshold everyone needs to find: how much risk, how much flexibility, and how much energy to devote to reducing the risk while keeping high flexibility.]
The concept of interchangeable parts, a precondition of modern manufacturing, can help simplify the lower layers of an application stack, yet it isn’t always embraced as a virtue. A common behavior of small teams on a tight budget is to tightly fit the building blocks of their system to the task at hand. It’s not uncommon to use different hardware configurations for the webservers, load balancers (more bandwidth), batch jobs (more memory), databases (more of everything), development machines (cheaper hardware), and so on. If more batch machines are suddenly needed, they’ll probably have to be purchased new, which takes time. Keeping lots of extra hardware on site for a large number of machine configurations becomes very expensive very quickly. This is fine for a small system with fixed needs, but the needs of a growing system will change unpredictably. When a system is changing, the more interchangeable the parts are, the more quickly the team can respond to failures or new demands.
In the hardware example above, if the configurations had been standardized into two types (say Small and Large), then it would be possible to muster spare hardware and re-provision as demand evolved over time. This approach saves time and allows flexibility, and there are other advantages: standardized systems are easy to deploy in batches, because they do not need assigned roles ahead of time. They are easier to service and replace. Their strengths and weaknesses can be studied in detail.
All well and good for hardware, but in a hosted environment this sort of thing is abstracted away anyway, so it would seem to be a non-issue. Or is it? Again using the example above, replace “hardware” with “OS image” and many of the same issues arise: an environment where different components depend on different software stacks creates additional maintenance and deployment headaches and opportunities for error. The same could be said for programming languages, software libraries, network topologies, monitoring setups, and even access privileges.
The reason that interchangeable parts become a key scaling issue is that a complex, highly heterogeneous environment saps a team’s productivity (and/or a system’s reliability) to an ever-greater degree as the system grows. (Especially if the team is also growing, and new developers are introducing new favorite tools.) The problems start small, and grow quietly. Therefore, a great long-term investment is to take a step back and ask, “what parts can we standardize? Where are there differences between systems which we can eliminate? Are the specialized outliers truly justified?” A growth environment is a good opportunity to standardize on a few components for future expansion, and gradually deprecate the exceptions.
Large Web App Architecture: Yes to Thicker Stack on One Hardware Node, No to Beautiful “Redundant” Spiderwebs
Reading time: 4 – 7 minutes
The last client our team worked with ran a large ecommerce operation. Yearly revenue on the new site is in the high single-digit billions of dollars, which necessitates extremely high availability. I will draw an initially favorable-looking configuration for this high availability (“beautiful spiderwebs”), then tear it apart and suggest an alternative (“thicker stack on one hardware node”).
1. “Beautiful Spiderwebs” – Often Not Recommended
Here’s one common way people could implement high availability. Notice how there are always multiple routes available for servicing a request. If one BIG IP goes down, there is another to help. And this could be doubled with multiple data centers, failed over with DNS.
The visible redundancy and complexity in one diagram may be appealing. One can run through scenarios in order to make sure that yes, we can actually survive any failure and the ecommerce will not stop.
So then what could make this my Not Recommended option?
2. Martin’s Reminder how to Think About Nodes
Fowler reminded us in Patterns of Enterprise Application Architecture how to look at distribution and tiers. For some reason people keep wanting certain machines running certain services, stitching up everything they need with a few service calls. If you’re concerned about performance, though, you’re asking for punishment: remote calls are several orders of magnitude slower than in-process calls, or calls within the same machine. And this architectural preference is rarely necessary.
One might lead to the first design with the logic of: “We can run each component on a separate box. If one component gets too busy we add extra boxes for it so we can load-balance our app.” Is that a good idea?
The above is not recommended:
A procedure call between two separate processes is orders of magnitude slower [than in-process]. Make that a process running on another machine and you can add another order of magnitude or two, depending on the network topography involved. [PoEAA Ch 7]
This leads into his First Law of Distributed Object Design: Don’t distribute your objects!
Put all the classes into a single process and then run multiple copies of that process on the various nodes. That way each process uses local calls to get the job done and thus does things faster. You can also use fine-grained interfaces for all the classes within the process and thus get better maintainability with a simpler programming model. [PoEAA Ch 7]
3. “Strive for Thicker Stack on One Hardware Node” – Recommended
Observe the recommended approach below. There is still an external load balancer, but after a request is routed to an Apache/Nginx/etc front end, you’re all on one* machine.
If one tier fails on a node, pull the whole node out of rotation, replace it, and re-enter it into the mix.
Your company’s teams have worked together to be able to deploy modular services. So when your ecommerce site needs a merchant gateway processing service, you can include it (as a library or binary) and run it locally on your node, making calls through to it as needed.
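In code, the idea is that callers depend on a service interface, and the implementation that ships on the node is an in-process library. A hypothetical Python sketch (the gateway names and response shape are illustrative):

```python
class MerchantGateway:
    """The interface the payments team publishes for other teams to depend on."""
    def charge(self, card, cents):
        raise NotImplementedError

class InProcessGateway(MerchantGateway):
    """Shipped as a library inside the app process: a local call, no network hop."""
    def charge(self, card, cents):
        # Real logic would talk to the card network; stubbed here.
        return {"status": "approved", "cents": cents}

def checkout(gateway: MerchantGateway, card, cents):
    # The web tier depends only on the interface; whether it is backed by a
    # local library or (rarely) a remote proxy is a deployment decision.
    return gateway.charge(card, cents)

result = checkout(InProcessGateway(), "test-card", 2599)
```

The organizational boundary (a separate team owns `MerchantGateway`) survives, but the process boundary, and its orders-of-magnitude latency cost, does not.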
Services are also simpler to deploy, upgrade and monitor as there are fewer processes and fewer differently-configured machines.
(* I understand there may be the occasional exception for remote calls that need to be made to other machines: possibly databases, memcached, and obviously third-party hosted services. But the point is that most everything else need not be remote.)
4. But, Practically Speaking How Far Do We Go?
A caveat first: these benefits become pronounced as you have more and more nodes (and thus more and more complex spiderwebs of unnecessary failover).
Should there be a database server running on each node? Probably not at first; there is maintenance overhead associated with that. But after sharding your database and running with replication, why not? This way, if a node fails, you simply pull it out and replace it with a functioning one.
5. Checklist of Takeaway Lessons
- Keep it local. Local calls are orders of magnitude faster than remote calls.
- Make services modular so they don’t need to be remote, yet still have all the organizational benefits of separate teams.
- Simplicity in node-level-redundancy is preferred over tier-level-redundancy.
Often, people think of high availability in terms such as round robin, load balancing, and failover. What do you think of? Leave a comment below about how you meet the trade-offs of designing for HA, as well as architectural decisions for low latency.