Large Web App Architecture: Yes to Thicker Stack on One Hardware Node, No to Beautiful “Redundant” Spiderwebs
Reading time: 4 – 7 minutes
My last client our team worked with had a large ecommerce operation. Yearly revenue in the new site is in the high single digit billions of dollars. This necessitates extremely high availability. I will draw an initially favorable looking configuration for this high availability (“beautiful spiderwebs”), but then tear it apart and suggest an alternative (“Thicker Stack on One Hardware”).
1. “Beautiful Spiderwebs” – Often Not Recommended
Here’s one common way people could implement high availability. Notice how there are always multiple routes available for servicing a request. If one BIG IP goes down, there is another to help. And this could be doubled with multiple data centers, failed over with DNS.
The visible redundancy and complexity in one diagram may be appealing. One can run through scenarios in order to make sure that yes, we can actually survive any failure and the ecommerce will not stop.
So then what could make this my Not Recommended option?
2. Martin’s Reminder how to Think About Nodes
Fowler reminded us in Patterns of Enterprise Application Architecture how to look at distribution and tiers. For some reason people keep wanting to have certain “machines running certain services” and just make a few service calls to stitch up all the services you need. If you’re concerned about performance, though, you’re a looking for punishment. Remote calls are several orders of magnitude greater than in process, or calls within the same machine. And this architectural preference is rarely necessary.
One might lead to the first design with the logic of: “We can run each component on a separate box. If one component gets too busy we add extra boxes for it so we can load-balance our app.” Is that a good idea?
The above is not recommended:
A procedure call between two separate processes is orders of magnitude slower [than in-process]. Make that a process running on another machine and you can add another order of magnitude or two, depending on the network topography involved. [PoEAA Ch 7]
This leads into his First Law of Distributed Object Design: Don’t distribute your objects!
The solution?
Put all the classes into a single process and then run multiple copies of that process on the various nodes. That way each process uses local calls to get the job done and thus does things faster. You can also use fine- grained interfaces for all the classes within the process and thus get better maintainability with a simpler programming model. [PoEAA Ch 7]
3. “Strive for Thicker Stack on One Hardware Node” – Recommended
Observe the recommended approach below. There is still an external load balancer, but after a request is routed to an Apache/Nginx/etc front end, you’re all on one* machine.
If one tier fails on a node, pull the whole node out from rotation. Replace it. And re-enter it in the mix.
Your companies teams have worked together to be able to deploy modular services. So when your ecommerce site needs a merchant gateway processing service, you can include that (library or binary) and run it locally on your node, making a call through to it as needed.
Services are also simpler to deploy, upgrade and monitor as there are fewer processes and fewer differently-configured machines.
(* I understand there may be the occasional exception for remote calls that need to be made to other machines. Possibly databases, mcached obviously third party hosted services, but the point is most everything else need not be remote.)
4. But, Practically Speaking How Far Do We Go?
A caveat first: these benefits get pronounced as you have more and more nodes. (And thus, more and more complex of spiderwebs of unnecessary failover).
Should there be a database server running on each node? Probably not at first. There is a maintenance associated with that. But after sharding your database and running with replication, why not? This way if a node fails, you simply pull it out and replace it with a functioning one.
5. Checklist of Takeaway Lessons
- Keep it local. Local calls orders of magnitude faster than remote calls.
- Make services modular so they don’t need to be remote, yet still have all the organizational benefits of separate teams.
- Simplicity in node-level-redundancy is preferred over tier-level-redundancy.
Often, people think of high availability with terms such as the following: Round Robin, Load Balancing, and Failover. What do you think of? Leave a comment below with how you meet the trade-offs of designing for HA as well as architectural decisions of low latency.





Actually it was a little more spiderweby than the first picture indicates.
Remember, we had the ESB instances on the App Server which also meant that the App Servers had meshed connections to each other as well as the Backend Services.
Drawing it will make the picture less beautiful, which is appropriate as it begins to reflect the torturous mess that it really is.
JoshG
19 Aug 09 at 4:48 am
I think this summarizes the (painful) lessons learned from deploying J2EE applications. Using EJBs to distribute “logical” components resulted in punishing performance, that had to be optimized with direct database queries. At some point, we realized we were spending far too much time working around the J2EE framework. Hence the solution to use POJOs and load-balance / fail-over the complete stack.
Gerald Boersma
19 Aug 09 at 11:55 am
I can’t event read this article. That bloody scrolling twitter thing on the side is too distracting!
Someone
16 Sep 09 at 10:51 pm
[...] and easily manageable has been the focus of much of my time recently. Last time I talked about spiderweb architecture, because it has attributes of scalability and high availability, yet comes with a hidden cost. [...]
Simplicity is Better for Deploying in Production Web Architectures at JAW Speak
30 Jan 10 at 1:20 pm