A tale of utilization vs performance

I once worked on a SaaS product that’s typical of a lot of B2B and niche B2C products.  It had a userbase in the thousands or tens of thousands, and provided business functions that involved heavyweight, complex workflows.  Not many users, lots of compute.  I would say that this is the norm for line-of-business applications, outside of the industry rockstars that are in the news all the time.

We didn’t need Docker.  Docker’s claim to fame is that it’s lightweight.  Once you deploy a multi-gigabyte JVM app that can keep lots of cores busy, it really doesn’t matter whether you’ve deployed it onto a VM or into a container.  The host overhead is in the noise.  In a containers-on-VMs scenario, no VM is ever going to run more than one instance of the container, so the container layer is just management and skillset overhead.

We didn’t need autoscaling.  This application could keep a server farm busy, but it really only needed the one server farm.  It was easily sharded, if we did hit those limits.  It wasn’t bursty, and it used enough queuing that it could handily absorb the bursts that did happen.

We didn’t need Kubernetes because we didn’t need Docker or autoscaling.  Again, unnecessary overhead in skillset and tooling.

We didn’t need these things because we weren’t Twitter.  We weren’t even one micro-Twitter (if that’s a unit of scale).  We might have got to one million users total, eventually (although I don’t think we ever did).  We weren’t ever going to get to one million users a month, let alone a day.

We did have performance problems.  Of course we did.  They were caused by bad SQL and bad algorithms.  We knew those were the causes, because we could see the issues staring us in the face.  Every time we did a PoC optimisation exercise, we could easily find improvements of multiple orders of magnitude.  But we never committed to regression testing those and getting them through to production.

Instead, we addressed our performance problems by spending so much on infrastructure that we could have hired two or three more developers.  You have to buy a lot of infrastructure to make your application 1000 times faster.  We settled for a bit less than that.

But it didn’t matter how fast or slow our application was, because we fixated on utilization.  The faster we made our application, the worse our utilization looked.  To a server admin, an average CPU utilization of 20% looks like a healthy server.  To an accountant, it looks like an 80% cost saving waiting to be had.
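To make that arithmetic concrete, here is a minimal sketch.  The numbers (40 requests a second, a 100-core farm, a 10x speed-up) are made up for illustration and are not figures from the real system; the function name is hypothetical too.  It only shows that, at a fixed load on fixed hardware, cutting the CPU cost of each request cuts average utilization by the same factor.

```python
def utilization(requests_per_sec: float, cpu_sec_per_request: float, cores: int) -> float:
    """Average CPU utilization as a fraction of total core capacity."""
    return requests_per_sec * cpu_sec_per_request / cores


# Hypothetical workload: 40 requests/s arriving at a 100-core farm.
before = utilization(40, 2.0, 100)  # each request burns 2 s of CPU
after = utilization(40, 0.2, 100)   # same load after a 10x optimisation

print(f"before: {before:.0%}, after: {after:.0%}")
```

Same users, same hardware, much faster responses, and a utilization figure that now looks ten times worse on the accountant’s spreadsheet.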

So we took our slow, unoptimized application, and moved it to Docker and Kubernetes.  We didn’t get our extra developers, so we never got to optimize at all.  We took a big hit in training and migration, so productivity dipped.  Reliability got worse for a while, because we made some mistakes in ops.  And our application still had performance problems, because any one request by any one user was still running massively underperforming algorithms and overwhelming the database with unoptimisable queries and deadlocks.

However, our utilization figures were immaculate.  As for the performance issues: when I left, they were talking about putting those bad algorithms into lambdas.