How to efficiently scale your workload in Kubernetes using observability

 

Observability – when you get it right – can unlock a huge range of efficiencies and new capabilities. To explore this in relation to Kubernetes, Tyk sat down with self-confessed observability geek Henrik Rexed, Cloud Native Advocate at Dynatrace. Henrik walked us through how to efficiently auto-scale your workload in Kubernetes, relying on observability to do so. 

In this blog, we’re covering:

  • Why it’s essential to define resource quotas in Kubernetes and optimize your resource allocations.
  • How you can go further than the default metrics exposed by the Kubernetes API, adding external metrics for superior insights. 
  • Why – and how – to report the cost of your workload from day one and use this as a key performance indicator (KPI) to optimize your environment while auto-scaling. 

Watch the full webinar video via the link above or read on for key insights and top takeaways. 

Why auto-scale in Kubernetes using observability? 

Using observability to auto-scale in Kubernetes can save you time and money. It can also support your green IT agenda by reducing the energy consumption of an environment. While we’re focusing on cost below, the energy point comes as a natural bonus – optimize the cost and you’ll tick the energy checkbox as a result. 

An auto-scaling example

Consider the example of a big brand that decided to launch a website for a new sub-brand. The main brand had significant traffic on its existing website, but the second site was entirely new – new website, new brand, new offering. That meant lots of time spent building and testing its applications. 

When the new site – a retail portal – launched and delivered stable performance, the team was pleased to progress to working on innovations and new features. However, it was also important to reduce the cost of the new environment. 

What is the cloud cost of your Kubernetes cluster?

How do cloud providers charge for your cloud environments? This is a hugely important question, because running Kubernetes environments that aren’t optimized is very expensive. 

Optimizing your cluster means stepping back and finding out where you can optimize, plus understanding how cloud providers charge. 

A cluster is like an onion, with lots of different layers. First is the cloud provider, where you deploy. The cloud provider manages your control plane (master) nodes and charges a fee for that management. Then you have your nodes across different regions, plus egress traffic – so if data leaves one availability zone and jumps to another, that’s going to cost you, too. 

These costs depend on the number and type of nodes running in your cluster, with CPU and memory coming into play. If you’re running AI or machine learning workloads, GPUs will probably add an extra dimension to your costs. 

Then, of course, there’s storage, with the size and number of persistent volume claims (PVCs) impacting your bill. With the retail portal example above, there’s also the cost of exposing the site via an external IP address so public clients can connect to it. 

Where to optimize

When it comes to optimizing the cost of your cluster, you can resize your nodes, persistent volumes or workloads. 

Sticking with the website example above, resizing the nodes doesn’t make sense, as they have been sized to suit future growth. The same is true of persistent volumes – there is little point in reducing them now only to increase them later. 

This leaves resizing the workload. 

How to resize your workload 

With Kubernetes, every time you deploy a workload (no matter how you deploy it), you should specify resource allocations. Whether you have one, two, three or x nodes, in total your cluster has a certain capacity of CPU and memory. Every time you deploy to Kubernetes, the scheduler takes the resources you request and reserves them on the cluster, meaning nobody else can use them. 

Optimizing means looking at that usage. If you’re only using a small percentage of the resources you’ve requested, perhaps you’re over-provisioning. This means there is scope for optimization, based on looking at usage versus requests and at usage versus limits. 
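
To make this concrete, here’s a minimal sketch of what those resource requests look like in a Deployment manifest – the workload name, image and numbers are purely illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                      # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example/frontend:1.0   # placeholder image
          resources:
            requests:
              cpu: 250m       # the scheduler reserves this much CPU per pod
              memory: 256Mi   # and this much memory, whether it is used or not
```

Requests are what the scheduler reserves; comparing them against real usage is what tells you whether you’re over-provisioning.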

Start by creating a baseline. You can use the observability stack in your production environment to measure it, or you can load test your current settings to get data that paints a picture of how everything behaves in terms of cost and utilization. With your data in hand, it’s time to look at the numbers, reconfigure, rerun the simulations, analyze and so on; it’s an iterative process. 
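
For example, if Prometheus is scraping cAdvisor and kube-state-metrics (more on tooling below), recording rules along these lines give you a usage-versus-requests baseline per namespace – adjust metric and label names to your setup:

```yaml
groups:
  - name: workload-efficiency
    rules:
      # CPU actually used vs CPU requested, per namespace
      - record: namespace:cpu_usage_vs_requests:ratio
        expr: |
          sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          /
          sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
      # Memory actually used vs memory requested, per namespace
      - record: namespace:memory_usage_vs_requests:ratio
        expr: |
          sum by (namespace) (container_memory_working_set_bytes{container!=""})
          /
          sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})
```

A ratio sitting well below 1 over a sustained period is the signal that there’s room to shrink your requests.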

Using the right tooling 

If your architecture uses mainly a microservices approach, K6 and Dynatrace can work well in terms of focusing your attention on response times. You’ll also need to think about visibility on the cluster, making Prometheus a sensible choice. 

For measuring cost, OpenCost from the Cloud Native Computing Foundation (CNCF) is great for exposing the estimation of a workload’s costs. You can then use that cost as a daily KPI. 
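
If Prometheus also scrapes OpenCost, you can roll those estimates up into a daily cost KPI. The metric names below (`container_cpu_allocation`, `node_cpu_hourly_cost`) come from OpenCost’s Prometheus exposition, but labels and joins can vary between versions, so treat this as a starting point rather than a drop-in rule:

```yaml
groups:
  - name: daily-cost-kpi
    rules:
      # Rough daily CPU cost per namespace: allocated cores x hourly node CPU price x 24h
      - record: namespace:estimated_cpu_cost_daily:sum
        expr: |
          sum by (namespace) (
            container_cpu_allocation
            * on (node) group_left()
            node_cpu_hourly_cost
          ) * 24
```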

In the particular case of the retail portal website discussed above, Istio is used as a service mesh, so Envoy is in the mix as well. This has the potential to provide a couple of extra indicators to support optimization. 

With these tools in place, you can use the OpenTelemetry Collector to collect the data, send it to Dynatrace and build dashboards to display the results. You can use those dashboards to track your efficiency – what you’re using versus what you requested – targeting the most efficient workload while also keeping an eye on cost. 
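
A Collector pipeline for that might look something like the sketch below – the scrape target, environment ID and token are placeholders, and the Dynatrace OTLP endpoint follows Dynatrace’s documented pattern, so check it against your own environment:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: opencost                           # hypothetical scrape job
          static_configs:
            - targets: ["opencost.opencost.svc:9003"]  # default OpenCost metrics port (assumption)

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: "https://<your-environment-id>.live.dynatrace.com/api/v2/otlp"
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"   # token supplied via environment variable

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp]
```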

You can try this approach yourself by playing around with Henrik’s GitHub environment: https://github.com/henrikrexed/Autoscaling-workshop

Using the dashboards and comparing results to your baseline can reveal plenty of insights, from unstable response times to changing costs as you iterate, as well as resource usage and efficiency. 

Unstable response times 

Unstable response times are worth a quick sidenote here. You might see them in your baseline or in your optimized solutions. They result from the way in which you allocate resources with Kubernetes. 

Essentially, you need to put the right resource request numbers in when you deploy, to help Kubernetes be more efficient. If you don’t put anything in, you’re not helping Kubernetes operate efficiently. Likewise, if you over-allocate resources, as we mentioned above, you’re losing efficiency. 

A word about limits 

You also need to consider limits. With Kubernetes, you use limits to define maximum values. So, if you run the right load test, you can figure out the maximum memory your workload needs. Memory limits are defined in bytes; hit your memory limit and Kubernetes will kill your container (an OOMKill). 

You can also define CPU limits, expressed in CPU time. This is based on sharing CPU resources – essentially, sharing CPU cycles. You define how much CPU time you need per scheduling period, and when you hit that quota, Kubernetes pauses your container until the next period. The result is a cycle of working and pausing, working and pausing, working and pausing, known as CPU throttling. 

The result? Unstable response times. 
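
To make that behavior concrete, here’s what limits look like in a container spec. The values are illustrative, and the 100ms period in the comments is the Linux CFS default rather than anything specific to this example:

```yaml
# Fragment of a pod spec – illustrative values only
containers:
  - name: frontend
    image: example/frontend:1.0      # placeholder image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        memory: 512Mi   # exceed this and the container is OOMKilled
        cpu: 500m       # roughly 50ms of CPU time per 100ms CFS period;
                        # once the quota is spent, the container is throttled
                        # until the next period – the work/pause pattern above
```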

Scaling Kubernetes with HPA

If your solution is to have a higher number of smaller pods, rather than a lower number of larger pods, it’s time to scale. And this is where the Horizontal Pod Autoscaler (HPA) comes into its own. 

The HPA is a built-in object in Kubernetes, so you don’t have to install anything to use it. Happy days. You can use it to create a rule that automatically scales the replica count of your deployment or stateful set based on metrics, against thresholds you configure. 

Without HPA, you can still manage your deployment or stateful set by setting a replica number and adding or removing pods yourself, but it’s quite a manual approach. 

By default, HPA relies on metrics exposed through the Kubernetes metrics API, which makes two metrics available: CPU usage and memory usage. You define a rule on those two indicators, then the HPA adds or removes pods depending on the observed value. 
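
A minimal HPA rule on CPU utilization looks something like this – the target name and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend          # hypothetical deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU usage exceeds 70% of requests
```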

What should you scale? 

Let’s consider an example using the default HPA rule. You can focus on all the components that provide business value and need to respond properly: your frontend, product catalog, cart, shipping, payment and so on – all of your critical and connected services. 

Now, when you auto-scale, there’s one golden rule: don’t scale at the moment you need it – scale before. Why? Because it can take about a minute for a container to start serving traffic – more if you’re running Java or Spring Boot. So, you have to account for the time it takes for a pod to become available and actually serve traffic. If you don’t, you’ll scale too late and get a domino effect where everything goes down. 

To avoid that happening, you need to measure how long it takes to get a healthy, live component. Let’s say it’s two minutes. Then, instead of triggering the scale-up at the moment you actually need the extra capacity, you trigger it two minutes earlier – for example, by scaling on a lower threshold that is reached ahead of the crunch. 
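
One way to build in that headroom is directly in the HPA spec: a lower utilization target combined with the `behavior` section from autoscaling/v2, so scale-ups react quickly and scale-downs are dampened. The values below are illustrative:

```yaml
# Fragment of an HPA spec – illustrative values only
spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 55     # lower target = scale before saturation
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately to rising load
      policies:
        - type: Pods
          value: 4                     # add up to 4 pods per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before removing pods
```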

Using external metrics 

While we’ve used the CPU and memory metrics above, it’s possible to use external metrics as well. Kubernetes relies on a metrics adapter to achieve this. An adapter is like a dictionary that sits behind the Kubernetes API and translates its calls into the right queries on the backend. At the time of writing, there are adapters for Prometheus and Datadog. 

Unfortunately, you can only have one adapter per cluster, yet most of us use several observability solutions. The answer here is Keptn, a CNCF project that provides a unified metrics server. It plugs into the Kubernetes API and exposes providers, translating metric definitions into the right queries for the backend of your choice. 

The KeptnMetricsProvider currently supports Prometheus, Dynatrace and DQL (Dynatrace Query Language). You deploy the metrics server in your cluster, define a metrics provider, then define the actual metrics you need. After that, you can create your HPA rule relying on the Keptn metric. 
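
Roughly, the moving parts look like the sketch below. Treat it as an outline: the API versions, the placeholder query and the names are assumptions that depend on your Keptn metrics operator release and your Dynatrace environment, so check them against the Keptn docs before using:

```yaml
apiVersion: metrics.keptn.sh/v1beta1          # API version depends on your Keptn release
kind: KeptnMetricsProvider
metadata:
  name: dynatrace
  namespace: retail-portal                    # hypothetical namespace
spec:
  type: dynatrace
  targetServer: "https://<your-environment-id>.live.dynatrace.com"
  secretKeyRef:
    name: dt-api-token                        # secret holding your Dynatrace API token
    key: DT_TOKEN
---
apiVersion: metrics.keptn.sh/v1beta1
kind: KeptnMetric
metadata:
  name: envoy-throttling                      # hypothetical metric name
  namespace: retail-portal
spec:
  provider:
    name: dynatrace
  query: "<your Dynatrace metric selector, e.g. a CPU throttling or Envoy latency metric>"
  fetchIntervalSeconds: 10
---
# The HPA rule then targets the KeptnMetric as an Object metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-keptn
  namespace: retail-portal
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: metrics.keptn.sh/v1beta1
          kind: KeptnMetric
          name: envoy-throttling
        metric:
          name: envoy-throttling
        target:
          type: Value
          value: "10"                         # illustrative threshold
```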

Using the example retail portal site, you can then base the rule on Envoy. With that site, we identified that the main problem causing instability was throttling – so we brought throttling into the equation when iterating on the workload autoscaling. The result was faster responses than the baseline, even though we almost doubled the load, stayed aligned with CPU usage and kept a consistent number of containers. We also pushed efficiency from around 20% to 47% and reduced the cost of the environment to around an eighth of what it was initially. 

Key takeaways

 What can you take away from all this? Plenty! 

  • In Kubernetes, be sure to define resource quotas and optimize your resource allocations.
  • For efficient auto-scaling, don’t rely only on the default metrics exposed by the Kubernetes metrics API. There are far more metrics available in your observability backends and far more you can achieve with observability. Use the Keptn metrics server to bring Dynatrace metrics into your HPA rules and take advantage of this. 
  • Finally, report the cost (or energy consumption) of your workload on a daily basis from day one and use that KPI to optimize your environment. 

Ready for more? Our video on boosting engineering efficiency with OpenTelemetry, Keptn and Tyk is a great next step. 

And if you’ve finished auto-scaling successfully? Then why not turn your hand to GitOps-enabled API management in Kubernetes?