What is the RED method for root cause analysis and how can you leverage it? Tyk caught up with Grafana Labs Solutions Engineer Mayra Oyola to find out – and to look at the tools you can use to streamline your approach.
In this blog, we’re covering:
- What the RED method is and why it’s so useful.
- The open source tools you can use to leverage its benefits.
- How the RED method underpins a proactive approach to your future observability strategy.
What is the RED method for streamlined root cause analysis?
The RED method was developed by Grafana Labs CTO Tom Wilkie around 2015, when he noticed a gap in monitoring philosophy when it came to microservices. Since then, it has become fundamental to the way that Grafana approaches observability.
RED stands for:
- Rate – the number of requests per second. A system that handles a high number of requests per second effectively and quickly is likely to provide a smoother, more responsive experience for your users.
- Error – the number of requests that are failing. This doesn’t only cover obvious things like timeouts and failed transactions; it can also be a sign that your service is either partially or completely unavailable.
- Duration – the amount of time requests take to complete. This indicates the speed your users are experiencing, highlighting a slow application or website as well as resource bottlenecks.
Combining these three metrics delivers insight into the health of your application, along with a quick cognitive understanding of the customer experience it’s delivering.
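To make this concrete, here's a minimal sketch of capturing RED metrics for a single service with the Python prometheus_client library. The "checkout" service name, the route label and the port are hypothetical, and the metric names are illustrative rather than any fixed convention.

```python
# Minimal RED instrumentation sketch using prometheus_client.
# Metric names, labels and the "checkout" service name are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Rate: total requests", ["service", "route"])
ERRORS = Counter("http_request_errors_total", "Errors: failed requests", ["service", "route"])
DURATION = Histogram("http_request_duration_seconds", "Duration: request latency", ["service", "route"])

def handle_request(route: str):
    REQUESTS.labels("checkout", route).inc()
    start = time.time()
    try:
        ...  # real request handling goes here
    except Exception:
        ERRORS.labels("checkout", route).inc()
        raise
    finally:
        DURATION.labels("checkout", route).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for your monitoring backend to scrape
```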
When should you use the RED method?
It’s important to collect RED metrics for every single one of your services. Doing so provides insight into those individual services. It also starts to establish a uniform, scalable, repeatable strategy that can grow with you and your application. That being said, the RED method is most commonly applied to request-driven services and applications.
Let’s look at service level objectives (SLOs) as an example of the value of the RED method.
Service level agreements (SLAs) are a fundamental agreement between you and your users. An SLA sets out customer expectations and lets your team know which issues they’re responsible for and in what priority order. Because the RED method is so centered on how happy customers are, RED metrics are a great way to measure your SLAs.
Service level indicators (SLIs) are the metrics you select to determine whether you’re meeting the SLOs behind the SLA you’ve defined.
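As a rough illustration, here's a minimal sketch of checking an availability SLI against an SLO target, using hypothetical request and error counts taken from RED metrics over a measurement window.

```python
# Hypothetical RED metric totals for one service over a 30-day window.
total_requests = 1_200_000   # Rate: requests served
failed_requests = 840        # Errors: requests that failed

sli = 1 - (failed_requests / total_requests)  # availability SLI, here ~99.93%
slo_target = 0.999                            # the objective backing the SLA

if sli < slo_target:
    print(f"SLO breached: {sli:.4%} is below the {slo_target:.2%} target")
else:
    print(f"SLO met: {sli:.4%} meets the {slo_target:.2%} target")
```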
RED method tools
When you start to establish SLO goals and want to use the RED method, there are plenty of open source tools that can help. OpenTelemetry (OTel) is a powerful collection of protocols, instrumentation libraries, agents and a collector that you can use to get the telemetry data. You can then visualize that data using a tool such as Grafana.
There are many different options for architecting and deploying OpenTelemetry. You can instrument your code and start emitting telemetry data in OTel format, which you can forward to the OTel collector and eventually to a backend. You also have the option to skip the collector and send the data straight from your application to your backend.
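As a minimal sketch of that first option, the snippet below uses the OpenTelemetry Python SDK to send spans to a collector over OTLP/gRPC. The service name, the collector endpoint and the charge_card function are assumptions for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Emit spans in OTel format and forward them to a collector (assumed local endpoint).
provider = TracerProvider(resource=Resource.create({"service.name": "payment-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_card(order_id: str):
    # Span attributes become the dimensions available when metrics are
    # later generated from these traces.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        ...  # call the payment provider here
```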
OTel is taking the observability world by storm because of its vendor agnostic approach. It means that starting your observability journey with OpenTelemetry can save you time and migration pain as your strategy evolves and your tool belt grows. It’s like a collection of standards and best practices to ensure everything runs smoothly. The huge open source community backing also means there’s plenty of support and advice available on using OTel.
OpenTelemetry spans programming languages, frameworks, libraries and even services, providing comprehensive coverage regardless of how diverse your technology stack is or how it evolves.
Grafana Tempo is another helpful tool for leveraging the RED method. It’s an open source, highly scalable, distributed tracing backend that can ingest common open source tracing protocols, including Jaeger, Zipkin and OTel.
For the RED method, Grafana’s metrics generator is particularly useful. This uses the data you have in Tempo to generate metrics from traces. It has three different sets of processors, including one called the span metrics processor, which is responsible for generating the three metrics we discussed above for every single combination of dimensions. This includes the service name, the operation, the status code… basically any tag or attribute present in the span. (A quick side note: The more dimensions you enable within your spans, the higher the cardinality of the generated metrics, so add a few blogs on cardinality management to your reading list!)
The span metrics processor is designed to mirror OTel’s span metrics connector, providing another option for getting those important RED metrics.
Another helpful tool is Grafana Pyroscope. This helps pinpoint where the latencies are in your code, enabling you to optimize performance from there. Pyroscope is an open source continuous profiling database. It enables faster incident response and resolution by providing code-level insight. It’s also a handy tool for cost-cutting, as it can help identify where you have resource hotspots.
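Getting started with continuous profiling can be as simple as configuring the SDK at startup. Here's a minimal sketch using the Pyroscope Python SDK; the application name, server address and tags are hypothetical.

```python
import pyroscope

# Profile this process continuously and ship the data to a Pyroscope instance.
pyroscope.configure(
    application_name="payment-service",          # how profiles are grouped (assumed name)
    server_address="http://localhost:4040",      # assumed Pyroscope endpoint
    tags={"region": "eu-north", "env": "prod"},  # dimensions for later comparisons
)
```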
You can use these tools to look at your data, your profiles, and identify any latency issues that need changing within your code. You can also potentially compare it to a better performing piece of code and then make the changes you deem necessary. This is going to help you achieve those SLAs we mentioned above, while also ensuring your customers stay happy and enjoy a good experience using your application.
Diving into the RED method in action
The screenshot above shows the monitoring of a mythical application. The dashboard comprises panels that visualize many of the metrics we’ve discussed above. On the left, the request section points out spikes or dips in the volume of HTTP requests, broken down by service, for a granular view.
On the right of the screenshot is the overall error percentage, which will alert you if it exceeds a certain threshold. Remember, it’s also important to look at individual services, as one service might be called less often than the others, meaning its errors could be drowned out in the overall error percentage calculation, as shown here:
List forms can also be helpful for seeing the information you need very quickly, as the screenshot below demonstrates. It provides plenty of valuable information at a single glance, which is important when creating dashboards and can be particularly useful for rankings that are dependent on ever-changing metrics.
Visualizing latency is also useful. Understanding what 95% or 99% of your users are experiencing in terms of performance can be impactful, as well as helpful when it comes to establishing your SLOs.
Once you’ve established observability, and have greater visibility into your systems, it’s time to start thinking proactively. This means using your new insight to alert you the moment something happens. No more hearing about it from your users first!
You can establish alerts for request rate, high latency and so on, but you can also start thinking more proactively. For example, you could establish an alert based on the response time for your payment service, linked to the SLO you’ve defined. You could also use machine learning to forecast what metrics such as request rate should be, then establish alerts that flag any anomalous behavior. That gives you the power to start working on your application and solving things before your customers even experience them.
This demonstrates the power of the RED method in understanding your user experience. You can see at a glance from your error percentages whether your users are enjoying a smooth experience or dealing with page load errors and unavailable services. If it’s the latter, your alerts can ensure you’re already dealing with it before you hear about it from your users. By taking traces, generating metrics from them, storing them in your metrics backend and using them to create dashboards, you can take the RED method and grow it into an observability strategy you can be proud of.
Taking observability into the future
You can look even further into the future of observability, as we can see with this ride-share example application:
Here, you can see that the us-east region, which is performing well, provides a baseline for a side-by-side comparison with the eu-north region, which is using up a lot more CPU. This provides scope for looking at what’s different in the eu-north region, to understand what could be causing it to use so much more CPU.
This type of side-by-side comparison can be used for a range of purposes, from diving into spikes to seeing the difference between two pieces of code. In this example, the visualizations show instantly that the order-car operation takes much longer in eu-north than it does in the us-east region. Looking at the data in this and other ways provides insights that enable you to improve your code to be as efficient as possible, ensuring that all users benefit from the same high-quality experience.
Does the RED method replace the USE method?
The USE method stands for utilization, saturation and errors. It’s an established method that works really well with hardware. The RED method is not intended to replace the USE method; instead, RED was developed for microservices. You can leverage both methods to ensure your hardware is happy and your customer experience is good, enjoying the peace of mind of knowing that everything is working properly and efficiently.
Where to start
If you’re ready to get started with the RED method, Grafana’s documentation has plenty of detail on how to instrument everything in line with best practices from day one. Additionally, grafana.com/dashboards has a whole library of dashboards that you can download, use and customize to your specific use case. Enterprise solutions are also available, in the form of Application Observability for Grafana Cloud users.
Where next?
If you’re ready for more observability-related insights, check out the Tyk blog. We’ve covered everything from DORA metrics and improving developer productivity to how to use observability to troubleshoot API issues within minutes.
If you really want to take your observability knowledge to the next level, you can also access Tyk’s on-demand API observability fundamentals program, designed to take your API observability skills from basic to extraordinary.