Breaking the chain of blame: true end-to-end testing with distributed tracing

 

To test our OpenTelemetry (OTel) implementation here at Tyk, we use Tracetest. It’s a great tool, with plenty of potential for testing APIs based on OpenTelemetry. As such, we were delighted to catch up recently with Ken Hamric, Founder of Tracetest, and Lead Software Engineer Oscar Reyes. Ken and Oscar talked us through how you can break the chain of blame by applying true end-to-end testing with distributed tracing. 

We’ve gathered their key insights below to show how your business can derive value through distributed tracing. 

In this blog, we cover:

  • How to leverage observability data in your test environments
  • How this can break the chain of blame and encourage other teams to adopt observability
  • How you can build this into your CI/CD pipeline 

You can watch the full webinar video above or read on to discover our top takeaways. 

Using observability proactively 

Traditionally, we use observability reactively to provide visibility when troubleshooting. But what if you also leverage it proactively in your tests? 

Let’s say you have a sample system with a React frontend and Tyk Gateway to proxy backend services. There’s a two-step backend process, with APIs and a worker that fetches information from a third-party API and stores it in Redis and Postgres. Everything is instrumented for OpenTelemetry, with all telemetry from the frontend, Tyk Gateway and backend services going straight to the OTel Collector and then on to Jaeger for storage. You can scan the QR code below to see and play with this Tyk/Playwright repo: 

[QR code: Tyk/Playwright example repo]

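For reference, the trace pipeline in a setup like this typically boils down to a small OpenTelemetry Collector configuration: receive OTLP from the frontend, gateway and services, batch it, and export to Jaeger. The sketch below is illustrative rather than the demo repo’s actual config, and the endpoints are assumptions:

```yaml
# Illustrative OTel Collector pipeline - not the demo repo's actual config.
receivers:
  otlp:
    protocols:
      grpc:    # backend services and Tyk Gateway
      http:    # browser/frontend instrumentation
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317    # assumes Jaeger's native OTLP ingest is enabled
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```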
If something goes wrong, it’s all too easy for management to blame the developers at the frontend and for those developers to pass that blame along to the backend. But there is a better way… 

The advantages of end-to-end testing with distributed tracing

By using true end-to-end testing with distributed tracing proactively, you can:

  • Break the chain of blame 
  • Reduce mean time to repair (MTTR) for test failures
  • Encourage developers and quality assurance teams to adopt observability – suddenly, it’s no longer just something for site reliability engineers!

Say someone fiddles with our fictitious sample system and knocks out the worker that fetches information from the third-party API. You can look at the traces and reactively figure out what has happened, then fix it. Great. But what if you could catch the problem earlier? 

This is where a tool such as Tracetest provides you with the opportunity to get more out of your existing OpenTelemetry traces. You can execute tests against your generated traces and OTel data, using the instrumentation that’s already there. 

Tracetest’s Playwright integration is an npm package that you can install and import into your Playwright scripts. After you import the package, you just need to create a Tracetest instance object. Then you run the capture function before navigating to the website you’re testing. This allows Tracetest to use the same trace for the entire user session, meaning the spans can be connected all the way from the frontend. Finally, you use the summary function to validate that the trace-based tests pass, as well as the frontend tests in Playwright. You can now run this in your CI/CD pipelines and fail the build if a trace-based test fails.
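Pulled together, a minimal Playwright spec wired up to Tracetest might look roughly like the sketch below. The `@tracetest/playwright` package and the `capture`/`summary` calls follow the steps described above, but treat the constructor options, URLs and selectors as assumptions and check the Tracetest docs for the current API:

```typescript
// Sketch only: wiring Tracetest into a Playwright spec, following the steps above.
// Constructor options, URLs and selectors are assumptions, not the demo repo's code.
import { test, expect } from '@playwright/test';
import Tracetest from '@tracetest/playwright';

const { TRACETEST_API_TOKEN = '' } = process.env;

let tracetest: Awaited<ReturnType<typeof Tracetest>> | undefined;

test.beforeAll(async () => {
  // Create the Tracetest instance object (token-based setup is an assumption)
  tracetest = await Tracetest({ apiToken: TRACETEST_API_TOKEN });
});

test.beforeEach(async ({ page }, { title }) => {
  // Start capturing before navigating, so the whole user session shares one trace
  await tracetest?.capture(title, page);
});

test('frontend request flows through the Tyk Gateway', async ({ page }) => {
  await page.goto('http://localhost:3000'); // hypothetical frontend URL
  await expect(page.getByRole('heading', { level: 1 })).toBeVisible();
});

test.afterAll(async () => {
  // Run the trace-based tests against the captured traces and fail the run
  // if any of them fail - this is what you hook into your CI/CD pipeline
  await tracetest?.summary();
});
```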

You can also output links to your user interface (UI) for a better visual of what happened. The Tracetest UI lets you see the full trace, including user clicks and fetch requests created by the frontend (all auto-instrumented by OTel libraries). In the case of the sample system described above, you’ll also be able to see the Tyk Gateway parts (how key authentication happens, whether the key has expired, what access is granted), the backend requests from the NodeJS API, the request validation and the worker. 

You now have the power to add custom checks that break the test. You can set the test to fail, for example, if someone deletes the worker again. All it takes is a few clicks. You can verify it’s working by knocking out the worker in Docker. 

Going beyond that, you can add additional test specs, for example to ensure that all your database spans are fast. Simply add a single assertion that says every database span must complete in less than one hundred milliseconds. 
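As an illustration, here’s roughly what those two checks could look like as Tracetest test specs. The selector syntax follows Tracetest’s assertion language, but the worker span name is hypothetical and the right selectors depend on how your services are instrumented:

```yaml
# Illustrative Tracetest test specs - span names/selectors depend on your instrumentation.
specs:
  # Break the test if the worker span is missing (e.g. someone deleted the worker again)
  - selector: span[name="worker.process"]   # hypothetical worker span name
    name: The worker span must exist
    assertions:
      - attr:tracetest.selected_spans.count >= 1
  # Break the test if any database span takes 100ms or longer
  - selector: span[tracetest.span.type="database"]
    name: All database spans complete in under 100ms
    assertions:
      - attr:tracetest.span.duration < 100ms
```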

Proactive testing with distributed tracing 

Once you’re comfortable with using distributed tracing for end-to-end testing, you can build it into your CI/CD pipelines and run it continually, using your existing setup. Tracetest can output links in your CI/CD logs, so you can follow them to the Tracetest UI whenever you need. You can also share those links with your team, so everyone can see what has broken, or set up Slack alerts to validate that everything is working as expected. 
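As a rough sketch, a GitHub Actions job for this can be as simple as running your existing Playwright suite; the summary step then fails the job whenever a trace-based test fails. The workflow below is a generic illustration, and the environment variable name and Docker Compose setup are assumptions:

```yaml
# Generic CI sketch: run the Playwright + Tracetest suite on every pull request.
name: e2e-trace-tests
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: docker compose up -d    # assumes the stack runs via Docker Compose
      - run: npx playwright test     # tracetest summary fails the run on trace-test failures
        env:
          TRACETEST_API_TOKEN: ${{ secrets.TRACETEST_API_TOKEN }}   # assumed variable name
```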

By putting this in place, you can effectively break the chain of blame. No more pushing blame from the frontend to the backend! Instead, you can visually see, end to end, what’s going on via the trace. 

What’s more, your developers and quality assurance team will be able to see clear value in your observability efforts. 

Tracetest open source versus commercial

It’s worth noting that Tracetest has both open source and commercial versions available. The open source version is a single instance for running locally, while the commercial version is ideal for multiple environments, with plenty of advanced features. 

Tracetest use cases 

Tracetest suits a range of use cases, from fully serverless environments to Kubernetes and GraphQL. Ultimately, if there’s a trace somewhere, Tracetest can grab it. 

While we’ve used Playwright in the example above, Tracetest has integrations with Cypress, AWS Lambda and X-Ray, Datadog, Elastic, Kafka, the OTel Collector and many more distributed tracing and testing tools and frameworks. 

Tracetest’s integration with K6 is particularly useful for load testing. For example, an API may be working well and able to handle the load, but have other processes behind it that can’t handle that same load. Tracetest enables you to inspect those processes and set assertions against them, so you can really see what’s going on – all while using the same language between production and testing. 

The fact that dev teams can write tests also means that they will start putting better data into their instrumentation – because they need it to build the tests. Then, when you’re in production and troubleshooting at 2 am, you’ll find you’ve got better data in your trace, so you can fix things faster. It’s a very positive cycle of observability-driven development. 

Why not read more about Tracetest in action by seeing how Tyk leverages it for effective integration testing of OpenTelemetry?