Designing observable API platforms

There’s plenty to think about when it comes to designing observable API platforms. That’s why Tyk’s Budha Bhattacharya sat down with API experts and enthusiasts Colin Griffin (Krumware), Lorna Mitchell (Redocly) and Amod Gupta (Traceable) to chat about the importance of observability and how to build platforms that make the most of it.

In this blog, we’ll cover:

  • Observability can save time and make your developers happier
  • Always keep the destination in mind when designing for observability
  • Getting started is just as crucial as getting it right – are you logging yet?

Watch the full webinar video below, or read on for our top takeaways for successfully designing observable API platforms.



How can we ensure observability doesn’t take a back seat in the development lifecycle?

According to Krumware’s Colin Griffin, we often approach the development lifecycle in terms of how developers are interacting with applications. In platform engineering, the focus is on the end users – the developers, data teams, operators and so on – from an application management perspective. API observability tends to take a back seat – often because teams don’t really understand how to implement it or who is responsible for what.

This is where observability teams come in. They can help educate developers on how to get actionable information about how an API is operating, whether in terms of security, performance or just in general. There is plenty of scope for developers to take advantage of existing tools in their platform, but too many development teams don’t feel equipped for that and don’t know what observability means or how they can benefit from it. A lack of communication then compounds the difficulty of actually getting those tools in place. Observability teams need to work with developers to overcome these challenges and ensure that observability can be a core feature.

Why is observability in API platforms important?

Timesaving has to be top of the list, according to Griffin. Without observability, developers have to bounce from workload to workload to figure out what’s going wrong and where. It can take weeks to identify a single problem, compared to just minutes with an observability pattern in place.

That timesaving then ripples throughout the business. Developers can work more efficiently and achieve more with their time. They can focus on the right things at the right time and be happier in their work. The implications of that are huge.

For Redocly’s Lorna Mitchell, gaining value from observability in API platforms is also about finding the right level of observability for different parts of the application stack. APIs are often the edge, meaning if you’re under any kind of attack that’s where the metrics will tell you there’s a problem.

Making your APIs the first place you build in observability therefore means you can get plenty of value back without making a huge investment. Observable APIs keep delivering their value, because you’ve got resilience and reliability in the outside-facing parts of your application.

What happens when you don’t get observability right?

Traceable’s Amod Gupta points out that it’s important to strike a balance between having no observability data and having too much observability data. Both situations can be considered bad API observability. No data means you can waste hours if not days fighting fires when something goes wrong, trying to figure out which link in the chain failed. Conversely, you can have all the data in the world but be brought down by a lack of organisation of that data.

It’s become a staple to have observability data these days. It’s why OpenTelemetry and other CNCF-driven movements have become so successful – it’s a given that we need this kind of data. But it’s also crucial to think about how we’re going to use that data. Are you going to get alerts for every single metric that you’re ingesting from your platform, or will you have some outcome-driven KPIs that are driving the alerts? The latter means that, when alerts fire, you can drill down into the components that may have caused that outcome to deviate from the baseline.

Take the example of a movie ticketing platform. Unorganised alerts on CPU usage, memory utilisation, garbage collection and so on would quickly become a problem. But outcome-driven alerts, created with an understanding of baseline ticket sale patterns by day and by hour, could ensure alerts were only triggered when there was a genuine business problem.
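
To make the ticketing example a little more concrete, here is a minimal Python sketch of an outcome-driven alert. It assumes ticket sales are recorded per weekday and hour. The function names, the sample figures and the three-standard-deviation tolerance are all illustrative assumptions, not anything described in the webinar.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Group historical ticket sales by (weekday, hour) and summarise each bucket."""
    buckets = {}
    for weekday, hour, sales in history:
        buckets.setdefault((weekday, hour), []).append(sales)
    # A spread needs at least two samples, so thinner buckets are skipped.
    return {
        key: (mean(values), stdev(values))
        for key, values in buckets.items()
        if len(values) >= 2
    }

def should_alert(baseline, weekday, hour, sales, tolerance=3.0):
    """Alert only when sales fall well below the baseline for this day and hour."""
    if (weekday, hour) not in baseline:
        return False  # no baseline yet, so stay quiet rather than guess
    average, spread = baseline[(weekday, hour)]
    return sales < average - tolerance * spread

# Hypothetical history: busy Friday evenings, quiet Tuesday nights.
history = [
    ("Fri", 19, 480), ("Fri", 19, 510), ("Fri", 19, 495), ("Fri", 19, 515),
    ("Tue", 3, 4), ("Tue", 3, 6), ("Tue", 3, 5),
]
baseline = build_baseline(history)
```

With this shape, a handful of sales at 3am on a Tuesday stays silent because that is normal for the hour, while a sharp drop on a Friday evening fires. CPU, memory and garbage collection data then become what you drill into after the alert, rather than the source of the alert itself.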

How do observability design considerations evolve over time?

Redocly’s Lorna Mitchell points out the importance of thinking about the destination when it comes to API design. What are you shipping? What kind of API is it? You’ll need to map out what’s appropriate in terms of governance, standards, security, what the production environment looks like and so on. Different needs and goals will dictate a different approach to observability.

That said, the OpenAPI Specification (OAS) is here to help, whatever your goals may be. It can enable you to communicate precisely and thoroughly across your teams, your tools and every stage of the life cycle, from design through to your API gateways. You’ll also be using your OpenAPI descriptions in the future with things that haven’t been invented yet!

Mitchell also warns against not taking things seriously enough at the outset of the design process. Mistakes here might include:

  • Not getting your tool chain to the point where different teams can control different parts of the API surface.
  • Not shipping the public endpoints to the API gateway, so that they’re not there, let alone secured.

Designing for observability isn’t simply a process of writing code, generating the API description and using it. It’s about manoeuvring the pieces of OpenAPI, using tools to combine things and filter out things that aren’t really implemented yet – whether endpoints or whatever it is that makes sense in the context of your application.
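
As a rough illustration of filtering out things that aren’t implemented yet, the Python sketch below removes operations from an OpenAPI description before it is published. It assumes the description has already been loaded into a dictionary and that unfinished operations carry a hypothetical `x-implemented: false` extension. That marker is an assumption made for this example, not part of the OpenAPI Specification or of any particular tool.

```python
import copy

# Operation keys that a Path Item Object can contain.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

def filter_unimplemented(description, marker="x-implemented"):
    """Return a copy of an OpenAPI description without operations marked as not implemented."""
    result = copy.deepcopy(description)
    kept_paths = {}
    for path, item in result.get("paths", {}).items():
        kept = {
            key: value
            for key, value in item.items()
            # Non-operation keys (parameters, summary and so on) are always kept.
            if key not in HTTP_METHODS or value.get(marker, True)
        }
        # Keep the path only if at least one operation survives.
        if any(key in HTTP_METHODS for key in kept):
            kept_paths[path] = kept
    result["paths"] = kept_paths
    return result
```

Running a step like this in the tool chain means the gateway and the published documentation only ever see endpoints that actually exist.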

This is where a lot of organisations aren’t quite getting to grips with the initial steps – they perhaps don’t have good linting or they’re using recommended rules rather than adding what’s important for their organisation. That’s the path to ending up with something that’s a bit of a compromise.
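
To show what adding your own rules might look like, here is a minimal Python sketch of two organisation-specific checks run against an OpenAPI description loaded as a dictionary. The two rules (every operation needs an `operationId`, and every operation must be covered by a `security` requirement) are hypothetical examples of what an organisation might care about. In practice you would more likely express them in your linter’s own rule format.

```python
# Operation keys that a Path Item Object can contain.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

def lint(description):
    """Return a list of human-readable problems found in an OpenAPI description."""
    problems = []
    has_global_security = "security" in description
    for path, item in description.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip shared parameters, summaries and similar keys
            where = f"{method.upper()} {path}"
            if not operation.get("operationId"):
                problems.append(f"{where}: missing operationId")
            if "security" not in operation and not has_global_security:
                problems.append(f"{where}: no security requirement declared")
    return problems
```

An empty list means the description passes. Anything else can fail the build before the API reaches the gateway.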

Understanding the value of open standards

Tyk’s Budha is a major advocate of open standards – as is Tyk, with our open source API gateway, native support for the OpenAPI Specification, a dashboard that’s extensible using Open Policy Agent and native support for OpenTelemetry. As Budha points out, open standards introduce foundational guardrails and rules around standardisation and governance. These strong, predictable foundations mean you can innovate faster, with developers having faith that the system isn’t going to break every time they try something new.

Open standards are essentially a language that enables communication and connection across different integrated platform stacks. They let systems do more than stand alone, communicating with observability platforms, security solutions, identity providers and more, while avoiding the cost of vendor lock-in. It’s something from which every business can derive value.

Designing API security to meet future needs

Never ship an API without the security it will eventually have. That’s a golden rule. But how do you design API security to meet your future needs, as well as your current ones? Traceable’s Amod Gupta reminds us that APIs are really the gates of our modern distributed services. Securing those gates is an end-to-end journey – one that starts when you’re thinking about developing your APIs and continues from your OpenAPI Specifications through to writing code, code scanning and active testing.

From that point forward, passive detections come into play, through looking at traffic and harnessing the real-time capabilities of OpenTelemetry and observability. Security and observability are very tightly coupled. Traces are a big contributor, enabling event detection through analysis of individual traces and of patterns, which can reveal enumeration attacks and attackers probing for weaknesses.
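
As a simplified illustration of pattern-based detection, the Python sketch below looks across many request records for a client that walks through lots of distinct resources and mostly gets "not found" responses, which is one rough signature of an enumeration attack. The record fields (`client_id`, `path`, `status`) and the thresholds are assumptions made for this example. They are not OpenTelemetry attribute names, and a real detector would weigh far more signals.

```python
from collections import defaultdict

def find_enumeration_suspects(spans, min_distinct=20, min_not_found_ratio=0.8):
    """Flag clients that request many distinct resources and mostly receive 404s."""
    paths = defaultdict(set)
    totals = defaultdict(int)
    not_found = defaultdict(int)
    for span in spans:
        client = span["client_id"]
        paths[client].add(span["path"])
        totals[client] += 1
        if span["status"] == 404:
            not_found[client] += 1
    return sorted(
        client
        for client in totals
        if len(paths[client]) >= min_distinct
        and not_found[client] / totals[client] >= min_not_found_ratio
    )
```

No single request here looks suspicious on its own. It is the pattern across the traces that stands out, which is why the rich data that traces carry matters for security.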

Traditionally, people associate observability with performance monitoring, but it plays a huge role in security too, due to the rich data that traces provide.

Overcoming the challenges of building observable API platforms

One of the big benefits of our cloud native world is that you can run someone else’s applications and use someone else’s tools. But to do this, you need to get your own house in order. That means ensuring you’re logging appropriately.

Making sure your code, applications and APIs follow proper logging patterns can make building an observable API platform less challenging. It enables telemetry data to be picked up, syphoned and delivered to where you need it without you having to hard-code the destination. Logging to standard output and standard error is a good starting point. Observability tools can then pick up those logs and run with them, adding more embedded tracing and logging along the way to build actionable intelligence.
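
As a starting point, a service might emit one JSON object per log line to standard output and leave delivery to the platform. The sketch below uses only Python’s standard `logging` module. The `JsonFormatter` and `build_logger` names and the chosen fields are illustrative assumptions rather than a prescribed format.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to standard output by default."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]  # replace any handlers attached earlier
    logger.propagate = False     # avoid duplicate lines via the root logger
    return logger

logger = build_logger("orders-api")
logger.info("order created")
```

Because the application only writes to its own output stream, whatever collects logs on the platform can pick the lines up and route them without any destination being coded into the service.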

If you don’t know where to start with logging and are worried about the cost implication of logging everything, Colin Griffin of Krumware advises logging as much as you can at the application layer, then implementing logging drivers or other tools to filter what you store, rather than limiting what you log. Remember that with proper observability tools you can predict and understand the cost of your logging and monitoring.
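
The "log everything, filter what you store" idea can be sketched in Python with a filter attached to the handler rather than to the application’s loggers. Here the handler stands in for a logging driver or collector, which in a real platform would usually sit outside the application. The `KeepForStorage` rule (keep warnings and above, plus everything from a hypothetical `payments` logger) is purely an assumption for illustration.

```python
import logging
import sys

class KeepForStorage(logging.Filter):
    """Decide what reaches storage; the application itself still logs everything."""
    def filter(self, record):
        # Always keep warnings and errors; keep lower levels only for 'payments'.
        return record.levelno >= logging.WARNING or record.name.startswith("payments")

def attach_storage_handler(stream=None):
    """Attach a filtered handler to the root logger and return it."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.addFilter(KeepForStorage())
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)  # the application logs at full detail
    root.addHandler(handler)
    return handler
```

The application code never changes when the storage rule does. Tightening or loosening what you keep becomes a platform decision rather than a code change.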

Should you use OAS to describe observability requirements?

Not really, according to Redocly’s Lorna Mitchell. The OpenAPI description is there to describe your API interface. Changing your observability or improving your tracing is not part of that. There are standards that you can use to describe tracing events and similar – for example, with OpenTelemetry there are some useful async API elements – but they’re not all OpenAPI.

However, Traceable’s Amod Gupta points out that it’s important to record your KPIs and document your SLAs. A user needs to know what to expect. OAS may be the place to put that or it may not – but it needs to be somewhere.

What are the north star metrics that drive value for API platforms?

Krumware’s Colin Griffin advises simply getting started, rather than worrying too much at first about picking a particular set of metrics. Start logging first, to make sure your APIs are producing farmable, actionable intelligence.

For Redocly’s Lorna Mitchell, response time is a key product metric. When response time starts to decay, it’s a big problem for all your integrators. Customer churn will follow hot on the heels of a decaying response time.

Traceable’s Amod Gupta adds that load, response time and error rate are north star metrics from a performance perspective. When you’re using observability for security, those metrics change a little, with volumetric data becoming more important in terms of identifying what type of attacks are happening. In that case, looking at the kind of data that’s coming in, whether it’s coming over a secure protocol and the enumeration type patterns within that data are all key.
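
For the performance view, load, response time and error rate can all be derived from the same request records. The Python sketch below assumes each record has a `status` and a `duration_ms` field and that the records cover a known time window. The field names, and the choice to count only 5xx responses as errors, are assumptions for this example.

```python
def north_star_metrics(requests, window_seconds):
    """Summarise load, average response time and error rate for one time window."""
    count = len(requests)
    if count == 0:
        return {"load_rps": 0.0, "avg_response_ms": 0.0, "error_rate": 0.0}
    errors = sum(1 for request in requests if request["status"] >= 500)
    return {
        "load_rps": count / window_seconds,
        "avg_response_ms": sum(r["duration_ms"] for r in requests) / count,
        "error_rate": errors / count,
    }
```

An average is the simplest summary of response time. Percentiles are often a better fit for spotting the slow tail, but the inputs are the same.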

On the business side, Gupta points out that metrics will depend on the business you’re in. An ecommerce company, for example, will be focused on customers coming to the page – how many are logging in, surfing, adding items to their basket and checking out. There, OpenTelemetry data and performance monitoring data will drive the entire funnel.

API observability fundamentals

If the discussion above has inspired you to further expand your knowledge, why not access Tyk’s on-demand API observability fundamentals programme? It has been designed to take your observability skills from basic to extraordinary, revolutionising the way you approach platform team operations and observability strategies.