Bad API observability

Observability has become a buzzword, and the API world is slowly catching up. But be careful when looking for content about API observability – there are many outdated best practices and irrelevant articles out there.

I was recently invited as a guest to Stephen Townshend’s Slight Reliability podcast, and his brilliant blog post on “Bad observability” inspired me to start ranting about the topic.

Here is a follow-up to our discussion: the observability anti-patterns to avoid when working with APIs. Thanks, Stephen, for the inspiration!

Anti-pattern #1 – you forgot your users

It’s easy to lose sight of the fact that APIs are built for people to use. An API is not just a technical interface; it’s a bridge that connects digital experiences and enables people to accomplish tasks efficiently.

There are two different kinds of users for your API:

  • The developers who are integrating your API into their system.
  • The users that will interact with the applications or services that rely on your API.

If you can’t understand how they interact with your APIs, you’re missing half the picture.

When developers integrate with your APIs, how long does it take them to make their first successful API call? What are typical errors slowing them down? Observability plays an essential role in surfacing insights to improve developer experience.

Once your customers’ applications or services are in production, can you track how your releases impact their requests? Which version of your APIs are they using? What are the most popular endpoints? Are all customers getting the same performance? The same error rates?
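As an illustration, the sketch below uses the OpenTelemetry Python API to tag each request’s span with the calling customer and the API version they use, so latency and error rates can later be sliced per customer. The attribute names and the handler itself are hypothetical, not a prescribed convention.

```python
# Minimal sketch using the OpenTelemetry Python API; attribute names such as
# "app.customer_id" and "app.api_version" are illustrative, not a standard.
from opentelemetry import trace

tracer = trace.get_tracer("api.handlers")

def handle_orders_request(customer_id: str, api_version: str) -> dict:
    with tracer.start_as_current_span("GET /orders") as span:
        # Tag the span with who is calling and which API version they use,
        # so dashboards can slice latency and error rates per customer later.
        span.set_attribute("app.customer_id", customer_id)
        span.set_attribute("app.api_version", api_version)
        span.set_attribute("http.route", "/orders")
        return {"orders": []}  # placeholder for the real business logic

# Example call
handle_orders_request(customer_id="acme-corp", api_version="v2")
```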

Anti-pattern #2 – relying on API monitoring only

API monitoring typically involves sending automated test requests to API endpoints and verifying the responses. It helps you keep track of the health of your APIs even when there is no customer activity at all, and it can be used to report uptime.

But API monitoring is nothing more than black-box testing: you don’t see what is happening inside. You know that your requests reach the API and receive responses, but you don’t see why they fail.
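To make that black-box nature concrete, here is a hypothetical synthetic probe (plain Python with the requests library, against a placeholder URL): all it can report is the status code and the latency seen from outside.

```python
# Hypothetical synthetic uptime probe: everything it knows about the API comes
# from the outside (status code, latency). It cannot tell you *why* a request
# failed inside the system. The URL is a placeholder.
import time
import requests

def probe(url: str = "https://api.example.com/health") -> dict:
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=5)
        return {
            "up": response.status_code == 200,
            "status_code": response.status_code,
            "latency_ms": (time.monotonic() - start) * 1000,
        }
    except requests.RequestException as exc:
        # The probe only knows "it failed", not which service or dependency broke.
        return {"up": False, "error": str(exc)}

print(probe())
```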

Also, you can only create and maintain tests for a fraction of the possible use cases. You will miss fires.

Anti-pattern #3 – using API access logs for troubleshooting

API access logs are records of requests and responses made to an API, capturing details such as the requestor’s IP address, HTTP method, requested endpoint, status codes, and other relevant information for monitoring, debugging, and security purposes.

Using access logs to troubleshoot API issues is like searching for a needle in your microservices haystack. One of our users had a performance issue with their APIs; they spent days searching for the root cause using access logs – is it the Gateway? Does it need more CPU? Is it a configuration issue? They enabled distributed tracing – and within seconds, they could see what was happening: all the time was spent in the authentication server issuing JWT tokens.

To really understand what is happening in your system, you need to be able to track a single user request from the API Gateway to the upstream services and their dependencies for effective troubleshooting.
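Here is a minimal, simulated example of what that looks like with the OpenTelemetry Python SDK. The service names and delays are made up to mirror the anecdote above; in a real system, the spans would come from your gateway, auth server, and upstream services.

```python
# Simulated end-to-end trace: nested spans immediately show where the time
# goes. The service names and the 0.8 s delay are invented for illustration.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("trace-demo")

# One user request, end to end: the trace shows most of the time is spent
# issuing the JWT, not in the gateway or the upstream service.
with tracer.start_as_current_span("api-gateway: GET /orders"):
    with tracer.start_as_current_span("auth-server: issue JWT"):
        time.sleep(0.8)   # the hidden culprit
    with tracer.start_as_current_span("orders-service: fetch orders"):
        time.sleep(0.05)
```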

As Martin Thwaites puts it, “Logs are a poor man’s observability tool.” You need distributed tracing.

Anti-pattern #4 – different teams, different tools, different data

Different teams have different needs. The tool that works for your DevOps team might not be the best fit for your product team. By forcing everyone to use the same tool, you’re not promoting unity but inefficiency.

However, while promoting tool diversity might be the right move for your organisation, there should be some coordination and integration between different tools and data sources. Open standards like OpenTelemetry help ensure that information can be shared and leveraged across teams and tools.

The DevOps and product teams should, for example, rely on the same error rate when assessing the performance and reliability of a newly launched API product.
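As a sketch of what that shared foundation can look like, the snippet below configures an application to export its spans over OTLP to a collector, which can then fan the same data out to whichever backend each team prefers. It assumes the opentelemetry-exporter-otlp package and a placeholder collector endpoint.

```python
# Sketch: emit telemetry once, over OTLP, to a collector that can fan it out
# to whatever backends each team prefers. Requires the opentelemetry-exporter-otlp
# package; the endpoint and service name are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "orders-api"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

# Every team's tooling now consumes the same spans from the collector, so
# DevOps and product dashboards compute their error rates from identical data.
```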

Anti-pattern #5 – one size fits all

Fifth on the list is the anti-pattern of failing to recognise that different API architecture styles have different observability needs.

REST APIs might be today’s default, but newer styles like GraphQL and gRPC are gaining popularity. Each of these styles has distinct characteristics and requirements when it comes to observability. Just think about GraphQL’s way of handling errors: a failed query typically still returns HTTP 200, with the failure reported in the response body, which differs significantly from REST APIs. WebSocket and event-driven APIs have their own observability challenges as well.
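To make the GraphQL point concrete, here is a hedged sketch: a helper that inspects the response body for the errors array and marks the span as failed, even though the HTTP status code alone would look healthy. The span and attribute names are illustrative.

```python
# Sketch: GraphQL commonly returns HTTP 200 even when the query fails, with
# failures listed in an "errors" array in the body. Observability that only
# looks at status codes would call this request a success.
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("graphql.observability")

def record_graphql_result(response_json: dict) -> None:
    with tracer.start_as_current_span("graphql.query") as span:
        errors = response_json.get("errors") or []
        span.set_attribute("graphql.error_count", len(errors))
        if errors:
            # Mark the span as failed even though the HTTP status was 200.
            span.set_status(StatusCode.ERROR, errors[0].get("message", "GraphQL error"))

# Example: an HTTP-200 response that is actually a failure.
record_graphql_result({"data": None, "errors": [{"message": "Cannot query field 'foo'"}]})
```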

Different API architectures have different needs, and your observability strategy needs to reflect that to improve the overall reliability of your application.

Anti-pattern #6 – observability is for production only

Next up, we have the anti-pattern of not using observability in pre-production.

By using distributed tracing during the API development lifecycle, developers can trace the flow of requests and responses across different services, providing a comprehensive view of how the API behaves under various scenarios.

With observability, they can identify performance bottlenecks and even detect architectural issues like the N+1 problem in GraphQL queries.
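For instance, a pre-production test could collect spans in memory and fail the build when a single request fans out into suspiciously many database calls. The sketch below uses the OpenTelemetry Python SDK’s in-memory exporter; the span names and the threshold are illustrative assumptions.

```python
# Sketch of a pre-production check: export spans in-memory during a test and
# fail if one request fans out into too many database calls (a likely N+1).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("tests")

def resolve_orders_with_items(order_ids):
    with tracer.start_as_current_span("graphql: orders { items }"):
        for _ in order_ids:                 # one query per order: the N+1 smell
            with tracer.start_as_current_span("db.query: SELECT items"):
                pass

resolve_orders_with_items(range(50))
db_spans = [s for s in exporter.get_finished_spans() if s.name.startswith("db.query")]
assert len(db_spans) <= 5, f"Possible N+1: {len(db_spans)} DB queries for one request"
```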

For teams leveraging APIOps practices, observability data can be used as an additional validation before promoting an API to the production stage.

Anti-pattern #7 – starting the trace at the API gateway is overrated

Last but not least, we have the anti-pattern of not starting the distributed trace at the API gateway. Tutorials about observability usually start the instrumentation process at the microservice level. It is great to have detailed insights into your microservices, but when dealing with APIs, a lot happens at the API Gateway level.

You want to capture all your user transactions, including those that never reach your microservices because of rate-limiting rules, an authentication problem, or a caching mechanism.

Starting the trace at the API Gateway gives you a clear entrance point and a complete picture of the journey of all your users. That’s why using a modern SaaS API gateway, like Tyk, with built-in support for OpenTelemetry is essential.
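As a rough sketch of the plumbing involved, an upstream service can continue the trace the gateway started by extracting the W3C traceparent header from the incoming request. The header value, span names, and handler below are illustrative.

```python
# Sketch: an upstream service continuing a trace the gateway started, by
# extracting the W3C traceparent header from the incoming request. The headers
# dict is a stand-in for whatever your web framework provides.
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("orders-service")

def handle_request(headers: dict) -> None:
    # Pull the trace context the gateway injected into the request headers,
    # so this service's spans attach to the same end-to-end trace.
    ctx = extract(headers)
    with tracer.start_as_current_span("orders-service: GET /orders", context=ctx):
        pass  # business logic goes here

# Example header as a gateway might send it (value is illustrative).
handle_request({"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"})
```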

So there you have it – seven API observability pitfalls to avoid. Keep your APIs observable and your users happy!