Everything that could go wrong with GraphQL and how OTel can help

When it comes to GraphQL, a LOT can go wrong! Thankfully, OpenTelemetry can help. Here’s how. 

Introducing any new technology into the architecture landscape usually means added complexity and new challenges – along with a general sense of overwhelm. GraphQL is no exception. Thankfully, OpenTelemetry can do much to help solve the problems of GraphQL. It can help from a developer and engineering perspective, as well as on the operations side of things.  

In this blog, we’ll look at:

  • How OpenTelemetry (OTel) can address GraphQL challenges, benefitting developers and operations teams.
  • How OTel aids effective GraphQL error detection. 
  • OTel’s role in identifying and addressing performance issues in GraphQL, enabling you to optimise performance. 

 

 

Watch the full webinar video via the link above, or read on for our top takeaways. 

SQL versus GraphQL versus REST

Let’s start with SQL. SQL is a query language used to manage and manipulate data in a database. It lets you access many records with a single declarative query, and it introduces the concept of accessing records by key: you reference an ID, a primary key and so on. 

With SQL, you can perform operations on data within a database – selecting, inserting records, updating, deleting and so on. And with a single query, you can access data across multiple tables using joins.

Some of these concepts have made their way into the REST world. REST lets you act on a resource using different HTTP methods, so you can use GET, POST, PUT and DELETE to read and manipulate resources. However, to access a specific resource or a nested resource, you need to call a different endpoint URL, meaning you can’t really perform joins using REST (unlike with good old SQL). 

We won’t go into the history of GraphQL today, but in simple terms it’s a bit of a mishmash between SQL and REST. It takes ideas that were originally developed to query databases and applies them to the wider internet, so you can expose them as an API. Like a SQL join, a single GraphQL query can return connected data. And, like SQL, GraphQL covers writes as well as reads: you use queries to fetch data and mutations to create, change or remove it. 

Unlike SQL, however, GraphQL doesn’t just query data stored in databases. It’s a bit more flexible and a lot more general purpose. Data can reside almost anywhere: multiple databases, different file systems, REST services, gRPC, SOAP or even event-based systems such as Kafka. It’s like a declarative query language for the internet. 

Why use GraphQL? 

Let’s look at a common example. You need to get the posts that have been published on a blog and find the authors of those posts. All you need from the authors is an ID and a name, nothing more. 

This is where you start running into issues with REST, starting with over-fetching from the API. Over-fetching means your request returns far too much data: with REST, you get back the entire resource, including a whole bunch of data you don’t need. It’s inefficient, particularly when you scale. 

An opposite (but equally inefficient) problem is under-fetching. This is when a single API call doesn’t return enough information, so the client has to make one or more additional round trips to enrich its data and build the required data model. That’s a lot of wasted bandwidth and resources, not to mention added latency. It also adds a lot of complexity for consumers who want to integrate with the API product. 

These are the kind of problems that GraphQL solves. You can use it to query the API, with the GraphQL request syntax conveniently mirroring the shape of the JSON you would expect in the response. You can request precisely what you need, with absolute flexibility, and know exactly which data will be returned. Nothing more, nothing less. 
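To make that concrete, here is a minimal Python sketch of the blog example. The query text, the field names (`posts`, `title`, `author`) and the `select_fields` helper are all illustrative assumptions, not part of any real GraphQL library. The helper only mimics how a selection set trims a full resource down to the requested fields.

```python
# Hypothetical query for the blog example: posts plus each author's id and name.
QUERY = """
{
  posts {
    title
    author {
      id
      name
    }
  }
}
"""

# The same selection expressed as nested dicts (an empty dict marks a leaf field).
SELECTION = {"posts": {"title": {}, "author": {"id": {}, "name": {}}}}


def select_fields(data, selection):
    """Return only the fields named in `selection`, recursing into objects and lists."""
    if isinstance(data, list):
        return [select_fields(item, selection) for item in data]
    if not selection or not isinstance(data, dict):
        return data
    return {
        field: select_fields(data[field], sub)
        for field, sub in selection.items()
        if field in data
    }


# What a REST-style endpoint might hand back: the entire resource (made-up data).
FULL_RESOURCE = {
    "posts": [
        {
            "title": "Hello GraphQL",
            "body": "A long article body...",
            "author": {"id": 1, "name": "Ada", "email": "ada@example.com", "bio": "..."},
        }
    ]
}

trimmed = select_fields(FULL_RESOURCE, SELECTION)
print(trimmed)
```

Notice that the trimmed result has the same shape as the query: the post body and the author's email and bio never reach the client.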

GraphQL enables API producers to define their schema and write some very clever code – or do it declaratively with Tyk – to dynamically resolve those requests. And because you have a shared schema for both producers and consumers, there’s a contract between them. So you get the benefit of being able to scaffold and generate both service and type-safe clients, just like gRPC. 

GraphQL and OpenTelemetry

Let’s use another hypothetical example to look at the benefits of using OpenTelemetry with GraphQL – an imaginary travel business with three microservices:

  • A country service that returns information about different countries
  • An image service that provides pictures of different destinations 
  • A weather service that lets you know whether to expect rain or shine 

Think of GraphQL as an ingress, which can conveniently expose these microservices as a combined and very purposefully designed API product. You can use queries to obtain information from all of these services. 

As a simple example, let’s say you have one client – an internal React app for your website. If something goes wrong, your internal team will probably reach out by Slack to let you know. But what about once you’ve exposed that app as an API product and sold it? You could end up with a whole bunch of different applications by different consumers, with different developers (internal, partners, the public…) all building against your GraphQL API – and even publishing to your API marketplace.

At this scale, if something goes wrong, you need to know about it FAST. It won’t just be your internal team reaching out on Slack: it will be your customer base walking away and switching to your competitors. That’s why monitoring GraphQL in production is so important. 

This is where the RED method comes in. The RED method is a monitoring strategy used to gain insight into the health and performance of distributed systems:

  • Rate – the number of requests, per second, your services are serving
  • Errors – the number of failed requests per second
  • Duration – distributions of the amount of time each request takes

Based on these metrics, it’s easy to understand how well the service is performing and to set up your service level objectives (SLOs). The first step is to instrument the GraphQL service with OpenTelemetry.
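As a rough illustration of the arithmetic behind the three RED signals, here is a small Python sketch. In practice a metrics backend would derive these numbers from your telemetry for you. The `RequestRecord` type, its field names and the sample data are assumptions made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One observed request; the field names are illustrative assumptions."""
    duration_ms: float
    failed: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_metrics(records, window_seconds):
    """Summarise Rate, Errors and Duration for requests seen in one time window."""
    durations = sorted(r.duration_ms for r in records)
    failures = sum(1 for r in records if r.failed)
    return {
        "rate_per_s": len(records) / window_seconds,
        "errors_per_s": failures / window_seconds,
        "p50_ms": percentile(durations, 50) if durations else None,
        "p95_ms": percentile(durations, 95) if durations else None,
    }


# Ten made-up requests observed over a 5-second window; one failed and was slow.
sample = [RequestRecord(duration_ms=20.0 + i, failed=False) for i in range(9)]
sample.append(RequestRecord(duration_ms=900.0, failed=True))
summary = red_metrics(sample, window_seconds=5)
print(summary)
```

Duration is reported as percentiles rather than an average because a single slow request, like the 900 ms one here, barely moves the median but dominates the tail.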

There are different implementations of GraphQL available on the market, including Tyk’s implementation. Between the official OpenTelemetry website and tools such as Jaeger’s span metrics collector and Prometheus, you should have everything you need to instrument your service with OpenTelemetry for RED method monitoring. 

GraphQL error detection isn’t the same as REST error detection

In addition to the RED method monitoring, you need to look out for GraphQL getting sneaky – particularly in relation to upstream errors and resolver errors. The sneaky part is that GraphQL can return errors in the response body, not the HTTP layer. 

This is where GraphQL errors are fundamentally different from REST errors. You can no longer rely solely on HTTP status codes: a response can come back as 200 OK and still carry errors in its body. 

Is there an official OpenTelemetry semantic convention with an attribute you can use to catch this type of error? Sadly, no. But all is not lost. Thankfully, you can add your own attribute and do some manual instrumentation. 
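Here is one way that manual instrumentation could look, sketched in Python. The attribute names (`graphql.error.count` and friends) are hypothetical, not an official semantic convention. `RecordingSpan` is a stand-in defined here so the example runs without an SDK installed. It only mirrors the `set_attribute` method that OpenTelemetry spans expose.

```python
import json


class RecordingSpan:
    """Stand-in for a tracing span so the sketch runs without any SDK installed."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def record_graphql_errors(span, http_status, body_text):
    """Flag a request as failed when the GraphQL response body carries errors."""
    try:
        body = json.loads(body_text)
    except ValueError:
        body = {}
    errors = body.get("errors") if isinstance(body, dict) else None
    error_count = len(errors) if isinstance(errors, list) else 0
    failed = http_status >= 400 or error_count > 0

    # Hypothetical attribute names: not an official semantic convention.
    span.set_attribute("graphql.error.count", error_count)
    span.set_attribute("graphql.error.present", failed)
    if error_count:
        first = errors[0]
        message = first.get("message", "") if isinstance(first, dict) else str(first)
        span.set_attribute("graphql.error.first_message", message)
    return failed


# An HTTP 200 response that still reports a resolver error in its body.
response_body = json.dumps({
    "data": {"country": None},
    "errors": [{"message": "upstream weather service timed out", "path": ["country", "weather"]}],
})
span = RecordingSpan()
is_failed = record_graphql_errors(span, 200, response_body)
print(is_failed, span.attributes)
```

With a real tracer you would probably also mark the span's status as an error, so that the failure shows up in the Errors part of your RED metrics.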

The conclusion? GraphQL error detection doesn’t work like standard REST – you need a bit more logic around it. 

OpenTelemetry and GraphQL performance issues 

There are plenty of use cases for GraphQL, as well as plenty of challenges. One added complexity is that each client can have a different performance profile on a per-query basis, depending on how complex its queries are. So, while one client could experience amazing performance, another could be facing performance problems.

There’s also the N+1 challenge, a commonly recognised problem: one query fetches a list of N items, and GraphQL then executes a separate resolver for a field on each of those items, which results in N+1 round trips to the database. 

OpenTelemetry can help. You can use it to detect N+1 queries. You can also use it to set alerts in your test and production environments. That way, you can be aware of these kinds of expensive queries much faster. 
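A minimal sketch of that detection idea in Python: look for the same database statement repeated many times under a single parent span. The span dictionaries, their `parent_id` and `statement` keys and the threshold of 5 are assumptions for illustration. The sketch also assumes statements are recorded in parameterised form, so that repeated lookups share identical text.

```python
from collections import Counter


def find_n_plus_one(spans, threshold=5):
    """Return (parent_id, statement, count) for statements repeated under one parent span."""
    counts = Counter(
        (span["parent_id"], span["statement"])
        for span in spans
        if span.get("statement")
    )
    return [
        (parent_id, statement, count)
        for (parent_id, statement), count in counts.items()
        if count >= threshold
    ]


# Made-up trace: one query lists the posts, then one author lookup runs per post.
trace = [{"parent_id": "resolve-posts", "statement": "SELECT * FROM posts"}]
trace += [
    {"parent_id": "resolve-posts", "statement": "SELECT * FROM authors WHERE id = ?"}
    for _ in range(10)
]
suspects = find_n_plus_one(trace)
print(suspects)
```

The same count could feed an alert: if one request triggers more than a handful of identical lookups, the query is worth a closer look.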

Other issues that can affect GraphQL performance include cyclic queries (a potential vulnerability for GraphQL APIs, as they can be used for denial-of-service attacks) and very expensive queries (those with both greater depth in terms of nesting levels and with higher complexity). 
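To give a feel for how query depth might be measured before execution, here is a deliberately naive Python sketch that counts curly-brace nesting. It is an assumption-laden shortcut: a production check would walk the parsed query and account for fragments and input objects. The `MAX_DEPTH` limit and the sample query are made up.

```python
def max_query_depth(query):
    """Naively measure selection-set nesting by tracking curly braces outside string literals."""
    depth = 0
    deepest = 0
    in_string = False
    escaped = False
    for char in query:
        if in_string:
            # Skip characters inside a quoted argument value, honouring backslash escapes.
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif char == "}":
            depth -= 1
    return deepest


MAX_DEPTH = 4  # Hypothetical limit; tune it to your own schema.

# A cyclic query: posts -> author -> posts -> author -> posts ...
cyclic_query = "{ posts { author { posts { author { posts { title } } } } } }"
depth = max_query_depth(cyclic_query)
print(depth, depth > MAX_DEPTH)
```

Recording the measured depth as a span attribute would let you correlate slow requests with deeply nested queries.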

While OpenTelemetry can help with some of these issues, it’s not quite there yet with others – which is why Tyk is adding its voice to OpenTelemetry community discussions. 

Right now, OpenTelemetry is very helpful for monitoring and troubleshooting GraphQL queries, but there are gaps: semantic conventions aren’t always respected by instrumentation libraries, and the RED metrics monitoring approach needs to be extended to report GraphQL specifics. 

Want to join the conversation about GraphQL and OpenTelemetry? Then why not chat to the Tyk team to find out more?