Improving your telemetry data for better observability 

How can you improve your telemetry data for better observability? This is the question Tyk put to Iris Dyrmishi, Senior Observability Engineer at Miro. Iris had plenty of insights to share – including why your observability team is costing your company so much money! 

In this blog, we’ll cover:

  • Higher quality telemetry data underpins reduced MTTR, higher visibility, better correlations and greater value for money. 
  • A company-wide awareness of observability costs and best practices can do much to drive up data quality. 
  • Improvements to traces, metrics and logs all drive superior observability. 

Watch the full webinar video below, or read on for our top takeaways.

Creating value for money through better observability 

A mature observability system costs money. Data needs to be collected, processed and stored, and that has an inevitable cost. But you don’t need to just throw money at it and collect everything – you can take a smarter approach and constantly improve the quality of your telemetry data. 

This is often a priority for organisations with mature observability systems and experienced observability engineers. And cost is not the only factor in the drive to improve telemetry data. Observability teams are service providers, keen to provide the best for their engineers and for the business. This is where quality trumps quantity. 

The business value of quality telemetry data

When you have quality data, you reduce mean time to repair (MTTR – also known as mean time to recovery). That’s because when an engineer receives an alert and needs to respond to an incident, they don’t have to sift through a pile of junk before they find the data that matters. 

Higher visibility is also interconnected with MTTR, because when you can see the data you need, you don’t waste time on unnecessary information. A dashboard that isn’t built properly – one with ten metrics of the same name, so you can’t tell which one is yours – is hardly going to help you find the details you need efficiently.  

Collecting quality data also supports better correlations. The three pillars of observability – logging, metrics and traces – have traditionally been kept separate, but correlating them, particularly traces and logs, can provide a better approach. Of course, you can’t do that with a simple match that stitches logs and traces together by timestamp. That won’t work. Instead, your data needs to carry the correlation information itself – for example, logs that include the trace and span IDs of the request they belong to. 
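
As an illustration of what that correlation information can look like, here’s a minimal Python sketch using OpenTelemetry’s logging instrumentation (the opentelemetry-instrumentation-logging package) to stamp every log record with the active trace and span IDs. It’s a sketch of one possible approach, not something prescribed in the webinar – the exact mechanism will depend on your own stack.

```python
import logging

from opentelemetry import trace
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.sdk.trace import TracerProvider

# A tracer provider so there is an active span (and therefore a trace ID) to inject.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Patches the log record factory and the log format so every record carries
# otelTraceID / otelSpanID – logs and traces can then be joined on IDs,
# not timestamps.
LoggingInstrumentor().instrument(set_logging_format=True)

with tracer.start_as_current_span("checkout"):
    logging.getLogger(__name__).warning("payment retried")  # now carries the active trace ID
```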

How to improve your telemetry data 

First and foremost, observability is a team sport. An observability system built by a single observability team alone is a wasted opportunity. The engineers who build applications know their apps best, so they should be full owners of their telemetry signals, alerts and dashboards. They should be the ones setting them up – with the observability team there to provide proper guidance, showcase best practices and provide the tools needed for instrumentation, alerts and dashboards. 

It’s also important to build company-wide awareness of cost. Having FinOps work together with observability can be very useful here. Once teams see how much it costs to send traces, they tend to be smarter about what they’re sending and what they’re consuming. The cost goes down and the quality of the data goes up. 

Improving tracing 

Tracing has vast potential – when it’s built right. This means including enough information on the spans. Without it, data quality decreases.

Improving tracing data means you can:

  • Add span events, including descriptions that tell you exactly what happened – this means you don’t need to worry about correlation with logging, because the information is right there on the trace. 
  • Enrich spans with custom attributes – if you’re using legacy or custom instrumentation and need extra information to correlate with logging, you can add it easily with an OpenTelemetry collector. Both techniques are sketched in the example below. 
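
Here’s a minimal sketch of both techniques using the OpenTelemetry Python SDK. The span, attribute and event names are purely illustrative, and attributes can equally be added later in an OpenTelemetry collector pipeline rather than in code.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order") as span:
    # Custom attributes enrich the span with business context
    # (the attribute names here are illustrative, not a standard).
    span.set_attribute("app.order.id", "A-1042")
    span.set_attribute("app.user.tier", "premium")

    # Span events record exactly what happened, directly on the trace,
    # so there is less need to jump across to the logs for the story.
    span.add_event("cache miss", attributes={"cache.key": "order:A-1042"})
    span.add_event("payment authorised")
```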

Improving tracing is unique to each business – just make sure that the events and attributes you add are meaningful. That way, when you need the information, it will be there for you. Just remember that it needs to be uniform – everyone is an owner, but standard formats, according to the guidelines of your observability team, are key to efficiency. 

Key to improving tracing is working out which spans are unnecessary. Tail sampling can help here. Performed in the OpenTelemetry collector, tail sampling happens at the end of the trace: you collect the trace, with all of the spans in it, then apply customisable policies based on attributes, latency… whatever you prefer – they’re extremely flexible. This controls which traces you keep or drop, helping you shape very concise and useful information. 
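
To make the idea concrete, here’s a conceptual Python sketch of the kind of decision a tail-sampling policy makes. In practice you configure this declaratively in the collector’s tail sampling processor rather than writing it yourself, and the field names and thresholds below are hypothetical.

```python
import random

# Conceptual sketch only: each trace is assumed to be a list of span dicts
# with hypothetical "status" and "duration_ms" fields.

def keep_trace(spans, latency_threshold_ms=500, base_rate=0.10):
    if any(s["status"] == "ERROR" for s in spans):
        return True                        # error policy: always keep failing traces
    if any(s["duration_ms"] > latency_threshold_ms for s in spans):
        return True                        # latency policy: keep slow traces
    return random.random() < base_rate     # probabilistic policy for everything else

example_trace = [
    {"status": "OK", "duration_ms": 120},
    {"status": "OK", "duration_ms": 640},
]
print(keep_trace(example_trace))  # True – the slow span trips the latency policy
```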

Improving metrics

One of the biggest issues that engineers report with metrics data is high cardinality. How you evaluate cardinality in your metrics will depend on your setup. You could create cardinality alerting or dashboards, for example, to help identify metrics with high cardinality and evaluate the need for certain labels. 
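
As one possible starting point – assuming a Prometheus-compatible backend, which is an assumption about your setup rather than anything prescribed here – the TSDB status API reports which metric names contribute the most time series:

```python
import requests

# Assumes a Prometheus-compatible backend exposing the TSDB status API;
# the URL is a placeholder for your own setup.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/status/tsdb", timeout=10)
resp.raise_for_status()

# seriesCountByMetricName lists the metrics contributing the most time series –
# a good starting point for questioning which labels drive the cardinality.
for entry in resp.json()["data"]["seriesCountByMetricName"]:
    print(f"{entry['name']}: {entry['value']} series")
```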

Again, teamwork is the order of the day. Engineering teams shouldn’t just create the telemetry signals in the first place but should be involved in this process of cleaning up metrics. They’re the owners – they need to decide what to drop. All under the guidance of the observability team, of course. Good communication and good guidelines are essential. 

The other key to improving metrics is to challenge everything. Consider whether a metric is supposed to be a metric. Could it be a log? Is it necessary? Might it fit better as part of your tracing? Don’t just build on top of what’s there – challenge and refine first, and your observability will improve as a result. 

Put metrics volume under the microscope too. It’s easy to end up with a lot of metrics. Millions and millions of time series. So it’s time to ask some important questions:

  • Do you really need to scrape all your targets as often as you do? Consider the interval you need and how it could affect alerting and performance. 
  • Are the metrics being used? Constantly check for unused metrics and evaluate whether you can drop them – one way to do this at the source is sketched after this list. 
  • Do you have the same label under different names? This happens a lot. Consider introducing OpenTelemetry instrumentation libraries for uniform labelling and conventions for custom metrics. Doing so can save you a lot of money and drive up the quality of your data – particularly when you consider that alerts and dashboards will be built on top of this labelling. 
  • Are your engineers having issues finding metrics? Uniformity can help here, too, as can enriching the most significant labels. 
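
Here’s a hedged sketch of what acting on some of these questions can look like at the source, using views in the OpenTelemetry Python SDK. The instrument names and label keys are illustrative, and the same pruning can be done centrally in a collector or at scrape time instead.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import DropAggregation, View

views = [
    # Drop an instrument nobody looks at any more (the name is illustrative).
    View(instrument_name="legacy.queue.depth", aggregation=DropAggregation()),
    # For request duration, keep only the labels dashboards actually use,
    # which caps cardinality at the source.
    View(
        instrument_name="http.server.duration",
        attribute_keys={"http.route", "http.response.status_code"},
    ),
]

# The export interval is also worth questioning – every 60s here, not every 5s.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader], views=views))
```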

Improving logging 

Mapping explosions are common when it comes to logging, and they make it difficult to filter or query your logs. Lack of uniformity is often at the heart of this, as every application imposes its own rules. OpenTelemetry logging can provide uniformity to combat this. 
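
As an illustration of the principle (not a prescription of any particular library), a single shared formatter with a fixed field set – loosely modelled on the OpenTelemetry log data model – stops every application inventing its own log shape:

```python
import json
import logging

# One shared formatter with a fixed field set, so applications stop
# inventing their own field names and the backend's mapping stays small.
class UniformJsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "body": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(UniformJsonFormatter())
logging.getLogger().addHandler(handler)

logging.getLogger().warning("cache nearly full", extra={"service": "billing"})
```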

Another logging issue is that a lot of debug logs are active 24/7, often unnecessarily. Do you really need them running around the clock? And what if you implemented a policy that drops debug logs after a certain amount of time? Getting FinOps’ input into this conversation could represent a significant cost saving while also improving your data quality, just as we discussed above in relation to tracing. 
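
One lightweight way to act on this – purely illustrative, with a hypothetical environment variable – is a logging filter that only lets DEBUG records through during a temporary debug window:

```python
import logging
import os
import time

# Purely illustrative: DEBUG records only pass while a temporary debug window
# is open (the DEBUG_UNTIL_EPOCH environment variable is hypothetical).
DEBUG_WINDOW_ENDS = float(os.environ.get("DEBUG_UNTIL_EPOCH", "0"))

class DebugWindowFilter(logging.Filter):
    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True                         # always keep INFO and above
        return time.time() < DEBUG_WINDOW_ENDS  # drop DEBUG outside the window

root = logging.getLogger()
root.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(DebugWindowFilter())
root.addHandler(handler)

root.debug("verbose detail")   # dropped unless the window is open
root.info("normal operation")  # always kept
```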

While we’re on the subject of logging issues, let’s talk about personally identifiable information (PII) in log data. It shouldn’t happen, but it does – for all sorts of reasons. Observability teams need to identify this and work with engineering teams to get to the root of the problem. Implementing filters that strip out PII before it reaches the back end can help. Ultimately, the goal is for PII not to be generated in application logs at all. (If you really do need PII in logging, you’ll need a fully isolated solution for audit logs and PII security.)
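
Here’s a minimal sketch of that kind of filter in Python. The redaction pattern only covers email addresses and is illustrative – in practice redaction is often done in the telemetry pipeline (for example in a collector) rather than in the application.

```python
import logging
import re

# Illustrative pattern – real PII detection usually needs more than one regex.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class PiiRedactionFilter(logging.Filter):
    """Redacts email addresses from log messages before they are emitted."""

    def filter(self, record):
        record.msg = EMAIL_PATTERN.sub("[REDACTED]", record.getMessage())
        record.args = ()  # the message has already been fully rendered above
        return True

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.addFilter(PiiRedactionFilter())
logger.addHandler(handler)

logger.warning("refund issued to jane.doe@example.com")  # -> refund issued to [REDACTED]
```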

Quick wins for improving your observability data 

Why not explore these other ways to improve your observability data?

  • Spans to metrics lets you derive metrics about latency, call counts, sizes and plenty more directly from your traces – a conceptual sketch follows this list. 
  • Architecture diagrams can help you see your applications better with a simple query. 
  • Auto-instrumentation during application runtime can be great for teams that don’t want to make a lot of code changes or have sensitive applications. 
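
Here’s a conceptual Python sketch of the spans-to-metrics idea – deriving a latency histogram from finished spans. In practice this is usually done centrally (for example in an OpenTelemetry collector) rather than in application code, and the metric and attribute names are illustrative.

```python
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider

# A histogram that will receive one measurement per finished span.
metrics.set_meter_provider(MeterProvider())
duration_ms = metrics.get_meter(__name__).create_histogram(
    "span.duration", unit="ms", description="Span duration derived from traces"
)

class SpanMetricsProcessor(SpanProcessor):
    def on_end(self, span):
        elapsed_ms = (span.end_time - span.start_time) / 1_000_000  # ns -> ms
        duration_ms.record(elapsed_ms, {"span.name": span.name})

provider = TracerProvider()
provider.add_span_processor(SpanMetricsProcessor())
trace.set_tracer_provider(provider)
```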

Putting all of this together means you can have better data, a better observability experience and lower costs. A triple win for your teams and your entire organisation. 

If you’re ready to embrace better observability, why not read up on OpenTelemetry integration with Tyk API Gateway?