Tyk Gateway exports metrics and traces via the OpenTelemetry Protocol (OTLP) and writes logs to stderr for external collection. This page covers production-ready configuration for each signal: which export topology to use, how to control metrics cardinality, how to tune trace sampling, and how to correlate logs with traces.

Exporting Data via OTLP

Use OTLP as the export protocol. It is vendor-neutral and supported by most major observability backends. The Gateway supports both gRPC (the default, and more efficient at high throughput) and HTTP transports. Choose gRPC unless your network or backend requires HTTP.

Export Topology

  • Direct to backend: simpler setup, works well for managed cloud backends (Datadog, Dynatrace, New Relic, Elastic Cloud). Suitable for lower traffic volumes where buffering and retry are handled by the backend.
  • Via OTel Collector: recommended for production. Decouples Tyk from the backend, adds buffering and retry, enables tail-based sampling (see Trace Sampling), and fans out to multiple backends simultaneously.

Open Source Backends

| Backend | Best for | Notes |
| --- | --- | --- |
| Grafana LGTM (Loki, Grafana, Tempo, Mimir) | All signals | Tempo for traces, Mimir/Prometheus for metrics, Loki for logs |
| Jaeger | Tracing only | Accepts OTLP natively since v1.35; simple to self-host |
| Prometheus + Grafana | Metrics only | Pull model; use OTel Collector’s Prometheus exporter or remote_write |
| ELK / OpenSearch | Logs + APM | Elastic APM accepts OTLP; Logstash can ingest OTel Collector output |

For vendor-specific configuration (Datadog, Dynatrace, New Relic, Elastic, Jaeger), see the Traces configuration guide.

Metrics: Cardinality and Performance

What Is the Cardinality Problem?

Each unique combination of dimension values for a metric creates a separate time series. The series count is bounded by the product of the number of distinct values in each dimension, so a single unbounded dimension, such as one label per user, per IP address, or per request ID, multiplies the total. This consumes memory in Tyk Gateway, increases storage costs in your backend, and slows query performance.
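To make the multiplication concrete, the following is a minimal sketch that computes the worst-case series count for one metric. The dimension names and value counts are illustrative assumptions, not figures from a real deployment.

```python
from math import prod


def series_count(distinct_values: dict[str, int]) -> int:
    """Worst-case time series for one metric: product of distinct values per dimension."""
    return prod(distinct_values.values())


# Hypothetical bounded dimensions.
bounded = {"method": 5, "status_code": 10, "endpoint": 30}
print(series_count(bounded))  # 1500

# Adding one unbounded dimension (assumed 10,000 distinct users) multiplies the total.
with_user = {**bounded, "user_id": 10_000}
print(series_count(with_user))  # 15000000
```

This is an upper bound: the real series count depends on which combinations actually occur in traffic.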

Tyk’s Built-In Cardinality Limit

By default, Tyk caps each metric instrument at 2,000 unique label-value combinations; you can set a different limit using the cardinality control. When the configured limit is exceeded, additional combinations are aggregated into an overflow bucket marked with the attribute otel.metric.overflow=true. The overflow bucket preserves aggregate counts, but you can no longer break the data down by the overflowing dimension combinations. Alert on overflow to catch cardinality issues early. The following example uses PromQL:
increase(<your_metric_name>{otel_metric_overflow="true"}[5m]) > 0
If you hit the cardinality cap, you can raise the limit, but the recommended fix is to reduce cardinality in your custom metrics configuration.
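The overflow behaviour described above can be illustrated with a toy model. This is a sketch of the idea only, not the OTel SDK's or Tyk's implementation; the class name `LimitedCounter` and the way the limit is counted are assumptions made for illustration.

```python
from collections import defaultdict

# Label set used for the single overflow series.
OVERFLOW_KEY = (("otel.metric.overflow", "true"),)


class LimitedCounter:
    """Toy counter that folds new label sets into one overflow series once the limit is hit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.series: dict[tuple, int] = defaultdict(int)

    def _distinct(self) -> int:
        # Number of regular (non-overflow) series recorded so far.
        return sum(1 for key in self.series if key != OVERFLOW_KEY)

    def add(self, value: int, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        if key not in self.series and self._distinct() >= self.limit:
            key = OVERFLOW_KEY  # new combination past the limit: aggregate it
        self.series[key] += value


counter = LimitedCounter(limit=2)
for user in ["a", "b", "c", "d"]:
    counter.add(1, user_id=user)

print(sum(counter.series.values()))  # 4: the aggregate count is preserved
print(counter.series[OVERFLOW_KEY])  # 2: users "c" and "d" can no longer be told apart
```

The total stays correct, which is why overflow is easy to miss without an explicit alert.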

Custom Metrics: Stay at 10 Dimensions or Fewer

The OTel SDK processes metrics on a fast path when an instrument has 10 or fewer dimensions. Exceeding this threshold increases memory allocations and slows metric recording. Keep each custom metric instrument to 10 or fewer dimensions.

Dimension Safety Guide

When defining custom metrics, choose dimension sources based on their cardinality characteristics:

| Cardinality | Sources | Examples |
| --- | --- | --- |
| Safe | metadata, config_data bounded fields | listen_path, endpoint, method, http.response.status_code |
| Caution | session fields, bounded JWT claims | api_key, oauth_id, alias (bounded per tenant but can be large) |
| Avoid | header / context with unbounded values | ip_address, user_id JWT claim, request_id, raw path |
| Never | Token or bearer values | Unique per request; exhausts the cardinality limit immediately |
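A review of this kind can be automated before a configuration is deployed. The sketch below is hypothetical: the dictionary layout (`label`, `source`) is not Tyk's custom metrics schema, and the risk ratings are a simplification of the guide above (header and context sources are only a problem when their values are unbounded).

```python
MAX_DIMENSIONS = 10  # fast-path threshold stated in this guide

# Simplified ratings taken from the dimension safety guide.
SOURCE_RISK = {
    "metadata": "safe",
    "config_data": "safe",
    "session": "caution",
    "header": "avoid",
    "context": "avoid",
}


def review_dimensions(dimensions: list[dict[str, str]]) -> list[str]:
    """Return human-readable warnings for one custom metric's dimensions."""
    warnings = []
    if len(dimensions) > MAX_DIMENSIONS:
        warnings.append(
            f"{len(dimensions)} dimensions exceeds the limit of {MAX_DIMENSIONS}"
        )
    for dim in dimensions:
        risk = SOURCE_RISK.get(dim["source"], "unknown")
        if risk != "safe":
            warnings.append(f"{dim['label']}: source '{dim['source']}' is rated {risk}")
    return warnings


# Hypothetical metric definition with one risky dimension.
example = [
    {"label": "method", "source": "metadata"},
    {"label": "client_ip", "source": "header"},
]
print(review_dimensions(example))
```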

Traces: Sampling Strategy

Trace data is the most expensive observability signal to collect, store, and query. Sampling controls what fraction of traces you keep.

Head-Based Sampling

With head-based sampling, the Gateway decides whether to record a trace at the start of the request, before any span data exists. This is controlled using the TraceIDRatioBased approach as described in the trace sampling section.
{
  "opentelemetry": {
    "traces": {
      "sampling": {
        "type": "TraceIDRatioBased",
        "rate": 0.1
      }
    }
  }
}
Characteristics:
  • Fast: no buffering required, predictable overhead
  • Limitation: cannot guarantee capture of all errors or slow outliers at low sample rates
  • Recommended default: 10% (rate: 0.1) for most production deployments
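For intuition, a ratio-based sampler can be sketched as a comparison between part of the trace ID and a threshold derived from the rate. This is an illustrative sketch, assuming 128-bit trace IDs held as integers; real SDKs differ in which bits they compare, and the function name `should_sample` is hypothetical.

```python
import random


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below rate * 2**64."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    bound = round(rate * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


# Simulated traffic: random 128-bit trace IDs with a fixed seed.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(kept / len(trace_ids))  # close to 0.1
```

Because the decision depends only on the trace ID, it needs no buffering, and every component that applies the same rule reaches the same verdict for a given trace.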

Tail-Based Sampling

With tail-based sampling, the sampling decision is made in the OTel Collector after the full trace has been collected. The Collector buffers spans in memory, then applies policies to decide what to keep. This lets you retain every error trace and every slow request regardless of the overall sample rate. To use tail-based sampling, send all traces from Tyk to the Collector using the AlwaysOn approach (the Gateway's default setting), as described in the trace sampling section, then apply policies in the Collector.
# OTel Collector processors section
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
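The effect of the three policies above can be sketched as a decision function over a completed trace. This is a simplified model, not the Collector's implementation: the `Trace` type and `keep_trace` function are hypothetical, latency is treated as strictly greater than the threshold, and the probabilistic policy is approximated by hashing the trace ID.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(
    trace: Trace,
    threshold_ms: float = 1000,
    sampling_percentage: float = 5,
) -> Optional[str]:
    """Return the name of the first policy that keeps the trace, or None to drop it."""
    if trace.has_error:
        return "keep-errors"
    if trace.duration_ms > threshold_ms:
        return "keep-slow"
    # Map the trace ID to a stable bucket in [0, 10000) so the decision is repeatable.
    bucket = zlib.crc32(trace.trace_id.encode()) % 10_000
    if bucket < sampling_percentage * 100:
        return "sample-rest"
    return None


print(keep_trace(Trace("t1", has_error=True, duration_ms=12)))     # keep-errors
print(keep_trace(Trace("t2", has_error=False, duration_ms=2500)))  # keep-slow
```

The decision needs the error status and total duration, which are only known once the trace is complete; this is why the Collector must buffer spans for the decision_wait period.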

Sampling Trade-Offs

| Strategy | Cost | Accuracy | Best for |
| --- | --- | --- | --- |
| AlwaysOn (100%) | High | Perfect | Dev/staging, low-traffic APIs |
| TraceIDRatioBased, rate: 0.1 (10%) | Low | Good average | Most production deployments |
| TraceIDRatioBased, rate: 0.01 (1%) | Very low | Poor for rare events | Very high traffic, budget-constrained |
| Tail-based via OTel Collector | Medium (Collector infra) | Best | When you need all errors and outliers |
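The "poor for rare events" rating follows from simple arithmetic. As a rough sketch, assuming each trace is sampled independently at the head-based rate, the chance of capturing at least one of n occurrences of a rare event is 1 - (1 - rate)^n. The function name below is hypothetical.

```python
def capture_probability(rate: float, occurrences: int) -> float:
    """Probability that at least one of `occurrences` traces is sampled at `rate`."""
    return 1 - (1 - rate) ** occurrences


# An error that occurs 10 times during an incident:
print(round(capture_probability(0.1, 10), 3))   # 0.651
print(round(capture_probability(0.01, 10), 3))  # 0.096
```

At a 1% rate, an error that occurs ten times is more likely to be missed than captured, which is the gap tail-based sampling closes.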

Logs: Collection and Correlation

Tyk Gateway writes logs to stderr, not via OTLP. They must be collected by an external agent.

Enable Access Logs

Access logs record one line per request: HTTP method, path, upstream latency, status code, response size, and client IP. They are the fastest way to observe Gateway traffic without a full tracing setup and are essential for log-based alerting such as 5xx spike detection. Enable access logs in the Gateway configuration or with the equivalent environment variable:
{
  "access_logs": {
    "enabled": true
  }
}
See Access Logs for the full list of configurable fields and custom template options.

Use JSON Log Format

JSON format has lower parsing overhead than the default text format and is directly indexable by log backends (Loki, Elasticsearch, CloudWatch). Enable it in the Gateway configuration or with the equivalent environment variable:
{
  "log_format": "json"
}
JSON log format is recommended for all production deployments.

Trace Correlation

When OpenTelemetry is enabled, Tyk injects trace_id and span_id into request-scoped log entries. This lets you pivot from a log line directly to the corresponding trace in Tempo or Jaeger.
Note: trace_id and span_id are present only on sampled requests. With 10% head-based sampling, roughly 90% of request-scoped log lines carry no trace IDs.
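When consuming JSON logs downstream, it helps to handle both correlated and uncorrelated lines explicitly. The sketch below groups log entries by trace_id and sets aside entries without one. The sample lines are invented for illustration; apart from trace_id and span_id, the field names are assumptions and may not match the Gateway's actual log schema.

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """Split JSON log lines into {trace_id: [entries]} and a list of uncorrelated entries."""
    by_trace = defaultdict(list)
    uncorrelated = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if not isinstance(entry, dict):
            continue
        trace_id = entry.get("trace_id")
        if trace_id:
            by_trace[trace_id].append(entry)
        else:
            uncorrelated.append(entry)
    return dict(by_trace), uncorrelated


# Hypothetical log lines: two from a sampled request, one from an unsampled request.
sample_lines = [
    '{"level": "info", "msg": "request start", "trace_id": "abc123", "span_id": "01"}',
    '{"level": "error", "msg": "upstream timeout", "trace_id": "abc123", "span_id": "02"}',
    '{"level": "info", "msg": "request start"}',
    "not json",
]

traces, rest = group_by_trace(sample_lines)
print(len(traces["abc123"]), len(rest))  # 2 1
```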

Log Collection Options

| Stack | Recommended Approach |
| --- | --- |
| Grafana LGTM | Promtail or Grafana Alloy → Loki |
| ELK / OpenSearch | Filebeat or Logstash |
| Kubernetes | OTel Collector with filelog receiver; see Collecting Gateway Logs with OTel on Kubernetes |
| Cloud (AWS/GCP/Azure) | CloudWatch agent, GCP Logging agent, or Azure Monitor agent |

Performance Impact Summary

| Signal | Baseline Cost | Main Risk | Mitigation |
| --- | --- | --- | --- |
| Metrics | Low (batched OTLP export) | Cardinality explosion in custom metrics | ≤10 dimensions per instrument; monitor otel.metric.overflow |
| Traces | Medium (per-request spans) | High sampling rate at scale | Head-based 10% default; tail-based for error capture |
| Logs | Low (stderr write) | Log volume at debug verbosity | Use info level in production; enable JSON format |
| Analytics (Tyk Pump) | Medium–High (Redis writes) | ~13% RPS reduction when tracking all requests | Use Do-Not-Track middleware for non-critical endpoints |