Overview

Tyk Gateway 5.13+ exports metrics natively via OTLP, removing the need for Tyk Pump as an intermediary for Prometheus-based observability:
Gateway → OTel Collector → any OTLP backend
                 ↓
    Prometheus scrape endpoint
Every request the Gateway handles generates three signal types that share common identifiers, enabling end-to-end correlation:
| Identifier | In logs | In metrics (Prometheus label) | In traces |
| --- | --- | --- | --- |
| API ID | api_id | tyk_api_id | tyk.api.id attribute |
| Response flag | response_flag | tyk_response_flag | |
| Consumer key | api_key | Available via custom api_metrics config | tyk.api.apikey |
| Trace ID | trace_id | Not available (use exemplars) | span traceId |
| Gateway instance | service_name | service.instance.id | |
This guide covers two categories of metrics:
  • RED metrics: What your APIs are doing right now: request rate, error classification, and latency decomposition.
  • Gateway health metrics: What the Gateway process itself is doing: memory, goroutines, and configuration state.

Prerequisites

  • Tyk Gateway 5.13 or later
  • OpenTelemetry metrics enabled (opentelemetry.metrics.enabled: true in tyk.conf, or TYK_GW_OPENTELEMETRY_METRICS_ENABLED=true)
  • An OTel Collector configured to export to Prometheus (or a compatible OTLP backend)
  • Prometheus scraping the Collector’s metrics endpoint
If you haven’t set up the OTel pipeline yet, see OpenTelemetry tracing and metrics.

Key thresholds at a glance

Use these as starting points and adjust based on your API’s traffic profile and SLOs.
| Signal | Healthy threshold | Alert threshold |
| --- | --- | --- |
| p95 end-to-end latency | < 500ms | > 1s for 5 min |
| p99 end-to-end latency | < 1s | > 2s for 5 min |
| Error rate (non-2xx) | < 2% | > 10% for 5 min |
| Gateway-only avg latency | < 10ms | > 50ms |
| Goroutine count | < 2,000 | > 5,000 for 10 min |
| Auth failure rate | < 0.01/s baseline | > 0.1/s for 2 min |
| Upstream error rate (URS) | 0 | Any sustained for 3 min |

Core RED Metrics

Request Rate

Request rate tells you how much traffic the Gateway is handling. Sudden drops indicate that requests are not reaching the Gateway: DNS failures, network partitions between clients and the Gateway, or client-side misconfiguration.
Metrics:
| Metric name (OTel) | Prometheus name | Type | Key dimensions |
| --- | --- | --- | --- |
| tyk.api.requests.total | tyk_api_requests_total | counter | http_request_method, http_response_status_code, tyk_api_id |
What are normal request rate thresholds?
There is no universal baseline; request rate varies by deployment. Establish a baseline over 7 days and alert on deviations greater than ±30% from the rolling average for the same hour of the prior week. A more actionable check is to watch for a sudden drop to near-zero on a previously active API:
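# Alert if the API's total rate drops > 80% compared to 1 hour ago.
# sum() collapses the per-method and per-status series so the comparison
# reflects overall traffic rather than any single label combination.
sum(rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[5m]))
  < 0.2 * sum(rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[5m] offset 1h))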
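The week-over-week deviation check described above can be sketched in PromQL as follows. This is a minimal sketch rather than an official Tyk rule, and it makes two simplifying assumptions: a 1-hour rate window stands in for "the same hour", and the comparison uses a single prior week instead of a rolling multi-week average.

```promql
# Assumption: 1h window approximates "the same hour"; offset 1w is one prior week.
# Fires when the current rate differs from last week's by more than 30%.
abs(
  sum(rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[1h]))
  - sum(rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[1h] offset 1w))
)
/ sum(rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[1h] offset 1w))
> 0.3
```

The query returns nothing until a full week of history exists, and it is likely to be noisy for low-traffic APIs where small absolute changes produce large relative swings.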
Troubleshooting unexpected changes in request rate:
| Issue | Possible Causes | Remediation |
| --- | --- | --- |
| Rate drops to zero on one API, others normal | API definition deleted or disabled; DNS for that API’s listen path changed | Check Gateway logs for api_id config errors; verify API is active in Tyk Dashboard |
| Rate drops uniformly across all APIs | Network partition between clients and Gateway; load balancer health check failure; OTel Collector not receiving (metrics gap, not a real traffic drop) | Check Gateway health endpoint; confirm traffic is actually dropping and not just metric export failing |
| Rate spikes suddenly on one API | Traffic surge, DDoS, or runaway batch client | Filter logs by api_id and group by api_key to identify the calling consumer |
| Rate split unevenly across Gateway instances | Sticky sessions or uneven load balancer weights | Check service.instance.id label in metrics to compare per-instance rates |

Error Rate

Error rate is the primary SLI for API availability. Tyk classifies every non-success response with a response_flag: a two- or three-character code that tells you exactly where and why a request failed, before you read a single log line.
Metrics:
| Metric name (OTel) | Prometheus name | Type | Key dimensions |
| --- | --- | --- | --- |
| http.server.request.duration | http_server_request_duration_seconds | histogram | http_request_method, http_response_status_code, tyk_api_id, tyk_response_flag |
| tyk.api.requests.total | tyk_api_requests_total | counter | http_request_method, http_response_status_code, tyk_api_id |
Response flag reference (full list: Error classification):
| Flag | HTTP Status | Upstream called? | error_source | Meaning |
| --- | --- | --- | --- | --- |
| (HTTP status code, e.g. 200) | 200 | Yes | (absent) | Successful request; tyk_response_flag is set to the HTTP status code |
| AMF | 401 | No | AuthKey | Auth header entirely absent |
| AKI | 403 | No | AuthKey | Auth header present, key invalid or expired |
| QEX | 403 | No | RateLimitAndQuotaCheck | Key’s quota window exhausted |
| RLT | 429 | No | RateLimitAndQuotaCheck | Per-second rate limit exceeded |
| URS | 500 | Yes | Upstream | Upstream returned a 5xx error |
What error rate thresholds should I set?
A non-2xx rate below 2% is a healthy baseline for most APIs with authenticated consumers. Alert at 10% over a 5-minute window for most APIs. For payment or health endpoints, tighten to 1%.
Calculate the current error rate:
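# Aggregate both sides with sum(): without it, each non-2xx series is divided
# by itself (identical label sets) and the ratio is always 1.
(
  sum(rate(tyk_api_requests_total{http_response_status_code!~"2..", tyk_api_id="<api_id>"}[5m]))
  /
  sum(rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[5m]))
) * 100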
Classify errors by flag to route to the right runbook:
sum by (tyk_response_flag) (
  rate(http_server_request_duration_seconds_count{tyk_api_id="<api_id>"}[5m])
)
Check upstream_latency in logs first. If it is 0, the request never left the Gateway: the error originated in auth, quota, or rate-limit middleware. If upstream_latency > 0 and response_flag = URS, the upstream itself failed.
Troubleshooting elevated error rates:
| Issue | Possible Causes | Remediation |
| --- | --- | --- |
| Surge in AMF (401) | Auth header entirely absent at Tyk: the client never sends it, or an intermediary (load balancer, CDN, reverse proxy) strips it before it reaches the Gateway | Identify source: {prefix="access-log"} \| json \| response_flag="AMF" \| line_format "ip={{.client_ip}} api={{.api_name}}". Single IP after a network change → suspect header stripping by an intermediary. Wide IP spread → clients calling the wrong listen path |
| Surge in AKI (403) | Key was rotated, expired, or revoked; credential stuffing attack | Single source IP → contact consumer. Wide IP spread → credential attack; tighten rate limits |
| Sustained QEX (403) | Consumer legitimately exhausted their quota tier | Identify consumer via the api_key field in logs; invite upgrade or raise quota ceiling in policy |
| RLT (429) climbing | Legitimate traffic burst; retry storm after upstream error | Check whether RLT is protecting a degraded upstream (correct) or blocking legitimate traffic (adjust limit). Add backoff guidance for the consumer |
| Sustained URS (500) | Backend service degraded; upstream 5xx errors | Extract error_target (upstream hostname) and upstream_addr from logs; escalate to backend team. Check retry/circuit-breaker plugin config |
| Non-2xx with no response_flag | Auth plugin returning custom status; unhandled middleware error | Search Gateway error logs for the request’s trace_id |

Latency

Tyk exports three latency histograms that let you decompose end-to-end response time into what the Gateway spent versus what the upstream spent. Use them together; any one alone gives an incomplete picture.
Metrics:
| Prometheus name | Measures | Key labels |
| --- | --- | --- |
| http_server_request_duration_seconds | Total time from first byte received to last byte sent (client view) | http_request_method, http_response_status_code, tyk_api_id, tyk_response_flag |
| tyk_gateway_request_duration_seconds | Gateway-only processing time (middleware chain, auth, rate limiting) | http_request_method, tyk_api_id, tyk_response_flag |
| tyk_upstream_request_duration_seconds | Time waiting for the upstream service to respond | http_request_method, tyk_api_id, tyk_response_flag |
Histogram bucket boundaries (all three metrics, in seconds):
0.001  0.005  0.01  0.025  0.05  0.1  0.25  0.5  1  2.5  5  10  +Inf
What latency thresholds should I set?
| Percentile | Target | Alert |
| --- | --- | --- |
| p50 (median) | < 100ms | |
| p95 | < 500ms | > 1s for 5 min |
| p99 | < 1s | > 2s for 5 min |
| Gateway-only average | < 10ms | > 50ms |
Latency isolation rule: If http_server_request_duration_seconds p95 is high but tyk_gateway_request_duration_seconds p95 is < 10ms, the Gateway is healthy: the latency is upstream. If both histograms are elevated, the gateway is the bottleneck.
Query p95 and p99 end-to-end:
histogram_quantile(0.95, rate(http_server_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m]))
histogram_quantile(0.99, rate(http_server_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m]))
Query Gateway-only average:
rate(tyk_gateway_request_duration_seconds_sum{tyk_api_id="<api_id>"}[5m])
  / rate(tyk_gateway_request_duration_seconds_count{tyk_api_id="<api_id>"}[5m])
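To apply the latency isolation rule, it can help to put the Gateway-only and upstream p95 values side by side with the end-to-end p95. The queries below are a sketch: they assume the two histograms expose `_bucket` series under the standard Prometheus naming, as the end-to-end histogram does, and they aggregate with `sum by (le)` to produce one API-level value instead of one per method and flag.

```promql
# Gateway-only p95: time spent in the middleware chain
histogram_quantile(0.95,
  sum by (le) (rate(tyk_gateway_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m]))
)

# Upstream p95: time spent waiting on the backend
histogram_quantile(0.95,
  sum by (le) (rate(tyk_upstream_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m]))
)
```

If the upstream p95 tracks the end-to-end p95 while the Gateway-only p95 stays under 10ms, the latency originates in the backend.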
Troubleshooting high latency:
| Issue | Possible Causes | Remediation |
| --- | --- | --- |
| High p95 overall, gateway avg < 10ms | Upstream service degradation; slow backend endpoint | Check tyk_upstream_request_duration_seconds p95. Extract error_target from URS logs. Escalate to backend team |
| Both gateway and upstream histograms elevated | Gateway CPU saturation; goroutine backpressure | Check go_goroutine_count trend and go_memory_used_bytes. See Goroutines below |
| p99 widening without p50 or p95 change | Load-induced queuing tail; a fraction of requests hitting a slow upstream path | Filter slow requests in logs: latency_total > 1000. Copy trace_id → Jaeger/Tempo to see which span is slow |
| High latency scoped to one client | Client calling a slow upstream endpoint (not a gateway issue) | Per-key filter in logs: api_id="<api_id>" \| line_format "key={{.api_key}} latency={{.latency_total}} path={{.path}}" |
| Latency spike after config reload | Cold cache or policy recalculation during reload | Check tyk_gateway_config_reload_duration_seconds. Latency spike should resolve within 1–2 minutes |

Gateway Health Metrics

Gateway health metrics reflect the internal state of the Gateway process itself, independent of API traffic. They are available when opentelemetry.metrics.enabled: true (Tyk Gateway 5.13+) and can be disabled independently with runtime_metrics: false if you only need RED signals.

Memory

Metrics:
| Metric name (OTel) | Prometheus name | Type | What it tells you |
| --- | --- | --- | --- |
| go.memory.used | go_memory_used_bytes | Gauge | Memory in use by the Go runtime, broken down by go_memory_type label ("stack" or "other"). Monitor go_memory_type="other" for leak detection; the stack series grows proportionally with goroutine count |
| go.memory.gc.goal | go_memory_gc_goal_bytes | Gauge | Target heap size before next GC cycle |
| go.memory.limit | go_memory_limit_bytes | Gauge | Configured GOMEMLIMIT value; use as the denominator for utilization alerts (alert when go_memory_used_bytes exceeds 85% of this) |
| go.memory.allocated | go_memory_allocated_bytes_total | Counter | Total bytes allocated since startup |
| go.memory.allocations | go_memory_allocations_total | Counter | Total allocation count since startup. High rate = allocation pressure |
What memory thresholds should I set?
There is no fixed absolute limit; thresholds depend on how many APIs are loaded. Alert on rate of growth instead:
  • go_memory_used_bytes{go_memory_type="other"} growing > 10% per hour with stable traffic and stable API count → investigate
  • If you set GOMEMLIMIT, alert when go_memory_used_bytes{go_memory_type="other"} exceeds 85% of go_memory_limit_bytes; GC will become aggressive and start impacting request latency
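The two memory checks above might be expressed in PromQL roughly as follows. This is a sketch under assumptions: the `service_name="tyk-gateway"` selector is borrowed from the other health queries in this guide, and the `ignoring(go_memory_type)` clause assumes go_memory_limit_bytes carries no go_memory_type label.

```promql
# Growth check: "other" memory is more than 10% higher than it was 1 hour ago
(
  go_memory_used_bytes{service_name="tyk-gateway", go_memory_type="other"}
  / (go_memory_used_bytes{service_name="tyk-gateway", go_memory_type="other"} offset 1h)
) > 1.10

# Utilization check: "other" memory exceeds 85% of the configured GOMEMLIMIT
(
  go_memory_used_bytes{service_name="tyk-gateway", go_memory_type="other"}
  / ignoring(go_memory_type) go_memory_limit_bytes{service_name="tyk-gateway"}
) > 0.85
```

The growth check is only meaningful while traffic and API count are stable, so pair it with a `for:` duration or correlate it with tyk_gateway_apis_loaded before paging anyone.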
Troubleshooting memory issues:
| Issue | Possible Causes | Remediation |
| --- | --- | --- |
| go_memory_used_bytes{go_memory_type="other"} growing monotonically over hours with stable API count | Memory leak in a middleware or connection pool | Requires "enable_http_profiler": true in tyk.conf. Capture heap profile: curl http://gateway:<control_api_port>/debug/pprof/heap > heap.pprof. Contact Tyk support with the profile and trend chart |
| Memory growing proportionally with API count | Normal: each API definition has memory overhead | Increase instance memory; review whether all loaded APIs are still needed |
| Memory growing faster than expected with stable API count | High allocation pressure from request handling | Check rate of go_memory_allocations_total; if climbing steeply, contact Tyk support with a heap profile |

Goroutines

Metrics:
| Metric name (OTel) | Prometheus name | Type | What it tells you |
| --- | --- | --- | --- |
| go.goroutine.count | go_goroutine_count | Gauge | Number of active goroutines. Monotonic growth = leak |
| go.processor.limit | go_processor_limit | Gauge | Number of OS threads available (GOMAXPROCS) |
| go.config.gogc | go_config_gogc_percent | Gauge | GC target percentage (GOGC env var) |
What goroutine thresholds should I set?
A healthy Gateway at moderate load runs with 500–2,000 goroutines. Alert at 5,000 goroutines sustained for 10 minutes. A one-time spike during a traffic burst is normal; a monotonically increasing trend over hours is not.
go_goroutine_count{service_name="tyk-gateway"}
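An absolute threshold catches a leak only once it is large. One way to spot the "monotonically increasing over hours" pattern earlier is to look at the slope of the gauge. The sketch below uses PromQL's `deriv()`, which fits a linear regression over the window; the 6-hour window and the 0.05/s cutoff (about 180 goroutines per hour) are illustrative assumptions, not Tyk recommendations.

```promql
# Per-second slope of the goroutine count over the last 6 hours.
# Assumption: sustained growth above ~180 goroutines/hour is worth a look.
deriv(go_goroutine_count{service_name="tyk-gateway"}[6h]) > 0.05
```

A long window keeps short traffic bursts from dominating the fit; tune both numbers against a period you know was healthy.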
Troubleshooting goroutine growth:
| Issue | Possible Causes | Remediation |
| --- | --- | --- |
| Monotonically increasing goroutine count over hours | Goroutine leak in connection handler or background worker | Requires "enable_http_profiler": true in tyk.conf (off by default). Collect from the Control API port: curl http://gateway:<control_api_port>/debug/pprof/goroutine > goroutine.pprof. Share with Tyk support along with the trend chart |
| Goroutine count high relative to traffic | CPU saturation; too many goroutines contending for OS threads | Check go_processor_limit (GOMAXPROCS). Consider scaling horizontally |
| Goroutine spike correlated with config reload | Reload spawning goroutines before previous ones complete | Check tyk_gateway_config_reload_total rate. Avoid overlapping reloads |

Configuration State

Metrics:
| Metric name (OTel) | Prometheus name | Type | What it tells you |
| --- | --- | --- | --- |
| tyk.gateway.apis.loaded | tyk_gateway_apis_loaded | Gauge | API definitions currently loaded. Sudden drop = sync failure |
| tyk.gateway.policies.loaded | tyk_gateway_policies_loaded | Gauge | Policies currently loaded |
| tyk.gateway.config.reload | tyk_gateway_config_reload_total | Counter | Total config reloads since startup |
| tyk.gateway.config.reload.duration | tyk_gateway_config_reload_duration_seconds | Histogram | Time per reload. High values indicate large API counts |
What to watch for:
  • A sudden drop in tyk_gateway_apis_loaded (not a gradual decrease) typically means a failed sync from the Dashboard or an accidental bulk delete. Alert if the value drops by more than 10% in a single scrape interval.
  • A steady increase in tyk_gateway_config_reload_total at a rate faster than your deployment cadence suggests a reload loop; investigate what is triggering reloads.
  • Reload duration p95 growing over time suggests the API definition set is expanding and reload times need to be accounted for in SLOs.
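The reload-loop and reload-duration checks above could be queried along these lines. This is a sketch: the `service_name` selector follows the other health queries in this guide, and the `_bucket` series name assumes the histogram is exposed under standard Prometheus naming.

```promql
# Reloads during the last hour; compare against your expected deployment cadence
increase(tyk_gateway_config_reload_total{service_name="tyk-gateway"}[1h])

# p95 reload duration over the last hour
histogram_quantile(0.95,
  sum by (le) (rate(tyk_gateway_config_reload_duration_seconds_bucket{service_name="tyk-gateway"}[1h]))
)
```

Plotting the second query over several weeks makes gradual growth in reload time visible as the API definition set expands.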

Common Anti-Patterns

Auth Failure Surge

A sudden spike in AMF or AKI response flags means clients are failing authentication. Sustained rates above 0.1 req/s (6 per minute) warrant investigation.
Identify:
rate(http_server_request_duration_seconds_count{tyk_response_flag=~"AMF|AKI"}[5m])
In Loki:
{prefix="access-log"} | json | response_flag=~"AMF|AKI"
  | line_format "{{.time}} {{.response_flag}} ip={{.client_ip}} api={{.api_name}}"
Classify:
| Pattern | Likely cause | Action |
| --- | --- | --- |
| Single source IP, sustained after deployment | Credential rotation missed in deploy config | Contact consumer; verify keys still valid in Tyk Dashboard |
| Single source IP, random key attempts | Misconfigured integration (wrong env endpoint) | Contact consumer |
| Wide IP spread, varied keys | Credential stuffing or API scanning | Add IP allowlisting; tighten rate limits |
| All APIs, after Tyk Dashboard restart | Gateway missed key sync | Trigger manual key sync; check Dashboard connectivity |
AMF and AKI are identical from a consumer-impact perspective: both result in the request never reaching the upstream. The distinction matters for root cause: AMF means the client didn’t send a key at all; AKI means the client sent a key that Tyk cannot resolve.

Cardinality Overflow

The Gateway caps cardinality at 2,000 unique attribute combinations per instrument by default (see cardinality control). When this cap is reached, additional data points are recorded with otel.metric.overflow = true rather than creating new time series.
Detect:
# Any non-zero result means cardinality overflow is occurring
tyk_api_requests_total{otel_metric_overflow="true"}
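To see how close an instrument is to the cap before overflow begins, you can count its active series. This is a rough sketch: it assumes the `service_name` label used elsewhere in this guide, and it counts series across all Gateway instances, whereas the cap applies per instrument on each Gateway.

```promql
# Number of distinct series currently exported for this metric.
# Assumption: with several Gateway instances, add a "by (<instance label>)"
# clause using whichever label identifies an instance in your setup.
count(tyk_api_requests_total{service_name="tyk-gateway"})
```

A count that trends toward 2,000 on a single instance suggests reviewing dimensions before data starts landing in the overflow series.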
Troubleshoot:
| Issue | Possible Causes | Remediation |
| --- | --- | --- |
| Overflow on tyk_api_requests_total | More than 2,000 unique api_id × method × status_code combinations | Audit which APIs are generating unusual method/status combinations; consider raising cardinality_limit, but contact Tyk support before doing so in production |
| Overflow after adding custom dimensions | Custom dimension like client_ip or user_id is high-cardinality | Remove the high-cardinality dimension or scope it with a hash/prefix |
Cardinality overflow does not drop data. The aggregate counts are preserved in the overflow bucket. You will see correct totals but lose the ability to break the data down by the overflowing dimension combination.

Retry Storm

Client retries without exponential backoff amplify an upstream failure. RLT (429) responses trigger more retries, which hit rate limits, which trigger still more retries: a self-reinforcing loop that increases load on both the Gateway and the upstream.
Identify: a retry storm shows the RLT rate climbing while the overall request rate is also climbing:
# RLT rate climbing
rate(http_server_request_duration_seconds_count{tyk_response_flag="RLT"}[1m])

# If this is also climbing, clients are retrying
rate(tyk_api_requests_total[1m])
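The "both climbing at once" signature can also be checked in a single expression instead of by eye. The sketch below applies `deriv()` to a subquery of each rate and combines the results with `and`; the 10-minute lookback and 1-minute resolution are illustrative assumptions.

```promql
# Returns a result only while BOTH the RLT rate and the overall request rate
# have had a positive slope over the last 10 minutes.
(
  deriv(sum(rate(http_server_request_duration_seconds_count{tyk_response_flag="RLT"}[1m]))[10m:1m]) > 0
)
and
(
  deriv(sum(rate(tyk_api_requests_total[1m]))[10m:1m]) > 0
)
```

Both sides are aggregated with `sum()` so they share an empty label set, which is what lets `and` match them. A legitimate traffic ramp produces the same shape, so treat a match as a prompt to inspect logs rather than as proof of a retry storm.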
Troubleshoot:
| Issue | Possible Causes | Remediation |
| --- | --- | --- |
| RLT climbing with overall rate climbing | Clients retrying 429s without backoff | Identify consumer via logs: response_flag="RLT" \| line_format "key={{.api_key}} ip={{.client_ip}}". Advise consumer to implement exponential backoff with jitter |
| RLT starts immediately after upstream error (URS) | Upstream degradation triggers client retries which hit rate limits | Fix the upstream first. The rate limit is correctly protecting the backend during degradation |
| RLT on a single consumer, others unaffected | Single consumer batch job sending bursts | Work with consumer to spread requests or raise their rate limit ceiling |

Set Up Alerting

Prometheus Alertmanager handles alert routing. The Gateway emits Prometheus-compatible metrics via the OTel Collector; alert rules evaluate against those metrics.
Configure Prometheus to load alert rules:
# prometheus.yml
rule_files:
  - "tyk_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
Restart Prometheus after adding the rule file.
Recommended alert rules:
groups:
  - name: tyk-gateway
    rules:

      # Error rate > 10% over 5 minutes
      - alert: TykHighErrorRate
        expr: |
          (
            sum by (tyk_api_id) (rate(tyk_api_requests_total{http_response_status_code!~"2..", tyk_api_id="<api_id>"}[5m]))
            /
            sum by (tyk_api_id) (rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[5m]))
          ) > 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate > 10% for {{ $labels.tyk_api_id }}"

      # p95 latency > 1 second
      - alert: TykHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_server_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency > 1s for {{ $labels.tyk_api_id }}"

      # Auth failure surge (AMF or AKI)
      - alert: TykAuthFailureSurge
        expr: |
          rate(http_server_request_duration_seconds_count{tyk_response_flag=~"AMF|AKI"}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Auth failures > 6/min: client misconfiguration or credential attack"

      # Quota exhaustion
      - alert: TykQuotaExhaustion
        expr: |
          rate(http_server_request_duration_seconds_count{tyk_response_flag="QEX"}[5m]) > 0
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "Consumers hitting quota limits"

      # Rate limit rejections
      - alert: TykRateLimitRejections
        expr: |
          rate(http_server_request_duration_seconds_count{tyk_response_flag="RLT"}[1m]) > 0
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "Rate limit rejections (RLT)"

      # Upstream 5xx errors
      - alert: TykUpstreamErrors
        expr: |
          rate(http_server_request_duration_seconds_count{tyk_response_flag="URS"}[5m]) > 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Upstream returning 5xx for {{ $labels.tyk_api_id }}"

      # Goroutine growth
      - alert: TykGoroutineGrowth
        expr: go_goroutine_count{service_name="tyk-gateway"} > 5000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Goroutine count elevated, possible leak"

      # API count dropped suddenly
      - alert: TykApisLoadedDrop
        expr: |
          (tyk_gateway_apis_loaded - tyk_gateway_apis_loaded offset 2m)
            / tyk_gateway_apis_loaded offset 2m < -0.1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "API definition count dropped > 10%: possible sync failure"
Alert summary:
| Alert | Threshold | Severity | What it signals |
| --- | --- | --- | --- |
| TykHighErrorRate | > 10% non-2xx for 5 min | warning | API availability degraded |
| TykHighLatency | p95 > 1s for 5 min | warning | Slow responses |
| TykAuthFailureSurge | > 0.1/s for 2 min | warning | Credential problem or attack |
| TykQuotaExhaustion | Any QEX for 1 min | info | Consumer tier management needed |
| TykRateLimitRejections | Any RLT for 1 min | info | Consumer hitting rate limits |
| TykUpstreamErrors | Any URS for 3 min | critical | Backend degraded |
| TykGoroutineGrowth | > 5,000 for 10 min | warning | Possible goroutine leak |
| TykApisLoadedDrop | > 10% drop in 2 min | critical | Config sync failure |

PromQL Quick Reference

Replace <api_id> with your Tyk API definition ID.
## Error rates

# Overall non-2xx rate (as percentage)
(
  sum(rate(tyk_api_requests_total{http_response_status_code!~"2..", tyk_api_id="<api_id>"}[5m]))
  / sum(rate(tyk_api_requests_total{tyk_api_id="<api_id>"}[5m]))
) * 100

# Error breakdown by response flag
sum by (tyk_response_flag) (
  rate(http_server_request_duration_seconds_count{tyk_api_id="<api_id>"}[5m])
)

# Upstream error rate (URS only)
rate(http_server_request_duration_seconds_count{tyk_response_flag="URS", tyk_api_id="<api_id>"}[5m])

## Latency

# p50 / p95 / p99 end-to-end
histogram_quantile(0.50, rate(http_server_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m]))
histogram_quantile(0.95, rate(http_server_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m]))
histogram_quantile(0.99, rate(http_server_request_duration_seconds_bucket{tyk_api_id="<api_id>"}[5m]))

# Gateway-only average latency
rate(tyk_gateway_request_duration_seconds_sum{tyk_api_id="<api_id>"}[5m])
  / rate(tyk_gateway_request_duration_seconds_count{tyk_api_id="<api_id>"}[5m])

# Upstream average latency
rate(tyk_upstream_request_duration_seconds_sum{tyk_api_id="<api_id>"}[5m])
  / rate(tyk_upstream_request_duration_seconds_count{tyk_api_id="<api_id>"}[5m])

## Gateway health

# Goroutine count trend
go_goroutine_count{service_name="tyk-gateway"}

# Memory in use (go_memory_type="other" = heap-adjacent; use for leak detection)
go_memory_used_bytes{service_name="tyk-gateway", go_memory_type="other"}

# GC target heap size
go_memory_gc_goal_bytes{service_name="tyk-gateway"}

# Configured GOMEMLIMIT (denominator for utilization alert)
go_memory_limit_bytes{service_name="tyk-gateway"}

# APIs and policies loaded
tyk_gateway_apis_loaded{service_name="tyk-gateway"}
tyk_gateway_policies_loaded{service_name="tyk-gateway"}

## Consumer-level breakdown (requires custom api_metrics configuration)
# These metrics are NOT emitted by default. They must be defined in
# opentelemetry.metrics.api_metrics in tyk.conf.

# Example: requests by API key (last 6 chars), if configured with api_key session dimension
# sum by (api_key) (increase(tyk_requests_by_apikey_total{tyk_api_id="<api_id>"}[1h]))

# Example: 5xx errors by route, if configured with endpoint metadata dimension
# sum by (tyk_endpoint) (increase(tyk_requests_by_route_total{tyk_api_id="<api_id>", http_response_status_code=~"5.."}[1h]))

Next Steps

  1. Enable the OTel pipeline: if you haven’t yet, follow the setup instructions to enable native OTLP export and route signals to your observability backend.
  2. Try the reference Grafana dashboards: see the observability setup guide for a full modern observability stack (Loki, Grafana, Tempo, Prometheus) with pre-built panels for all the metrics in this guide.
  3. Configure Prometheus alerting: copy the alert rules from Set Up Alerting, replace <api_id> with your real API IDs, save as tyk_alerts.yml, and restart Prometheus.