Easy monitoring for Kubernetes Ingress traffic with Traefik
At the time I'm writing this post I've worked with Kubernetes for the past five years, and I've had time to use various Ingress Controllers: Nginx, Kong, Ambassador and Traefik.
The use cases I had were quite standard:
- Route traffic to different services under the same domain
- Rewrites
- Redirections
- IP whitelisting
- Split traffic based on weights
- Sticky sessions
In the end we settled on Traefik. The integration with Kubernetes was very nice even some years ago: annotations did all the work and just behaved as expected. We ran Traefik v1 for 3 years and we're finishing the migration to v2 right now, not without some issues I'll explain later.
We handle a large amount of traffic and, probably because of the industry I work in, attacks are something we need to address almost every day. But there are other things we have to deal with too:
- Bugs
- Changes in application behavior
- Dictionary attacks from bot nets
- Huge traffic spikes tied to seasonal events
- Marketing storm traffic
- Lack of application metrics
- Lack of business metrics
Monitoring the traffic under these circumstances is complex. Static limits are of little use: set them too low and they cause unnecessary noise and wake people up at night, set them too high and they miss important events.
Imagine a static limit of 2% on the global error rate of your site. Errors spread across non-critical bugs can push it to 2%, while a 100% error rate on the checkout endpoint may only represent 1.5% of the total and never trigger any alert.
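To make the numbers concrete: at 10,000 req/min site-wide, a checkout endpoint serving 150 req/min can fail completely and still only contribute 1.5% to the global ratio. Here is a minimal sketch of the two views as Prometheus recording rules, using the Traefik v1 backend metrics described later in this post (the rule names are hypothetical):

groups:
  - name: ingress-error-ratios
    rules:
      # Global error ratio: distributed, non-critical errors can sit just under a 2% limit
      - record: ingress:error_ratio:global
        expr: |
          sum(rate(traefik_backend_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(traefik_backend_requests_total[5m]))
      # Per-backend error ratio: a fully broken checkout backend jumps straight to 1 here
      - record: ingress:error_ratio:backend
        expr: |
          sum by (backend) (rate(traefik_backend_requests_total{code=~"5.."}[5m]))
            /
          sum by (backend) (rate(traefik_backend_requests_total[5m]))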
The plan
For these reasons (and others), every time a new application needs an ingress I follow the same plan:
- Monitor critical endpoints for error rates
- Set high static limits as warnings to detect critical things like attacks.
- Define Apdex scores to make sure response times stay within the toleration margin.
- Use anomaly detection functions like z-scores to detect traffic anomalies.
- Replace these static limits where required with the anomaly detection metrics.
1 – When managing legacy applications (or modern ones poorly executed from an observability point of view), traffic is where I usually start adding alerts. The lack of context at the beginning of the project (I'm usually not involved in the projects themselves) requires some conversations to understand the needs and identify the critical endpoints. For those critical endpoints, as a first iteration, I only set alerts for 5xx status codes. This only covers the critical cases, but it helps me catch regressions while I acquire knowledge about the domain I'm working in, or during development in the staging environment. In these cases a static limit works well enough.
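As a sketch of that first iteration, a static alert on 5xx responses for one critical backend could look like this; the backend value and the threshold are placeholders, not our real rules:

groups:
  - name: critical-endpoints
    rules:
      # First iteration: only 5xx on a critical endpoint, with a deliberately simple static limit
      - alert: CheckoutBackend5xxErrors
        expr: sum(rate(traefik_backend_requests_total{backend="www.dummy.com/checkout", code=~"5.."}[5m])) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 5xx responses on the checkout backend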
2 – Another common case I have is attacks, and bugs in SPAs (oh dear, SPAs) causing tons of unnecessary requests. I usually set a high static limit, but after many iterations I now have checks that detect these situations automatically, like a moving z-score averaging the last 3 weeks of traffic. Even with that, when an application goes live or to staging it will only trigger warnings, since there isn't enough data to back the prediction and it can generate too much noise; for this reason a static limit fits better on day one.
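A hedged sketch of that kind of moving z-score, built on a recording rule for the entrypoint request rate (rule names and thresholds are mine, not anything standard):

groups:
  - name: traffic-anomalies
    rules:
      # Pre-record the 5m request rate so the 3-week range functions below stay cheap
      - record: ingress:requests:rate5m
        expr: sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))
      # Alert when the current rate drifts more than 3 standard deviations away
      # from its own average over the last 3 weeks
      - alert: TrafficZScoreAnomaly
        expr: |
          abs(
            (ingress:requests:rate5m - avg_over_time(ingress:requests:rate5m[3w]))
              /
            stddev_over_time(ingress:requests:rate5m[3w])
          ) > 3
        for: 15m
        labels:
          severity: warning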
3 – Apdex is a calculation that gives your application a score based on user satisfaction (which fits very well in the industry I currently work in). By providing two margins, a toleration and a frustration one, it gives you a number between 0 and 1. If you want to dive deeper, which I recommend if you don't know it yet, check this post from New Relic. This metric is an indicator that something is happening in your platform and affecting users, but it will not tell you what; you need to combine it with other metrics. If this alert fires and no other one does, then you need to improve your alerting.
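For reference, with Traefik's request duration histogram and its default buckets, taking 0.3s as the toleration margin and 1.2s as the frustration margin, Apdex works out as "satisfied plus half of tolerating, over total". This is a sketch, not the exact rule we run:

groups:
  - name: apdex
    rules:
      # Apdex = (satisfied + tolerating / 2) / total
      # satisfied = requests under 0.3s, tolerating = requests between 0.3s and 1.2s
      - record: ingress:apdex:ratio5m
        expr: |
          (
            sum(rate(traefik_backend_request_duration_seconds_bucket{le="0.3"}[5m]))
              +
            sum(rate(traefik_backend_request_duration_seconds_bucket{le="1.2"}[5m]))
          ) / 2
            /
          sum(rate(traefik_backend_request_duration_seconds_count[5m]))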
4 – I already wrote a post about that, although it's just the slides from a presentation. The idea is to discover anomalies in a normal distribution. In the slides you can also find how to deal with seasonality. I should write a proper post about it; yeah, I will.
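Without reproducing the slides, one simple way to respect weekly seasonality is to build the baseline from the same moment in previous weeks instead of a flat 3-week average. A sketch on top of the recording rule above (again, the names are hypothetical):

groups:
  - name: traffic-anomalies-seasonal
    rules:
      # Seasonal baseline: average of the same instant in the three previous weeks;
      # the z-score alert above can then be computed against this baseline
      - record: ingress:requests:rate5m:weekly_baseline
        expr: |
          (
            ingress:requests:rate5m offset 1w
              +
            ingress:requests:rate5m offset 2w
              +
            ingress:requests:rate5m offset 3w
          ) / 3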
5 – After all of this you're probably already in production, understanding the needs and the context, and with enough data to generate some predictions. It's time to remove most of these static alerts, or tighten them, maybe decreasing the error rate from 5% to 2%, etc. Some static limits will always be there, but on ratios rather than raw values. You'll have a static limit for the error rate, but not for requests_total.
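As an example of "static limits on ratios, not on raw values", the per-backend error ratio sketched earlier can keep a fixed threshold while requests_total itself is left to the anomaly detection; the numbers here are placeholders:

groups:
  - name: error-ratio-limits
    rules:
      # Static limit on a ratio (2%); no static limit on requests_total itself
      - alert: BackendErrorRatioHigh
        expr: ingress:error_ratio:backend > 0.02
        for: 10m
        labels:
          severity: critical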
Traefik and The Migration Problem
We have used Traefik as the Ingress for ALL of our exposed apps since 2017. We have many sites managed by Traefik and it's the central point for Ingress metrics. This allows us to monitor all the sites at once without repeating ourselves. Let me give a real example:
- Video contest sites.
- One video released a week.
- People like it. Like a lot.
- Videos are released every Wednesday at 20:30. You can clearly see the seasonality here.
- Some videos will be more popular than others, but an increase in traffic is expected at that hour.
- Marketing people exist and have an impact.
With this in mind, let's see which metrics Traefik provides.
We're in the middle of the migration from 1.7 to 2.3 and the metrics differ, so let's see by how much.
Traefik v1
- backend_request_duration_*
- backend_requests_total
- backend_connections_open
- backend_server_up
- With labels {backend, code, method, protocol}
- entrypoint_open_connections
- entrypoint_request_duration_*
- entrypoint_requests_total
- With labels {entrypoint, code, method, protocol}
Traefik v2
- Same entrypoint metrics
- Backend metrics no longer exist; ciao to many Grafana charts.
- service_request_duration_*
- service_requests_total
- With labels {service, code, method, protocol}
Take this example: we use a Kubernetes Ingress to define traffic rules like this:
Hosts: www.dummy.com, www.hello.com
Paths:
- /join -> subscription-service
- /api -> api-gateway
- / -> frontend
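As a rough sketch, those rules could come from an Ingress manifest along these lines (the namespace, ports, pathType and annotation are assumptions to make the example complete):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dummy-site
  namespace: namespace                 # matches the "namespace-" prefix in the v2 metrics below
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
    - host: www.dummy.com
      http:
        paths:
          - path: /join
            pathType: Prefix
            backend:
              service:
                name: subscription-service
                port:
                  number: 8080         # ports are illustrative
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 3000
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend
                port:
                  number: 80
    # a second rule for www.hello.com repeats the same three paths against the same services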
Using Traefik v1 we have:
traefik_backend_requests_total {backend="www.dummy.com/", code, method, protocol}
traefik_backend_requests_total {backend="www.dummy.com/api", code, method, protocol}
traefik_backend_requests_total {backend="www.dummy.com/join", code, method, protocol}
traefik_backend_requests_total {backend="www.hello.com/", code, method, protocol}
traefik_backend_requests_total {backend="www.hello.com/api", code, method, protocol}
traefik_backend_requests_total {backend="www.hello.com/join", code, method, protocol}
Using Traefik v2:
traefik_service_requests_total {service="namespace-join-service-8080", code, method, protocol}
traefik_service_requests_total {service="namespace-api-gateway-3000", code, method, protocol}
traefik_service_requests_total {service="namespace-frontend-80", code, method, protocol}
Problem – Unable to monitor rule-to-service traffic
In v1, with backend metrics, it's true that you can't see from your metrics which service the requests go to, but at least you know WHERE they are coming from.
Only routes defined in the Ingress will appear in the metrics. Define the important ones, even if they go to the same service, so you have specific monitors for those endpoints.
In v2, with metrics by service, we lose the ability to monitor per rule. If join-service is receiving a huge amount of traffic you can't know where that traffic is coming from, dummy or hello, and you'll need to consult the access logs.
Losing the ability to monitor different rules is a big problem for us and it stopped the migration of some multi-sites. We delayed the migration when we saw this and waited a bit for v2 to mature. I first saw this issue in March and checked it from time to time, but saw no progress. We wanted to use some of the cool things middlewares provide, and we also wanted to migrate before entering 2021, for various reasons in our roadmap and because of the EOL of v1 in December 2021.
The solution
I recently decided to open a PR to add router metrics in the following format:
- router_connections_open
- router_request_duration_*
- router_requests_total
All of them with these labels:
- code
- method
- protocol
- router
- service
I included not only the router name but also the service name as labels. By doing this I can correlate an anomaly in a particular service with the router/rule that is generating it. See an example in this comment. And yes, cardinality increases, but it isn't dynamic: router and service are static once defined and don't change per request, unlike adding the path as a label, so there's no real problem in having these labels. Plus, you can disable the service metrics, since the service already exists as a dimension inside the router metrics.
But not only that: I can also generate a topology map of all my Ingresses that use Traefik 2 as the ingress class, because each metric contains both the rule and the destination. Let me tell you how.
In Grafana, install the Service Dependency Graph panel, follow the setup instructions and point the following queries at Prometheus (or Thanos, Cortex…):
#A Req/s
label_replace(
sum by (router, service, protocol) (rate(traefik_router_requests_total{service!~".*@internal"}[10m]))
,
"origin_external",
"$1",
"router",
"(.+)@.*"
)
#B Error rate
label_replace(
sum by (router, service, protocol) (rate(traefik_router_requests_total{service!~".*@internal", code=~"5.."}[10m]))
/
sum by (router, service, protocol) (increase(traefik_router_requests_total{service!~".*@internal"}[10m])),
"origin_external",
"$1",
"router",
"(.+)@.*"
)
#C Median response times
label_replace(
histogram_quantile(0.5, sum(rate(traefik_router_request_duration_seconds_bucket{service!~".*@internal"}[5m])) by (le, router, service, protocol)),
"origin_external",
"$1",
"router",
"(.+)@.*"
) * 100000 # <- Conversion factor
Do the mapping and you'll be able to see a graph like the one below, where every single Ingress rule with traffic appears with a relation to the service it routes to, displaying requests per second, average response time and error rate:


Every new Ingress will automatically appear in this chart, along with its average response time and error rate.
This feature is already running in our production stack and it's looking OK. I expect to see it merged and, who knows, maybe released in v2.4.
Notes
- Open source is also about the freedom to adapt something to your needs.
- Ingress controller monitoring is important, but it doesn't replace or try to substitute a service mesh. They are different things for different needs.
- By monitoring your services only with the ingress controller, you're not taking internal traffic into consideration.
- I'll cover the Prometheus rules I use to monitor this traffic in another post.