A tale of evolutionary applications, iterations, and performance (part 2)

Xabier Larrakoetxea
Published in Spotahome Product
Feb 14, 2019

This post is the 2nd part of a series. In part 1 we described the problem and the first solution we delivered. Let’s continue with the story…

2nd Age: 1st bottleneck

After 2 weeks, as more and more builds were stored on the service, we saw that the latency was growing a bit. Users were not affected, but we wanted to reduce the latency anyway to keep technical debt under control.

So we started improving the performance of the application. But where to start?

This research process stretched over ~2 months of spare time/cycles; in total it was around three days of work for one person.

We only knew that the latency appeared when we changed the storage from in-memory to Redis. We wanted to start small; at first sight our possible bottlenecks were:

  • Database
  • Application resources (CPU, Memory, IO)

Our first guess was Redis. After checking its stats and metrics, we concluded that Redis was fine: very low latency and resource consumption. The database wasn’t the bottleneck.

Then we started with the application resource usage.

These were the tests we made to isolate the bottleneck (some have been omitted because they were dead ends and not interesting):

  • Increased the number of instances on the cluster: it didn’t change the latency (Redis was discarded once again, because its operations per second increased without problems).
  • Increased the Redis connection pool in the app: performance improved slightly, but not enough, so we also discarded it.
  • Ran the application locally against the production database in read-only mode: much better performance, < 400ms p99.

At this point, based on the latest result, we concluded that the environment where the app was running was very likely the main problem, and not the Redis implementation. We discarded the network because we had other applications transferring more data under the same conditions. So… everything pointed to the CPU.

  • Increased CPU requests and limits: the performance improved! But… the real CPU usage didn’t increase.

At this point, we were onto something.

We had a guess about a hot path where concurrency was being used aggressively, without limits, to insert project builds into the database.

We tried different combinations, increasing and decreasing CPU requests/limits, operator workers, and concurrency limits.

For example, one of the combinations changed from 2 workers to 4 workers with a 200m CPU request.

But it was strange: although the application’s resources were increased and performance improved, CPU usage did not increase.

CPU usage

At this point we knew the CPU was the problem, so we profiled the application with pprof to check what was consuming the most CPU.

Example of profiling 30s of CPU usage:

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
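For this command to work, the service has to expose the pprof endpoints over HTTP. A minimal sketch of how a Go service typically wires this up, assuming the :6060 address used in the command above (the actual setup in our operator may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a dedicated local port so they never
	// interfere with the service's own API.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service would start here.
	select {}
}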

And we used Speedscope to load the samples as flame graphs (for us it is more visual than the google/pprof one).

Flame graph of CPU usage (the most CPU-consuming functions ordered on the left)
List of instructions consuming CPU (descending order)

As we can see in the pictures, most of the time was spent on JSON unmarshaling (23%). So we changed the JSON implementation: instead of using Go’s standard library, we tried the json-iterator/go library. We created a branch, deployed it with this small change, and the results were almost the same (very minimal improvement), so this wasn’t the bottleneck we were looking for.
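The swap itself is tiny because json-iterator/go exposes a drop-in compatible API. A minimal sketch of the kind of change we tried (the build struct here is only an illustrative type, not the service’s real model):

package main

import (
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// json behaves like the standard library's encoding/json, but uses the
// json-iterator implementation underneath.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

// build is a hypothetical example type, not the service's real model.
type build struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}

func main() {
	var b build
	// Unmarshal has the same signature as encoding/json's Unmarshal,
	// so existing call sites don't need to change.
	if err := json.Unmarshal([]byte(`{"id":"b1","status":"ok"}`), &b); err != nil {
		panic(err)
	}
	fmt.Println(b.ID, b.Status)
}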

Although the CPU profiling didn’t point to the bottleneck, it gave us the hint that the problem was outside the application. So… we checked how the CPU was behaving in Kubernetes: we started checking whether the CPU was being throttled by the Kubernetes CPU manager even though the CPU was not being fully consumed (strange behavior, but we checked).

We made a small Prometheus query on our metrics:

max(
  rate(container_cpu_cfs_throttled_seconds_total{
    pod_name=~"^ci-operator.*",
    container_name="app"
  }[5m])
)

The query showed us that the CI operator was being throttled (between 1s and 3s!).

We had found the current bottleneck. In the next image you can see the heatmap of CPU throttled seconds across all the Kubernetes cluster operators; it’s clear where the CI operator’s throttling is.

Heat maps of all operators CPU throttling

After researching a bit into the unnecessary throttling the CPU manager was applying to the CI operator, we read (here, here and here) that this is a common problem/bug in Kubernetes, and that increasing the CPU limits (or disabling them altogether) mitigates the spurious throttling. CPU limits are enforced as CFS quota per scheduling period, so a bursty workload can be throttled hard even when its average CPU usage looks well below the limit.

We increased the limits a lot (from 300m to 1200m):

The rate of storage operations (by kind) and their latency (with Redis and a 1200m CPU limit)
The latency of the API (with Redis and a 1200m CPU limit)

As can be seen in the images, there was a huge improvement… (from 2.5s to 900ms).

Good enough; we stopped here until the next performance-improvement iteration.

3rd Age: 2nd bottleneck

And here we are at this point (1st of February 2019): the latency had started to increase again, and although there was no impact on the user experience, the latency measurements were insane for one of the handlers: ~7s to get a project’s builds.

Knowing the problematic endpoint, and how its performance had been degrading over the weeks, we knew the problem was how the Redis implementation retrieved the builds of a project. Instead of doing it iteratively, we changed it to a concurrent approach using the bulkhead pattern and the goresilience library implementation (see the sketch below).
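The real change used the goresilience bulkhead runner; what follows is only a minimal standard-library sketch of the same idea, a concurrent fan-out whose parallelism is capped by a semaphore, where fetchBuild is a hypothetical stand-in for the per-build Redis call:

package main

import (
	"context"
	"fmt"
	"sync"
)

// fetchBuild is a hypothetical stand-in for the Redis call that loads one build.
func fetchBuild(ctx context.Context, id string) (string, error) {
	return "build-" + id, nil
}

// fetchBuilds retrieves all builds concurrently instead of iteratively, but a
// semaphore caps the number of in-flight calls (the bulkhead idea), so the
// handler can't use the CPU without limits.
func fetchBuilds(ctx context.Context, ids []string, maxConcurrency int) []string {
	sem := make(chan struct{}, maxConcurrency)
	results := make([]string, len(ids))

	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			if b, err := fetchBuild(ctx, id); err == nil {
				results[i] = b
			}
		}(i, id)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(fetchBuilds(context.Background(), []string{"1", "2", "3"}, 2))
}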

Latency before and after the concurrency change.
P99 latency (by handler) before and after the concurrency change.

This change reduced the latency of the slowest handler (from 7s spikes to 2.5s) but increased the latency of the others (from 100ms to 300ms).

P99 latency (by handler without project builds one) before and after the concurrency change.

It seemed that after adding concurrency the CPU usage increased and the throttling started again (remember our first bottleneck).

CI operator CPU throttling before and after adding concurrency

So we increased the limit from 1200m to 3000m and then 8000m, and the latencies went down further.

Latency before and after the concurrency and CPU change.

At this moment we had the service where we wanted it, with one day of work from one person. For now we were good; we’ll continue iterating when needed.

The story continues in the 3rd part.

Thank you!
