A tale of evolutionary applications, iterations, and performance (part 1)

Xabier Larrakoetxea · Published in Spotahome Product
6 min read · Feb 14, 2019


This is the story of an application called CI operator that has been with us for 4 months and was developed in the infrastructure team. In these four months the application has evolved in many aspects, and in this post we will tell how it was created, how it evolved, what decisions were taken and why those decisions were applied.

We will split this story into three posts: in part 1 we will see the initial design and creation of the application, in part 2 the bottlenecks we encountered, and in part 3 the current state and some conclusions of this trip.

First of all, we need to be on the same page and share some context.

In Spotahome almost all of our projects use Brigade as the pipeline/CI/CD system. Brigade is awesome because it gives us the ability to have dynamic pipelines (made in JS) with a native Kubernetes runtime, which we are already using intensively for most of our platform.

The engineering department (>80 people) has been migrating all of its projects from Concourse to Brigade throughout 2018, and to use it correctly the frontend-core team developed a web interface based on React, the BFF pattern, and GraphQL that calls the Brigade HTTP API to get the information.

The application for deploying and checking the builds in Spotahome

The problem

Like almost every application, this one is the solution to one or more problems, and the CI operator is no different: it solves a problem with Brigade. The information about builds and logs that the Brigade API returns is ephemeral; it gets this information from Kubernetes itself, so when a pod of a build is gone, its logs and information (exit code, id…) disappear. If we were talking about old builds this would not be a problem, but our clusters autoscale constantly and it's common to lose pod information in a small amount of time (in extreme cases, the build finishes and a few seconds later the information is gone).

To tackle this problem we created the CI operator. This application is a Kubernetes operator that stores the information of all the Brigade installations on the cluster (we have one Brigade installation per team, and we have ~11 teams) and exposes that information to the app that the frontend-core team developed.

The application has a simple design; it's based on 2 components:

  • The storer (operator): The operator itself, which gets all the information from the cluster (builds, logs…) and stores it in a database.
  • The HTTP API: An HTTP REST API exposed to the clients to get that information from the database, so they don't have to share a database, which is an antipattern.

In this picture, you can visualize better the solution:

The architecture of clients and CI operator

There are more details in the solution, but they are not required for this post.
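To make the two components a bit more concrete, here is a minimal sketch in Go of what the API side could look like; the types, routes and names are illustrative assumptions, not the actual CI operator code:

```go
package api

import (
	"encoding/json"
	"net/http"
	"strings"
)

// Build is an illustrative model of what the operator stores per build.
type Build struct {
	ID       string `json:"id"`
	Project  string `json:"project"`
	ExitCode int    `json:"exitCode"`
}

// BuildLister is the small read-only surface the HTTP API needs from the storer.
type BuildLister interface {
	ListBuilds(project string) ([]Build, error)
}

// BuildsHandler serves builds over REST so clients read from the API
// instead of sharing the database directly.
func BuildsHandler(store BuildLister) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Expected path: /projects/{project}/builds (routing kept deliberately simple).
		parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
		if len(parts) != 3 || parts[0] != "projects" || parts[2] != "builds" {
			http.NotFound(w, r)
			return
		}
		builds, err := store.ListBuilds(parts[1])
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(builds)
	}
}
```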

Team premises

In the Infrastructure team we try to follow some premises that help to understand the process we followed with the CI operator.

  • Think and design before writing a line of code.
  • Try following best practices while developing, without dogmatism (testing, documentation, PRs, SOLID…).
  • Each line of code is technical debt (the best code is no code).
  • Design and implement things as simple as possible.
  • Tomorrow you will have more information than today (postpone decisions until it is necessary).
  • Ship fast and iterate every day.
  • Observability is a requirement, not a feature.
  • Avoid overengineering and premature optimization.
  • Automate repetitive tasks and reduce toil every day.
  • Test in production and rollback fast.
  • Complexity will kill us.

In other words, we love simplicity and “boring” things.

History of the CI operator

Let’s start with the history of our application and why we took those decisions. I’ll split it into ages.

Age 0: Initial release

In October 2018 the Brigade ephemeral information problem arose and we started thinking about a solution.

The solution, as previously described, was the CI operator: a Kubernetes operator with an API that would store all the data, written in Go.

This was the definition of our storage (the simplest one we could imagine):
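The original definition isn't reproduced here, but a minimal storage abstraction of that kind could look something like this (a sketch; names and fields are assumptions, not the real code):

```go
package storage

// Build is an illustrative model; the real one holds whatever Brigade
// exposes about a build (exit code, id, status…).
type Build struct {
	ID       string
	Project  string
	ExitCode int
}

// Storage is a minimal sketch of the abstraction: the operator writes builds
// and logs, and the HTTP API reads them back. Note there are no listing
// options yet; those came later.
type Storage interface {
	SaveBuild(b Build) error
	GetBuild(id string) (Build, error)
	SaveLog(buildID string, log []byte) error
	GetLog(buildID string) ([]byte, error)
}
```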

Our first implementation was based on in-memory storage, without any listing options. Why did we make this decision?

  • We didn’t have information about the usage of the data to decide what kind of storage backend would be a good candidate (K/V, SQL, document, graph…).
  • Ship fast.
  • Partially solve the current problem (be better today than yesterday).
  • Check how our initial store design, model (business logic) and API were working, to validate the process.

We try to follow SOLID principles when we develop; this gives us the ability to postpone decisions easily, solve the immediate problem, create maintainable code and keep the door open to fast improvements.

We created a simple implementation of the memory storage based on maps.
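Such a map-based memory storage could look roughly like this (again a sketch reusing the illustrative types from above, with a mutex to guard concurrent access):

```go
package storage

import (
	"fmt"
	"sync"
)

// MemStorage is a sketch of an in-memory Storage implementation backed by maps.
// The data only lives as long as the process, which was an accepted trade-off.
type MemStorage struct {
	mu     sync.RWMutex
	builds map[string]Build
	logs   map[string][]byte
}

func NewMemStorage() *MemStorage {
	return &MemStorage{
		builds: map[string]Build{},
		logs:   map[string][]byte{},
	}
}

func (m *MemStorage) SaveBuild(b Build) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.builds[b.ID] = b
	return nil
}

func (m *MemStorage) GetBuild(id string) (Build, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	b, ok := m.builds[id]
	if !ok {
		return Build{}, fmt.Errorf("build %s not found", id)
	}
	return b, nil
}

// SaveLog and GetLog follow the same pattern on the logs map.
```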

On the 10th of October 2018 we deployed this first implementation, and the results were very good. There was only one drawback: the data was also ephemeral. We knew that, but it was less ephemeral than Brigade's. People were happier, and it was better than before, after just 3 days of work (including design) by 1.5 people.

Our applications try to ship with metrics from day one and this one was no exception. Based on the metrics that we recorded, we had this data:

The rate of operations on the storage by kind, and their latency (in memory)
The latency of the API (in memory)

The results seemed good.
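For reference, this kind of storage instrumentation can be done with a thin decorator around the store. Here is a hedged sketch using the Prometheus Go client; the metric and type names are made up, not the operator's actual metrics code:

```go
package storage

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// storageOpDuration records the latency of every storage operation, labeled by
// operation and success, which is roughly what graphs like the ones above are built from.
var storageOpDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "ci_operator",
	Subsystem: "storage",
	Name:      "operation_duration_seconds",
	Help:      "Duration of storage operations.",
}, []string{"operation", "success"})

func init() {
	prometheus.MustRegister(storageOpDuration)
}

// Measured decorates another Storage and records metrics for every call.
type Measured struct {
	Next Storage
}

func (s Measured) SaveBuild(b Build) error {
	start := time.Now()
	err := s.Next.SaveBuild(b)
	storageOpDuration.
		WithLabelValues("save_build", strconv.FormatBool(err == nil)).
		Observe(time.Since(start).Seconds())
	return err
}

// GetBuild, SaveLog and GetLog would be wrapped in exactly the same way.
```

Wrapping the store like this keeps the business logic free of metrics code and makes it trivial to measure any new storage implementation.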

1st Age: persistent storage

After a week of seeing the application working, we validated that the design at that moment was more than enough, the HTTP API was more than enough, and the users' requirements were more than met. Except… that we wanted to persist the data.

We only needed to design the storage part; all the other parts of the application were working well in production. Based on the collected data and the type of HTTP API we had designed, we made the decision to go with Redis because:

  • Easy to use and maintain.
  • Flexible, with lots of data structures.
  • Fast.
  • Mature and tested in many environments and by many companies (us included).
  • Possible use of expiration to remove old data easily in the future, once we knew the growth rate.
  • Losing everything if something very bad happens is not a problem.

With one person, one day and ~300 LoC, it was ready and shipped by the end of the day. We took the simplest approach we had for Redis: use sets to store the builds of each project, and use simple keys to store the build data serialized in JSON and the logs in raw form.
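As a hedged sketch of that approach using the go-redis client (the key layout and names are assumptions, not the real implementation):

```go
package storage

import (
	"encoding/json"
	"fmt"

	"github.com/go-redis/redis/v7"
)

// RedisStorage sketches the second iteration: a Redis set per project holding
// its build IDs, plus one key per build (JSON) and one per build log (raw).
type RedisStorage struct {
	cli *redis.Client
}

func (r *RedisStorage) SaveBuild(b Build) error {
	data, err := json.Marshal(b)
	if err != nil {
		return err
	}
	// Track the build ID in its project's set so builds can be listed per project.
	if err := r.cli.SAdd(fmt.Sprintf("project:%s:builds", b.Project), b.ID).Err(); err != nil {
		return err
	}
	// Store the serialized build under a simple key (no expiration for now).
	return r.cli.Set(fmt.Sprintf("build:%s", b.ID), data, 0).Err()
}

func (r *RedisStorage) GetBuild(id string) (Build, error) {
	data, err := r.cli.Get(fmt.Sprintf("build:%s", id)).Bytes()
	if err != nil {
		return Build{}, err
	}
	var b Build
	err = json.Unmarshal(data, &b)
	return b, err
}

// Logs would be stored raw under a similar key, e.g. build:<id>:logs.
```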

And we shipped it on the 24th of October…

The rate of operations on the storage by kind, and their latency (with Redis)
The latency of the API (with Redis)

We expected a latency increase, but we didn’t understand why the latency was so high (2s p99).

Luckily, the web UI that uses data from this application was working OK and the users didn't have a bad experience using the service. So, although we knew that it wasn't an ideal latency, it satisfied our needs:

  • Users' experience was not affected.
  • Users use it and it is a valid solution for them.

Are you wondering why, with this latency, the users were not affected?

The web that the users use relies on continuous polling and updates the data dynamically after the first load, giving the user the false sensation of real-time updates. Eventually everything gets updated correctly, and the user doesn't need millisecond real-time updates for this kind of application (what's the difference between showing that the build started at T0 or at T0 + 4s? For us, nothing), so the huge latency increase was not affecting the user experience.

For now we were good; let's continue iterating when needed.

The story continues on the 2nd part.

Thank you!
