Incident report 4th Oct. 2021

Incident report on the slowdowns and crash of the 4th Oct. 2021

Written by Félix Vercouter


Summary

On Monday, October 4th, 2021, from 14:25 (UTC+2, Paris), the Dokeos infrastructure started to experience slowdowns. Some requests resulted in a 5xx error because response times were too long. Dokeos detected the problem immediately and the technical team stopped development work to analyze the monitoring tools, run network tests, and collect as much key data as possible from the servers. Customers also contacted us to report the same issue.

The technical team consulted its three monitoring tools: AppSignal, Sentry, and the hosting provider's (Amazon) monitoring. Here is what they found:

  • The servers are not overloaded (neither CPU nor memory);

  • The asynchronous job server (Sidekiq) is accumulating jobs without being able to absorb them;

  • The Redis cache tool on the database is saturated at 100% of its capacity;

  • Access to the database is degraded because Redis can no longer absorb the requests.

While these last two items are the direct causes of the slowdowns and outages, the increase in database requests started a little earlier. Our "log in/log out" implementation is extremely demanding: each learner access creates a database record, which puts additional load on the database. In our monitoring tool, we saw that several large clients had requested large data exports at the same time, and this coincidence pushed our tools to saturation.

Our infrastructure is designed to be "elastic" and "adaptive" to the needs of our application. A first adjustment was therefore made, bringing the portals back online. This was followed by a new wave of slowdowns as all the systems came back online and tried to absorb the pending requests, a peak that again exceeded the maximum threshold. By 5:45 pm, stability returned as the backlog of requests was absorbed and the number of connections on the platform decreased.


Cause

Log-in/log-out logging

The ANDPC (Agence Nationale du Développement Professionnel Continu) announced on August 5, 2021 to all its training organizations that it would henceforth monitor the login and logout dates of any learner who takes a training course. As this had never been requested or required before, we urgently developed a tool that captures all entries and exits on the training modules of our LMS. This new feature generates one line per connection and per user, which amounts to a few thousand lines per minute. We immediately identified the technical challenge and the bandwidth impact of this new data flow. An upgrade of the database was carried out, and dedicated monitoring showed that everything was going well and that these requests were being processed in parallel with the other jobs on the asynchronous server.
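To illustrate the pattern, here is a minimal sketch of such per-event logging, assuming Sidekiq and ActiveRecord; the class and column names are hypothetical, not our actual code:

```ruby
# Minimal sketch of per-event session logging (illustrative names only).
class LearnerSessionLogJob
  include Sidekiq::Worker

  # One job per login/logout event, i.e. one database row per learner access.
  def perform(learner_id, course_id, event, occurred_at)
    SessionLog.create!(
      learner_id:  learner_id,
      course_id:   course_id,
      event:       event,        # "login" or "logout"
      occurred_at: occurred_at
    )
  end
end

# Enqueued on every entry to or exit from a training module, e.g.:
# LearnerSessionLogJob.perform_async(learner.id, course.id, "login", Time.now.utc.iso8601)
```

Each event is cheap on its own; the load comes from the sheer number of events written to the database every minute.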

Heavy exports

Following the surge of export requests under the new DPC regulation in August, we also made the strategic decision to hand large exports over to our asynchronous server. Large exports had been causing 500 errors because the loading time was too long for browsers. This solution keeps navigation on the portals smooth and compiles the data "in parallel". Since the creation of the Regulatory Report and Detailed Report statistical features, these exports have been compiled as one job per line (one learner for one training course). This takes up a lot of space on the asynchronous server: hundreds or even thousands of rows are generated, and although each row is processed in a few milliseconds, even such fast processing cannot keep up with hundreds of large exports requested at the same time.
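As a hedged illustration of this "one job = one line" pattern (the class names below are hypothetical), a single large report fans out into thousands of tiny jobs:

```ruby
# Illustrative only: one Sidekiq job per learner/course line of a report.
class ExportRowJob
  include Sidekiq::Worker

  def perform(export_id, enrollment_id)
    enrollment = Enrollment.find(enrollment_id)
    ExportRow.create!(export_id: export_id, data: enrollment.report_attributes)
  end
end

# The report request itself fans out into one job per line, so a few hundred
# simultaneous exports can flood the queue with tens of thousands of jobs.
class FanOutExportJob
  include Sidekiq::Worker

  def perform(export_id)
    export = Export.find(export_id)
    export.enrollments.find_each do |enrollment|
      ExportRowJob.perform_async(export_id, enrollment.id)
    end
  end
end
```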

Conclusion

The combination of our new logging development and the growing demand for exports saturated the asynchronous server and, more specifically, its cache memory. This memory handles display and short-term processing and is a necessary staging step for jobs before they reach and are processed by Sidekiq (the asynchronous server). With our Redis and our Sidekiq both at saturation point, the application suffered significant slowdowns and experienced three periods of inaccessibility of a few minutes each. These two tools, the real lungs of the SaaS solution, are now our number one priority for the next two weeks.


Weaknesses

The main factors that led Dokeos to this situation are:

  • Significant increase in the number of requests on Sidekiq;

  • Saturation of the Redis cache;

  • A single database for writing and reading;


Solution

Analysis during the incident

We methodically shut down or throttled the suspect services:

  • Stopped Sidekiq jobs;

  • Reduced the number of interconnections on Rails;

  • Reduced the number of web instances on the server;

  • Analyzed resource consumption between Sidekiq and Puma;

  • Monitored the flows on Redis (see the diagnostic sketch below);
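For reference, these checks rely on Sidekiq's built-in API and standard Redis tooling; the sketch below shows the kind of commands involved (queue names are illustrative):

```ruby
# Inspecting queue pressure with Sidekiq's own API (sidekiq/api ships with the gem).
require "sidekiq/api"

stats = Sidekiq::Stats.new
puts "enqueued jobs:   #{stats.enqueued}"
puts "scheduled jobs:  #{stats.scheduled_size}"
puts "default latency: #{Sidekiq::Queue.new("default").latency.round(1)}s"

# On the Redis side, standard commands show memory pressure and throughput:
#   redis-cli INFO memory   # used_memory vs maxmemory
#   redis-cli --stat        # operations per second, connected clients
```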

Corrective measures

Development

  • Creation of a single job for statistical exports, rather than extracting line by line and then building a file: in test phase, deployment on October 6 or 7, 2021 (sketched below);

  • Creation of a database replica to split reading and writing of data: scheduled for October 15, 2021 at the latest (see the configuration example below);
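To make the first item concrete, here is a hedged sketch (class and column names are hypothetical) of compiling an entire export in a single job instead of one job per line:

```ruby
# Sketch of the corrective approach: one Sidekiq job builds the whole file.
require "csv"

class CompileExportJob
  include Sidekiq::Worker
  sidekiq_options queue: "exports"

  def perform(export_id)
    export = Export.find(export_id)
    csv = CSV.generate do |out|
      out << ["learner", "course", "logged_in_at", "logged_out_at"]
      export.enrollments.find_each do |e|
        out << [e.learner_name, e.course_title, e.logged_in_at, e.logged_out_at]
      end
    end
    export.update!(file: csv, status: "ready")
  end
end
```

For the second item, splitting reads and writes can lean on Rails' built-in multiple-databases support (Rails 6+); the snippet below is a generic example of that mechanism, not our exact configuration:

```ruby
# config/database.yml declares a primary and a read replica (hosts are placeholders):
#
#   production:
#     primary:
#       <<: *default
#       host: primary-db.internal
#     primary_replica:
#       <<: *default
#       host: replica-db.internal
#       replica: true

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Writes go to the primary; reads can be served by the replica.
  connects_to database: { writing: :primary, reading: :primary_replica }
end
```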

Infrastructure

  • Increase in database size and bandwidth: done;


Conclusion

The Dokeos team sincerely apologizes for the inconvenience caused on Monday, October 4, 2021. Any system can be improved; this incident proves it and highlights the areas of improvement for the stability and performance of our LMS. The corrective actions described above are already being implemented as you read this.
