Statistics dashboard performance: from tortoise to hare
Written by Daniël Niemeijer on 14th July 2017
In my previous blog post, I wrote about creating a statistics dashboard for our VoIPGRID partners. Now that the frontend is finished, it's time to take a look at how things work behind the scenes.
Because the dashboard has to load quickly and feel snappy, performance matters. There's a small problem though: the VoIPGRID platform processes around a million calls each week, which means the database tables containing call and billing information are huge. The queries executed against these tables therefore have to be really fast if the dashboard is to calculate all the statistics in real time. With around 15 statistics to calculate, this sounded nearly impossible when I started the project.
After the first couple of queries were written, Marco tried them out on a copy of the production database. One query returned data after 40 seconds; the other blocked the database entirely.
Real time is not an option
After realizing that these queries were never going to work in real time, we had to come up with a solution. Together with Marco, I thought about the problem of slow queries and the performance of the dashboard.
The solution we came up with was to create models in which the statistics are stored. The models are created and updated each night, so the queries do not interfere too much with traffic on the platform. Two models were created: a monthly time series model and a daily time series model.
Time series models
The monthly time series model holds all of the statistics needed in the dashboard, which makes it possible to load them with a single query. The daily time series model was created with the idea that it can be used in the future to provide more detailed information.
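To give an idea of what such a model boils down to, here is a minimal sketch of the monthly time series table using plain SQLite; the table and column names are my own assumptions for illustration, not the real schema.

```python
import sqlite3

# In-memory database standing in for the real one; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE monthly_statistics (
        partner_id   INTEGER NOT NULL,
        month        TEXT    NOT NULL,  -- e.g. '2017-06'
        total_calls  INTEGER NOT NULL,
        revenue      REAL    NOT NULL,
        -- ...the remaining statistics would be columns here...
        UNIQUE (partner_id, month)      -- one row per partner per month
    )
""")
conn.execute(
    "INSERT INTO monthly_statistics VALUES (?, ?, ?, ?)",
    (1, "2017-06", 120000, 4321.5),
)

# The dashboard can then fetch everything for a partner with a single query.
row = conn.execute(
    "SELECT total_calls, revenue FROM monthly_statistics "
    "WHERE partner_id = ? AND month = ?",
    (1, "2017-06"),
).fetchone()
print(row)  # (120000, 4321.5)
```

The nightly job pays the cost of the heavy queries once; the dashboard itself only ever reads these pre-aggregated rows.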
The names of the models, of course, imply that it would have made more sense to use an actual time series database. And yes, it would. However, that meant the infrastructure team would have to provide a production-ready time series database, which was not possible at the time. So maybe in the future, the switch to a time series database like InfluxDB will be made.
Testing and optimizing the queries
As I stated before, some of the queries were not performing well and needed optimization, even though the statistics would be calculated overnight. To test the queries and find bottlenecks, my project was configured to use a database containing production data with read-only access. To make performance measuring a bit easier, I created a dry-run option for the management command: with it, nothing is saved to the database, but everything is calculated and logged to the terminal.
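The dry-run idea can be sketched with plain argparse; the real code lives in a Django management command (via add_arguments), and the function and flag names here are hypothetical.

```python
import argparse

def calculate_statistics():
    # Stand-in for the real per-partner statistics queries.
    return {"total_calls": 120000, "revenue": 4321.5}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Recalculate dashboard statistics")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="calculate and log everything, but do not write to the database",
    )
    args = parser.parse_args(argv)

    stats = calculate_statistics()
    if args.dry_run:
        # Only log the results, so the queries can be timed safely
        # against a database we may only read from.
        print(f"[dry-run] would save: {stats}")
    else:
        print(f"saving: {stats}")  # the real command persists the models here
    return stats

if __name__ == "__main__":
    main(["--dry-run"])
```

Because the flag skips every write, the command can be pointed at the read-only production copy and timed without side effects.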
While timing the individual queries, a bottleneck was soon found: calculating a top 10 client list based on revenue grew exponentially slower as a partner had more clients. This happened because of the value being ordered on: I had used the name of a client instead of the ID. After implementing this fix, the query took about 10 seconds instead of roughly 110 seconds for a large partner.
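The difference between the two orderings can be illustrated in miniature with SQLite (table names and data are made up). The slow shape groups and orders on the client name, which drags the text column through the aggregation; the faster shape aggregates on the integer client_id and only looks up the names for the ten winners.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE call (client_id INTEGER, revenue REAL);
    INSERT INTO client VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO call VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Slow shape: group and order on the client name, so the join and the
# text comparisons happen inside the aggregation.
slow = conn.execute("""
    SELECT client.name, SUM(call.revenue) AS total
    FROM call JOIN client ON client.id = call.client_id
    GROUP BY client.name
    ORDER BY total DESC
    LIMIT 10
""").fetchall()

# Faster shape: aggregate on the integer client_id first, then join
# the names for the (at most) ten remaining rows.
fast = conn.execute("""
    SELECT client.name, top.total
    FROM (
        SELECT client_id, SUM(revenue) AS total
        FROM call
        GROUP BY client_id
        ORDER BY total DESC
        LIMIT 10
    ) AS top
    JOIN client ON client.id = top.client_id
    ORDER BY top.total DESC
""").fetchall()

print(fast)  # [('Acme', 15.0), ('Globex', 7.5)]
```

On a toy dataset both return the same result; the cost difference only shows up with many clients and calls, as it did on the production copy.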
At this point, the part slowing the query down was the GROUP BY in the SQL statement, which is needed to sum the revenue for each client. One way around it could be to do the grouping in Python instead. After implementing and timing that, however, it turned out to be a lot slower: the method took about 350 seconds. So I stuck with the 10-second query, as this cost only occurs for very large partners; on average the function takes about half a second for other partners.
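The two approaches that were compared can be sketched side by side with made-up data; in production the database-side GROUP BY won by a wide margin (about 10 s versus about 350 s), largely because grouping in Python means shipping every row out of the database first.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE call (client_id INTEGER, revenue REAL);
    INSERT INTO call VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Approach 1: let the database group and sum (the ~10 s version).
in_sql = dict(conn.execute(
    "SELECT client_id, SUM(revenue) FROM call GROUP BY client_id"
).fetchall())

# Approach 2: fetch every row and group in Python
# (the ~350 s version in production).
in_python = defaultdict(float)
for client_id, revenue in conn.execute("SELECT client_id, revenue FROM call"):
    in_python[client_id] += revenue

print(in_sql)           # {1: 15.0, 2: 7.5}
print(dict(in_python))  # {1: 15.0, 2: 7.5}
```

Both produce identical sums; the Python version just pays for transferring and iterating over every call row, which is exactly what made it slower at scale.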
Is it fast?
After optimizing as many slow parts as possible, it now takes about 6 minutes to calculate the statistics for a month. Calculating the daily statistics is slower, mainly because of a method that needs to figure out which country a phone call was made to. A dry-run of the daily calculations took up to half an hour. This, I think, is fairly acceptable, given that it means all of the statistics can be calculated each night in under 45 minutes.