How we established monitoring for our product health
In the past, we introduced monitoring and alerting that track the technical status of our platform and services. However, we sometimes experience issues that this monitoring cannot detect. These can be issues that are independent of any single team, cross-team issues, or even issues that do not have a technical cause. For example, two alerts in different teams might not appear urgent on their own, but combined they have an impact. To catch issues like these, we have started working on Product Health monitoring.
Even more interesting: Have our latest product improvements succeeded? Are they as well received by our users as we expected when we started implementing them? An example could be the latest improvement to the “Review Manager”, a single-page application. Is it working in all the different browser versions, and is it improving the learning experience of our users?
That’s why we started working on Product Health Monitoring, with which we want to focus on the non-technical aspects of our products. Initially, we set up some simple use cases like “Number of Review Sessions”. We didn’t want to only display numbers. We were aiming for easy-to-digest, dynamic graphs - nothing loses a team’s interest faster than graphs that do not move, or move only slowly. The graphs were to be shown on huge screens and attract passers-by, who would ask questions if something looked odd to them, such as a dropping curve or huge spikes. The time windows are defined differently per team. For some teams, it is sufficient to look at a graph covering one week; others want to observe theirs daily or even more often. All of the graphs are drawn in near real time and allow us to visually detect anomalies or deviations within minutes. This enables us as Technical Product Managers to get back to the teams, correlate specific events with releases, and evaluate observed anomalies. Are they intended? Are they just short-term (as with data migrations)? Should a feature be rolled back to allow for a thorough analysis of the data? In some cases, we also discovered incidents or outages of third-party systems even before they notified us (if they did at all).
Let’s get back to our example of “Number of Review Sessions”: In the graph shown below everything looks good. The current numbers of the Review Manager (blue) are above last week’s numbers (gray), and the delta (green) also indicates higher usage. So the most recent feature improvement and release seem to work fine and to be accepted by our users.
From a technical point of view, we have decided to use Kibana, and mainly its visualisation feature “Timelion”. This feature allows us not only to draw a visually appealing graph based on Elasticsearch queries but also to combine different queries in one graph! So we can use various data in one graph and compare them. In some cases it is as simple as comparing the current data with the same data from a week ago. This allows us to detect deviations that may indicate something is broken, because we assume (depending on the time of day and seasonality, of course) that this week’s users are as active as last week’s. For example, we would expect users to review their vocabulary this week as often as they did last week.
We came up with a number of visualisation conventions that make the graphs easy to read, not just for us but for anyone. Even stakeholders who are not involved in the nitty-gritty details can look at the graphs and understand them more or less immediately without reading documentation or labels.
The code snippet below shows an example that we use for comparing the data from today with the data from a week ago. You may ask why we calculate a moving average (mvavg) for the data from a week ago. That’s because the baseline does not need to show outliers (up or down) - we’re only interested in the general level. Additionally, it makes the graphs easier to read.
.es('<your kibana query>').color("#1a75ff").lines(width=2).label("This week"),
.es('<your kibana query>', offset=-1w).lines(fill=0.5,width=0.5).color(gray).mvavg(window=10).label("Last Week"),
.es('<your kibana query>').subtract(.es('<your kibana query>', offset=-1w)).bars(width=1).color(green).mvavg(window=10).label("Delta")
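To make the arithmetic behind the three series easier to follow, here is a minimal Python sketch of what the expression computes conceptually. It is not how Timelion is implemented; the function names are hypothetical, and we assume a trailing window for the moving average and two series that are already aligned bucket by bucket.

```python
from typing import List, Optional


def moving_average(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing moving average (assumption: trailing window).

    Returns None for positions where a full window is not yet available.
    """
    result: List[Optional[float]] = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        result.append(running / window if i >= window - 1 else None)
    return result


def weekly_delta(this_week: List[float], last_week: List[float]) -> List[float]:
    """Bucket-by-bucket difference: this week minus the same bucket last week."""
    return [current - previous for current, previous in zip(this_week, last_week)]


# Made-up sample counts per time bucket, for illustration only.
this_week = [10, 12, 14, 16]
last_week = [8, 15, 12, 12]

baseline = moving_average(last_week, window=2)  # smoothed "Last Week" line
delta = weekly_delta(this_week, last_week)      # raw "Delta" bars
```

A positive delta means more activity than in the same bucket a week earlier; smoothing the baseline hides single outliers so that the general level stays readable.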
As the next step, we are going to enable alerting as well, so that possible issues are not only detected visually by a person but the teams or stakeholders are also informed immediately. At the time we set this up, there was no alerting on AWS Kibana - but now there is 😉 We plan to include these alerts in our established alerting process using PagerDuty instead of introducing a new process.
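As an illustration of what such an alert rule could look like, the following Python sketch flags a metric when it drops too far below last week's baseline. This is an assumption about a possible rule, not our actual setup: the 30% threshold, the function names, and the metric name are hypothetical. The event dictionary is modeled on the field names of the PagerDuty Events API v2 and should be checked against the current PagerDuty documentation before use; the sketch only builds the event and does not send it.

```python
from typing import Any, Dict, Optional


def drop_ratio(current: float, baseline: float) -> Optional[float]:
    """Relative drop of the current value against the baseline.

    Returns None when there is no usable baseline to compare against.
    """
    if baseline <= 0:
        return None
    return (baseline - current) / baseline


def should_alert(current: float, baseline: float, max_drop: float = 0.3) -> bool:
    """True if the metric fell more than max_drop (hypothetical 30%) below baseline."""
    ratio = drop_ratio(current, baseline)
    return ratio is not None and ratio > max_drop


def build_event(routing_key: str, metric: str,
                current: float, baseline: float) -> Dict[str, Any]:
    """Build a trigger event (field names assumed from PagerDuty Events API v2)."""
    return {
        "routing_key": routing_key,  # placeholder, supplied by the caller
        "event_action": "trigger",
        "payload": {
            "summary": f"{metric} dropped below last week's baseline",
            "source": "product-health-monitoring",  # hypothetical source name
            "severity": "warning",
            "custom_details": {"current": current, "baseline": baseline},
        },
    }


# Example: 60 review sessions now versus a baseline of 100 a week ago.
if should_alert(current=60, baseline=100):
    event = build_event("<your routing key>", "Number of Review Sessions", 60, 100)
```

Comparing against a relative drop rather than an absolute number keeps the same rule usable for metrics of very different sizes, although seasonality and time of day would still need to be taken into account.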