A short foreword

In the middle of 2018 we signed a new contract with a huge client. The contract was of a new kind for us: we would have to provide paid access to one of our services (a node.js + express based REST API) under strict Service Level Agreement (hereinafter SLA) conditions. Since this contract would double our revenue and was something of a “life changer” for us, the first thing we thought was “now we need proper monitoring, we can’t f*ck this up, not now”. That’s how the story begins.

Diving into the subject

We’d never seriously dealt with monitoring REST APIs before, so we started researching the subject.

So, how can we monitor our REST API? It turned out there were (and still are) three main approaches to do this, plus one back-door option:

  1. To develop a custom monitoring solution, e.g. using ElasticSearch and Kibana or Grafana
  2. To use a tool that pings your API from the outside and checks its health by monitoring response times and codes
  3. To use a conventional APM tool like New Relic
  4. For those who prefer coming through the back door: to use an analytics tool like Google Analytics or Mixpanel

Using a front-end analytics tool on the back end doesn’t look like a good idea at all, so we decided to drop the 4th approach.

Let’s figure out what we need to monitor in the first place; I’ll base this on our SLA terms.

Here are the key questions we need answers to:

  • Is our app responding to our clients’ calls within the response times required by the SLA?
  • What response codes does it return besides the ones allowed by the SLA?
  • How many calls is our app receiving from the clients? We need a monthly call count per client for billing purposes.

The obvious but worth mentioning point is that we have to be able to set up alerts on these metrics.
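
In express terms, the raw data behind all three questions can be captured per request with a tiny middleware. Below is a rough sketch; the “x-client-id” header and the metric field names are made up for illustration, not something our SLA or any of the tools below prescribe:

const express = require('express');
const app = express();

// Rough sketch of per-request metric capture.
app.use((req, res, next) => {
    const startedAt = process.hrtime();

    res.on('finish', () => {
        const [s, ns] = process.hrtime(startedAt);
        const metric = {
            timestamp: new Date().toISOString(),
            client: req.get('x-client-id') || 'unknown', // who called us - for the monthly billing count
            route: req.path,
            statusCode: res.statusCode,                  // does it stay within the codes allowed by the SLA?
            durationMs: s * 1000 + ns / 1e6              // does it stay within the SLA response time?
        };
        // hand the metric over to whatever storage/alerting backend you end up choosing
        console.log(metric);
    });

    next();
});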

Custom monitoring solution

Since we’re a team of developers, the first idea was “let’s develop it ourselves, how hard can it be?”. So right after the contract I mentioned in the foreword was signed, we quickly developed a self-hosted monitoring tool based on ElasticSearch and Grafana.

Here is the dashboard we created (please don’t let the Russian labels scare you, that wasn’t intentional):

grafana

At first glance this solution looked like exactly what we needed:

  • it notified us about slow responses and incorrect response codes
  • it displayed all the info we needed, such as the number of calls, request rate, duration, etc.
  • we could display almost any data and set up any triggers for alerts with ElasticSearch and Grafana

However, we had to invent everything from scratch, so we had to find answers to the following questions:

  • how to properly collect the metrics we need from our REST API?
  • how to develop a resilient middleware that would never crash the host app?
  • how to build a data model that lets us monitor all the metrics we need?
  • how to store the data - in ElasticSearch, InfluxDB or maybe MSSQL?
  • and many others

You can see that there’s plenty of room to mess the whole thing up by missing a tiny but crucial point. For example, a small error in metrics collection can lead to huge trouble - unfortunately, we’ve had that experience ourselves, and we’ll publish an article about it a bit later.
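
One way to approach the “never crash the host app” question is to isolate the collector completely: whatever goes wrong while storing a metric must never bubble up into the API itself. Here is a minimal sketch of that idea, assuming the official @elastic/elasticsearch Node.js client (7.x style); the index name is made up and this is not our actual collector:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: process.env.ES_URL || 'http://localhost:9200' });

// Fail-safe metric writer: whatever happens inside, it must never throw into the host app.
async function recordMetric(metric) {
    try {
        await client.index({
            index: 'api-metrics', // illustrative index name
            body: metric
        });
    } catch (err) {
        // swallow and log - a broken collector must not break the API itself
        console.error('metric collection failed:', err.message);
    }
}

// in the middleware sketch above, replace console.log(metric) with:
// recordMetric(metric); // fire and forget, errors are handled inside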

To finally kill the idea of developing your own monitoring tool, here are a few more issues that come with it:

  • you have to monitor the monitoring itself - things like ElasticSearch health, the correctness of data collection and other important stuff - and that takes up a lot of your time
  • proper authentication and access control are always an issue; for example, we couldn’t rely on Grafana’s authentication and hosted it inside our infrastructure, so any time one of us wanted to check a metric from outside our network, he had to connect to our VPN
  • each time we modified something in our REST API, we had to double-check that our metrics collector wouldn’t crash or misbehave

Using a tool that pings your API from the outside

After inspecting a number of “pinging” services (most of them were quite awful), I found a cool tool called Checkly.

It looks really nice and isn’t overloaded with functions you’ll never use. It took me 5 minutes to set up; below is the dashboard I created:

checkly
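
Under the hood, this kind of check is conceptually very simple: hit the API from the outside and verify the response code and the response time. A minimal sketch (the URL and the threshold are made up, and this is of course not how Checkly is actually implemented):

const https = require('https');

// Conceptual sketch of an external "ping" check: request the endpoint,
// then verify the status code and the elapsed time against a threshold.
function pingCheck(url, maxMs) {
    return new Promise((resolve) => {
        const startedAt = Date.now();
        https
            .get(url, (res) => {
                res.resume(); // drain the body, we only care about the status code
                const elapsedMs = Date.now() - startedAt;
                resolve({ ok: res.statusCode === 200 && elapsedMs <= maxMs, statusCode: res.statusCode, elapsedMs });
            })
            .on('error', () => resolve({ ok: false, statusCode: null, elapsedMs: Date.now() - startedAt }));
    });
}

// run this every minute or so from a machine outside your own infrastructure
pingCheck('https://api.example.com/health', 2000).then(console.log);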

However, you can’t rely solely on this kind of monitoring, because these metrics have nothing to do with the actual clients’ experience. Your API may respond quickly and correctly to these pings from the outside, yet there’s always a chance it performs terribly on your clients’ data, and you’ll never know it.

That’s why we had to rule this approach out and use a tool that monitors the metrics of actual customers’ interactions with our REST API.

Conventional APMs

We considered the most popular solutions: New Relic APM and Elastic Cloud APM. Let’s go over both of them in detail.

New Relic APM

I signed up for a New Relic APM trial, added their agent to our REST API and started setting it up.

The default New Relic APM functionality includes everything you might ever need (and unfortunately much more), I’m not going to waste your time describing all its features, you can check them out on the official website. Instead, let’s focus on how useful (or not) it is for REST API monitoring.

First of all, I found the default New Relic APM dashboards overloaded, yet missing the features I wanted to see in the first place, such as response codes. I didn’t need such a detailed breakdown of every call my app handles - I know what it does and what resources it uses, I programmed it myself, for god’s sake!

Here is one of the default New Relic APM dashboards:

new-relic-default-dashboard

But then I discovered a feature called New Relic Insights - a visualization tool that lets you build custom dashboards using the data from your New Relic APM. It uses the “New Relic Query Language” (NRQL) for querying the data, and it’s quite amazing! It took me less than a minute to figure it out and start building a dashboard without reading manuals or tutorials. Well done, New Relic.

The only thing that may let you down is that you can’t group the data by custom attributes, since they’re not indexed; I had to use custom events to record all the metrics I needed.
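
Recording a custom event per request looks roughly like this - a sketch assuming the standard newrelic Node.js agent, with the “ApiCall” event name and its attributes made up for illustration:

// the newrelic agent must be required before anything else in the app
const newrelic = require('newrelic');
const express = require('express');
const app = express();

app.use((req, res, next) => {
    const startedAt = Date.now();
    res.on('finish', () => {
        // stored as a custom event, so Insights can group and filter by these attributes
        newrelic.recordCustomEvent('ApiCall', {
            client: req.get('x-client-id') || 'unknown',
            route: req.path,
            statusCode: res.statusCode,
            durationMs: Date.now() - startedAt
        });
    });
    next();
});

// an Insights widget is then a single NRQL query, e.g.:
// SELECT count(*) FROM ApiCall FACET statusCode SINCE 1 day ago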

Here is the dashboard I created after half an hour (sorry for the Russian, again):

new-relic-insights

This was promising. After I finished setting up the dashboard, I moved on to setting up the alerts. It’s a bit complicated in New Relic, but after 10 minutes you get used to it. The only drawback is the lack of predefined basic alerting rules (or maybe I missed them?) like “notify me every time my REST API responds with a 500 code”.

So, everything was set up; the only questions left were the price and the data retention (in days) for custom dashboards. The official website wasn’t very informative about either, so I contacted a sales manager and asked. The price turned out to be quite high, but it was acceptable. However, one thing spoiled everything: “You will have 8 days data retention for Insights”, the manager said. This meant we couldn’t use New Relic for counting the number of calls from our clients on a monthly basis.

My dreams were ruined, and I signed up for an Elastic Cloud APM trial.

Elastic APM

I signed up for the cloud version - an AWS-based ElasticSearch + Elastic APM + Kibana stack. The setup process is not as smooth as New Relic’s, though there’s nothing complicated about it; it took me about 15 minutes. The agent is also similar to New Relic’s in terms of integration - nothing to be bothered by.
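
For reference, the integration boils down to starting the agent at the very top of the app. A sketch with placeholder configuration values; the custom label is an illustrative assumption, not something the agent requires:

// the elastic-apm-node agent has to be started before any other module is required
const apm = require('elastic-apm-node').start({
    serviceName: 'our-rest-api',                    // placeholder
    serverUrl: process.env.ELASTIC_APM_SERVER_URL,  // the APM Server URL from Elastic Cloud
    secretToken: process.env.ELASTIC_APM_SECRET_TOKEN
});

const express = require('express');
const app = express();

// custom labels end up in ElasticSearch and can later be used in Kibana aggregations
app.use((req, res, next) => {
    apm.setLabel('client', req.get('x-client-id') || 'unknown'); // setTag() in older agent versions
    next();
});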

The default APM dashboards look identical to New Relic’s, maybe a bit less overloaded, but the overall impression is the same: you can’t use them for REST API monitoring out of the box, at least when you have to comply with strict SLA terms.

Here is one of the default Elastic APM dashboards:

So I started creating a custom dashboard with Kibana. Even though I had a strong background with ElasticSearch, I struggled with Kibana for almost a whole day - it took me about 5 hours to finally create a dashboard with all the data I needed. It turned out to be much more complicated than New Relic Insights.

But the good thing is that you can aggregate your data by any field, including custom tags, thanks to ElasticSearch and the .keyword mapping it creates for every text field by default.
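
For example, the monthly per-client call count we need for billing is a single terms aggregation over a custom tag. A sketch assuming the @elastic/elasticsearch client and that the client tag ends up under a field like labels.client - the exact field path and index pattern depend on your agent and stack version:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: process.env.ES_URL });

// monthly call count per client, grouped by the keyword mapping of a custom tag
async function monthlyCallsPerClient() {
    const { body } = await client.search({
        index: 'apm*',
        body: {
            size: 0,
            query: {
                bool: {
                    must: [
                        { term: { 'transaction.type': 'request' } },
                        { range: { '@timestamp': { gte: 'now-1M/M', lt: 'now/M' } } } // previous calendar month
                    ]
                }
            },
            aggs: {
                by_client: { terms: { field: 'labels.client.keyword', size: 100 } } // field path is an assumption
            }
        }
    });
    return body.aggregations.by_client.buckets; // e.g. [{ key: 'client-a', doc_count: 12345 }, ...]
}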

Here is the dashboard I made:

elastic-apm-custom-dashboard

So, the dashboard is ready - let’s move on to setting up the alerts.

First of all, I tried to create alerts using the Kibana Watcher UI - that didn’t work. Every time I tried to save a watcher I received an “internal server error”, and I found a few bug reports about this on discuss.elastic.co. So I decided not to waste my time trying to fix this bug and to create the watchers using the Elastic REST API instead. After an hour of studying the manuals, I managed to create all the alerts I needed.
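
For reference, creating a watcher means PUT-ing its JSON definition (like the one shown below) to the Watcher API. A sketch using the @elastic/elasticsearch client; the watch id and the file name are made up:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: process.env.ES_URL });

// the watch definition is a JSON object like the one shown below
const watchDefinition = require('./watches/slow-responses.json');

async function createWatch() {
    // equivalent REST call: PUT _watcher/watch/slow-responses
    // (PUT _xpack/watcher/watch/slow-responses on a 6.x stack)
    await client.watcher.putWatch({
        id: 'slow-responses',
        body: watchDefinition
    });
}

createWatch().catch((err) => console.error('failed to create watch:', err.message));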

Each alert is configured with a JSON object that includes the query, the threshold and the action descriptions. Below is an alert that notifies us about calls to our service that took more than 10 seconds to complete:

{
    "trigger": {
        "schedule": {
            "interval": "5m"
        }
    },
    "input": {
        "search": {
            "request": {
                "body": {
                    "size": 0,
                    "query": {
                        "bool": {
                            "must": [
                                {
                                    "term": {
                                        "context.system.hostname": "selectel-app"
                                    }
                                },
                                {
                                    "term": {
                                        "transaction.type": "request"
                                    }
                                },
                                {
                                    "range": {
                                        "transaction.duration.us": {
                                            "gte": 10000000
                                        }
                                    }
                                },
                                {
                                    "range": {
                                        "@timestamp": {
                                            "gte": "now-5m",
                                            "lt": "now"
                                        }
                                    }
                                }
                            ]
                        }
                    },
                    "aggs": {
                        "group_by_route": {
                            "terms": {
                                "field": "transaction.name.keyword",
                                "size": 5
                            },
                            "aggs": {
                                "max_resp_time": {
                                    "max": {
                                        "field": "transaction.duration.us",
                                        "script": {
                                            "lang": "painless",
                                            "source": "_value / 1000000"
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                "indices": [
                    "apm*"
                ]
            }
        }
    },
    "condition": {
        "compare": {
            "ctx.payload.hits.total": {
                "gt": 0
            }
        }
    },
    "actions": {
        "integram": {
            "throttle_period_in_millis": 1000,
            "webhook": {
                "scheme": "https",
                "host": "******.com",
                "port": 443,
                "method": "post",
                "path": "/webhook/*********",
                "params": {},
                "headers": {},
                "body": "{\"text\":\"API RESP TIME > 10s ({{#ctx.payload.aggregations.group_by_route.buckets}}{{key}}: {{doc_count}} responses (max resp time: {{max_resp_time.value}});{{/ctx.payload.aggregations.group_by_route.buckets}})\"}"
            }
        }
    }
}

Pretty complicated, right? Now imagine having 15 JSONs like this - managing them is pure hell. What’s more, you can’t segregate your alerts by application or host; you always have to include these terms in the query inside the alert JSON. This leaves a huge amount of room for making a mistake, missing a crucial incident and f*cking everything up.

Nevertheless, Elastic Cloud was the only solution that matched all our requirements. It doesn’t limit your data retention - you can use all the space you’ve paid for - and I managed to create a useful dashboard and set up all the alerts I needed. So we decided to stick with Elastic Cloud for a while.

Conclusion (sort of)

The story doesn’t end here.

We’d been using Elastic Cloud for about a month, and the inability to create and manage alerts without pain and suffering, along with Kibana’s complicated dashboard management, was driving us mad.

We still needed a simple tool with a single, simple dashboard and native alerting for just the key REST API metrics. So we began considering developing our own service to match our needs - but that’s another story.