
Monitoring with purpose

My team is in the process of releasing a new service to production, and the monitoring conversation soon came up.

I found myself going over examples, setup instructions, and tutorials on how to set up monitoring.

At some point down the rabbit hole, I realized that I wasn’t seeing the value yet, so I put down all the video tutorials and took a moment to reflect.

Why do we need monitoring?

The first thing that came to mind was to ask, why?

Why are we setting up alerting?

Why are we creating dashboards in Datadog?

Why are we paging engineers via PagerDuty?

The usual answer revolved around the need for visibility, understanding, and notifications on actionable metrics.

But that felt empty. I mean, how do you know if what you are measuring is valuable?

We could set up one thousand metrics and still feel like we are missing monitors and alerts.

Then it hit me: it's all about the user experience! D'oh.

Here comes the value

I imagined myself using our system, figuring out what keeps people coming back, and asking how I could tie that to the monitoring we are creating.

The idea is that systems have a reputation, and that reputation is built based on certain expectations.

I want to know what those expectations are and measure them.

Here is a set of questions I asked myself about our service, along with the monitoring that could help answer each one.

Is the service functional? In other words, does it do what I need it to do? To help answer this, we can measure bugs.

Is the service available? If I attempt to use it at any time of day, can I be sure that the service is up and running? Turns out you can measure availability to answer this 😊

Is it consistent? Will it do what I need every time? This one was easy: measure errors.

Is it performant? Raise your hand 🙋🏽‍♂️ if you enjoy watching a “loading” message. Measuring latency helps with this.

Lastly, is it scalable? Put simply, does it do what I need, when I need it, fast enough, regardless of other users? For that, we measure traffic and saturation.
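To make a few of those measurements concrete, here is a minimal Python sketch that derives error rate, availability, latency, and traffic from a batch of request records. The `Request` type, its field names, and the choice to count only 5xx responses as failures are assumptions made for illustration. A real setup would normally get these numbers from a monitoring tool rather than hand-rolled code, and saturation is left out because it depends on resource data (CPU, memory, queue depth) that request records alone do not carry.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    # Hypothetical record of one handled request; field names are illustrative.
    status: int         # HTTP status code returned to the user
    latency_ms: float   # how long the user waited for the response


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed on the server side (assumed: 5xx)."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def availability(requests: list[Request]) -> float:
    """Request-based availability: the share of requests that did not fail."""
    return 1.0 - error_rate(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, e.g. pct=95 for p95."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def traffic_per_second(requests: list[Request], window_seconds: float) -> float:
    """Average request rate over the observation window."""
    return len(requests) / window_seconds
```

Percentiles are used for latency instead of an average because a handful of very slow requests, the ones users actually notice, can hide behind a healthy-looking mean.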

Now we have a purpose

That quick Q&A exercise gave the monitoring conversation a goal, a direction, a purpose!

With a better understanding of how our monitoring and alerting can measure the expectations that our users have from our system, we could get to work by asking ourselves some more questions.

  • Can we confidently answer the questions above?
  • Do we understand what is actionable vs informational?
  • If actionable, will your paging system wake you up?
  • If not actionable, can we determine the business value of the information so that we know when we need to plan for it?
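The actionable vs informational split above can be sketched as a small routing rule. This is only an illustration: the `Signal` fields, the 0 to 10 `business_value` score, and the threshold are hypothetical, and a real team would agree on its own criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"            # actionable: wake someone up
    BACKLOG = "backlog"      # informational but valuable: plan work for it
    DASHBOARD = "dashboard"  # informational: keep it visible, nothing more


@dataclass
class Signal:
    name: str
    actionable: bool      # can an engineer do something about it right now?
    business_value: int   # hypothetical 0-10 score agreed on by the team


def route(signal: Signal, backlog_threshold: int = 5) -> Route:
    """Decide where a signal goes based on the questions in the list above."""
    if signal.actionable:
        return Route.PAGE
    if signal.business_value >= backlog_threshold:
        return Route.BACKLOG
    return Route.DASHBOARD
```

For example, `route(Signal("error rate above target", actionable=True, business_value=9))` returns `Route.PAGE`, while a slowly growing disk usage trend with a high business value would land in the backlog instead of paging anyone.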

Action

Now that we understand that our metrics and alerts help us measure and maintain user expectations, we are ready to start adding monitors.

I hope you enjoyed this quick journey into monitoring with me.

Some good resources on the subject are Google’s guidance on monitoring distributed systems, as well as their introduction to service level objectives.

Enjoy!

Icons made by Eucalyp from www.flaticon.com