Incident Management

Gone are the days when I could shout out to the family ‘I’m just going to reboot the server. Get off now!’ Since exposing several LAN resources such as Nextcloud and WordPress to the internet, I’ve come to realise there’s a whole customer base to consider before I cause any disruption to the network. For instance, what happens if someone’s in the middle of reading a blog post or, worse still, in the middle of a WooCommerce transaction? I certainly don’t want a reputation for providing unreliable services. Bad news travels fast. I need mechanisms in place to keep customers informed of outages and upcoming scheduled maintenance.

Statuspage

The main communication tool I use for incident management is Statuspage. It’s a really powerful but easy-to-use tool for keeping customers informed of unscheduled and scheduled outages. During an incident, I use it to advise clients of how restoration is progressing. It’s so much more useful than just having a static page that says ‘Site down’. The surprising thing is that all that power comes free at the entry level. Create your own status page and spend some time getting familiar with it. It’s brilliant!

Redirection – The big hammer approach

The other tools I use during an incident redirect users to the status page for the services that I offer to the public. These are Cloudflare (my external DNS provider) and a Caddy reverse proxy, behind which sit the LAN services that are offered externally. Which one I use depends on the context. If a component on the main conduit between my LAN and the internet is down, such as the router or the reverse proxy, then none of the services offered externally are available. In this case, I flick a switch in Cloudflare to redirect all external traffic to my status page.

Redirection – A granular approach

Where I want more granular control for redirecting a subset of customers to my status page, which might happen when just one service goes off-air, my Caddy reverse proxy comes to the rescue.

In my Caddyfile, this is the snippet I use for simple reverse proxies under normal operating conditions:

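# First argument: subdomain; second argument: address of the LAN host.
(online) {
  {args.0}.udance.com.au {
    encode gzip
    import dnschallenge

    # One JSON log file per service, keeping 7 rotated files.
    log {
      format json
      output file /var/log/caddy/{args.0}.log {
        roll_keep 7
      }
    }

    reverse_proxy http://{args.1}
  }
}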

For example, to make my blog site available externally, this is how I would use this snippet:

import online blog 10.1.1.4     # blog.udance.com.au

I’d use the snippet below to redirect customers to my status page when a service is offline:

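# First argument: subdomain. No upstream address is needed here.
(offline) {
  {args.0}.udance.com.au {
    encode gzip
    import dnschallenge
    # Temporary (302) redirect to the status page, keeping the request URI.
    redir https://udance.statuspage.io{uri} temporary
  }
}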

For example, to redirect customers of my blog site to the status page when the site is down, this is how I would use this snippet:

import offline blog 10.1.1.4     # blog.udance.com.au
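
Switching a service between the two states then comes down to which import line is active. Here is a minimal sketch, assuming both snippets are defined earlier in the same Caddyfile and that Caddy is reloaded after the change (for example with `caddy reload`, although how you reload will depend on your own setup):

```caddyfile
# Normal operation: proxy blog.udance.com.au to the LAN host.
import online blog 10.1.1.4

# Outage or maintenance: comment out the line above and
# uncomment the line below, then reload Caddy.
# import offline blog 10.1.1.4
```

Only one of the two lines should be active at a time, since both define the same site address. The offline snippet never reads its second argument, so leaving the IP address in place is harmless and means only one word changes when flipping between states.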

Alerting

So how do we let customers of a service know about an incident affecting it, or about upcoming maintenance? Statuspage provides two APIs that can be embedded within a website (a highly configurable WordPress theme is recommended for this). The first API is really just a link to the status page. For instance, on this blog site, you will see it as System Status in the header above this post. A customer can click on the link at any time to see whether any maintenance has been scheduled for the near future. This is the implementation of System Status:

The second API sends a message when a new event occurs. For example, consider the following incident:

When the incident is created, a message pops up in the lower left corner of a customer’s browser window.

…or on their phone browser.

To set up the messaging, I first copied a status embed code from Statuspage:

…and I added it to the footer of the active WordPress theme.

Note: I didn’t want the Title to appear on the web page, but the HTML code had to have a title, so I made the title a full stop, which is virtually indiscernible on the blog site.

What I’ve described in this post are pretty much all the hooks I’ve set up to help me keep my customers informed of outages and scheduled maintenance. Together, they make for good customer service and help build a better brand reputation.
