Sleepless in Seattle: Wake up with AWS Incident Manager



Dolt is a MySQL-compatible database with Git-like features. On May 18th, we launched Hosted Dolt, a cloud-hosted Dolt database with built-in logs and monitoring. If you're not familiar with Hosted Dolt, our earlier blogs are a good place to get started.

We recently added a support ticket system to the Hosted Dolt website. Users can create support tickets of varying urgency if something goes wrong and a member of the DoltHub team will be dispatched to resolve the issue.

This blog will cover how we use and set up AWS Incident Manager to manage our support tickets.

What is Hosted Dolt?

Hosted Dolt is our answer to AWS RDS or MariaDB SkySQL, and its administrative website is modeled after the AWS Management Console or GCP Console.

You use the administrative website to provision and manage a running Dolt database. Once you provision the database you connect to it with any MySQL client over the internet. When you're done with the database, deactivate it and we'll stop charging you for it.

Hosted Dolt Console

Because Hosted Dolt is used for running version controlled databases in production, uptime is extremely important. We have a team of veteran cloud service engineers to keep your database operating smoothly. However, on the off chance something goes wrong, we need an escalation plan in place to address and resolve urgent issues as soon as possible.

Comparing incident management tools

When we started to look into tools that could help us implement our support ticket system, we had two main criteria:

  1. Hosted Dolt users can create and manage tickets on our website
  2. Our team can easily manage tickets through an admin console

We considered a few options that meet these criteria - PagerDuty, Splunk On-Call (formerly VictorOps), AWS Incident Manager, and implementing the system ourselves. If an existing tool could do the job, we'd prefer not to spend the time and resources building it from scratch.

Here's a quick comparison of these three tools.

Incident management tools

PagerDuty
Initial release: August 2009
Pricing: Free for up to 5 users/month, $21 per user/month after
Features:
  • Automated precision response
  • Business-wide orchestration
  • Major incident learning

Splunk On-Call (formerly VictorOps)
Initial release: December 2012, acquired June 2018
Pricing: Starts at $5 per user per month
Features:
  • iOS and Android apps
  • Incident context and audit trail
  • Rules engine
  • Machine learning-based responder recommendations

AWS Incident Manager
Initial release: May 2021
Pricing: $7 per response plan per month
Features:
  • Automatically collect and track metrics
  • Collaborate through contacts, escalation plans, and chat channels
  • Automate repeatable steps to resolve incidents

Some of our team members had used VictorOps at their former jobs at Snapchat. While it was praised for its nice web interface and ease of on-call scheduling, there was some additional integration work and maintenance required by having monitoring in a separate system.

We ultimately decided to go with AWS Incident Manager. We use Amazon CloudWatch for metrics and logs for Hosted Dolt instances, and it integrates easily with AWS Incident Manager so we can alert and alarm on metrics. Having a single place to monitor tickets and metrics was appealing from an operational standpoint.

How our support ticket system works

A support ticket consists of an impact level, title, summary, and affected deployments. We started with two impact levels - Critical (production system down, response within an hour) and Low (general guidance, response within 24 hours). You can choose zero or more related deployments (you must have an active deployment to create a critical ticket) and add relevant details in markdown. Our on-call team member will be paged and will respond as they see fit.
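As a rough illustration (not the actual Hosted Dolt code), the ticket shape and the critical-ticket rule described above could be modeled in Go like this. The type, field, and function names are all hypothetical, and the response windows simply restate the two impact levels.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Impact is a hypothetical enum covering the two impact levels described above.
type Impact int

const (
	ImpactCritical Impact = iota // production system down
	ImpactLow                    // general guidance
)

// ResponseWindow returns the target response time for an impact level.
func (i Impact) ResponseWindow() time.Duration {
	if i == ImpactCritical {
		return time.Hour
	}
	return 24 * time.Hour
}

// Ticket mirrors the fields of a support ticket; the field names are illustrative.
type Ticket struct {
	Impact      Impact
	Title       string
	Summary     string   // markdown details
	Deployments []string // zero or more affected deployments
}

// ErrCriticalNeedsDeployment is returned when a critical ticket is filed
// by a user who has no active deployment.
var ErrCriticalNeedsDeployment = errors.New("critical tickets require an active deployment")

// Validate enforces the one rule stated above. hasActiveDeployment says
// whether the ticket creator currently has an active deployment.
func (t Ticket) Validate(hasActiveDeployment bool) error {
	if t.Impact == ImpactCritical && !hasActiveDeployment {
		return ErrCriticalNeedsDeployment
	}
	return nil
}

func main() {
	ticket := Ticket{Impact: ImpactCritical, Title: "Cannot connect to database"}
	fmt.Println(ticket.Validate(false))
	fmt.Println(ticket.Validate(true), ticket.Impact.ResponseWindow())
}
```

Keeping the rule in a single Validate method means the same check can run before any call is made to the incident backend.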

This is our support ticket workflow:

1. Oh no! Something goes wrong with a deployment. The user creates a support ticket.

Create support ticket

You can create support tickets either from the support tab on the deployment management console or from hosted.doltdb.com/support.

2. Our on-call team member gets notified via email or phone.

Support ticket email notification

3. The user realizes they should include more information and edits the ticket.

Edit support ticket

4. Our paged team member fixes the issue and uses the AWS Incident Manager console to edit and resolve the ticket.

AWS Incident Manager Console

5. The ticket gets updated on the Hosted Dolt website so the user can see the resolution.

Resolved support ticket

Setting up the support ticket system

There are a few steps we needed to take to use AWS Incident Manager to create a support ticket system on our website.

1. Create a response plan

First, we need to create a response plan in AWS Incident Manager. This lets us define who is notified and how we respond when an incident occurs. You can refer to the AWS documentation for how to create a response plan.

Ours is pretty simple and looks something like this:

AWS Response Plan

Take note of the ARN, as we will need this when setting up the API.

2. Using the AWS Systems Manager Incident Manager SDK

As mentioned in the Hosted Dolt Infrastructure blog, our Hosted Dolt API is a Golang service providing gRPC endpoints. Luckily, Amazon has an AWS SDK for Go, and we can use its ssmincidents package, which provides the API client, operations, and parameter types for AWS Systems Manager Incident Manager.

We set up an incident service within our Hosted Dolt API that creates a client from the incoming config, and then takes incoming ticket, deployment, and user information and converts them to resources that can be used by the operations provided by the ssmincidents package. Specifically we use:

  • StartIncident - Starts an incident using the ARN of the response plan created above, a title, and related items (deployment urls and creator username)
  • GetIncidentRecord - Get incident information
  • UpdateIncidentRecord - Update incident information, such as title, summary, and impact level
  • ListRelatedItems - Lists related items, in our case related deployment urls
  • UpdateRelatedItems - Add or remove related items, in our case related deployment urls

3. Using the incident service in Hosted Dolt API

Once our incident service was set up, we created some gRPC endpoints that our front end can use to view and manage tickets on our website - CreateIncident, GetIncident, ListIncidents, and UpdateIncident. You'll notice these don't necessarily map one-to-one to the Incident Manager operations we use above. There were some extra steps we needed to take to work with the AWS SDK in order to curate the user experience we wanted on our website.

The AWS operation ListIncidentRecords can return records filtered by the createdBy field. Since our Hosted Dolt users don't map to AWS users, we couldn't use this operation to list tickets for a user on our website. To get around this, when an incident is started we store some incident information, including the hosted creator ID, in the database that backs Hosted Dolt (which happens to be Dolt). This allows us to list incidents for a user. The downside is that some information, like incident state (whether the ticket is open or resolved), isn't necessarily updated in our database when updates are made from the AWS console. We can call the GetIncident operation for every listed item to get around this.
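The list-then-refresh approach can be sketched as follows. This is an illustration under assumptions, not the Hosted Dolt implementation: StoredIncident stands in for a row in the backing Dolt table, and statusFetcher stands in for whatever per-incident lookup returns the live state. Both names, and the status strings, are hypothetical.

```go
package main

import "fmt"

// StoredIncident is a hypothetical row saved when an incident is started.
type StoredIncident struct {
	ARN       string
	CreatorID string // Hosted Dolt user ID, which AWS knows nothing about
	Title     string
	Status    string // may be stale if the ticket was changed in the AWS console
}

// statusFetcher stands in for a per-incident lookup of the live status.
type statusFetcher func(arn string) (string, error)

// listForUser filters stored rows by creator, then refreshes each row's
// status from the live source so console-side changes are reflected.
func listForUser(rows []StoredIncident, creatorID string, fetch statusFetcher) ([]StoredIncident, error) {
	var out []StoredIncident
	for _, row := range rows {
		if row.CreatorID != creatorID {
			continue
		}
		status, err := fetch(row.ARN)
		if err != nil {
			return nil, fmt.Errorf("refreshing %s: %w", row.ARN, err)
		}
		row.Status = status // row is a copy, so the input slice is untouched
		out = append(out, row)
	}
	return out, nil
}

func main() {
	rows := []StoredIncident{
		{ARN: "arn-1", CreatorID: "alice", Title: "Cannot connect", Status: "OPEN"},
		{ARN: "arn-2", CreatorID: "bob", Title: "Slow queries", Status: "OPEN"},
	}
	// Pretend arn-1 was resolved from the AWS console after it was stored.
	live := map[string]string{"arn-1": "RESOLVED", "arn-2": "OPEN"}
	fetch := func(arn string) (string, error) { return live[arn], nil }

	tickets, err := listForUser(rows, "alice", fetch)
	fmt.Println(tickets, err)
}
```

The cost of this design is one lookup per listed ticket, which is acceptable while each user has a small number of tickets.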

The input for the AWS operation StartIncident doesn't include a summary, only a response plan ARN, title, impact level, and related items. It made more sense to us that the user provides a summary when they create the ticket so that the paged team member doesn't have to wait around for important information that could be crucial for getting a fix out. So CreateIncident calls StartIncident, then UpdateIncidentRecord to add the summary, and then stores the record information in our Dolt database.
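The three-step flow described above (start the incident, attach the summary, persist the record) can be sketched like this. The incidentAPI interface is a deliberately simplified stand-in: the real ssmincidents client methods take a context and input structs, so these signatures are assumptions for illustration only. incidentStore, NewTicket, and the fakes are hypothetical as well.

```go
package main

import "fmt"

// incidentAPI is a simplified, hypothetical stand-in for the two AWS
// operations used here. It does NOT match the real SDK signatures.
type incidentAPI interface {
	StartIncident(planARN, title string, impact int, relatedURLs []string) (string, error)
	UpdateIncidentRecord(incidentARN, summary string) error
}

// incidentStore is a hypothetical persistence layer backed by Dolt.
type incidentStore interface {
	Save(incidentARN, creatorID string) error
}

// NewTicket holds what the user submits from the support form.
type NewTicket struct {
	Title          string
	Summary        string
	Impact         int
	DeploymentURLs []string
}

// createIncident starts the incident, adds the summary in a second call
// (the start call has no summary field), then records who created it.
func createIncident(api incidentAPI, db incidentStore, planARN, creatorID string, t NewTicket) (string, error) {
	arn, err := api.StartIncident(planARN, t.Title, t.Impact, t.DeploymentURLs)
	if err != nil {
		return "", fmt.Errorf("start incident: %w", err)
	}
	if err := api.UpdateIncidentRecord(arn, t.Summary); err != nil {
		return "", fmt.Errorf("add summary: %w", err)
	}
	if err := db.Save(arn, creatorID); err != nil {
		return "", fmt.Errorf("store incident: %w", err)
	}
	return arn, nil
}

// fakeAPI records calls so the ordering can be inspected.
type fakeAPI struct {
	calls     []string
	summaries map[string]string
}

func (f *fakeAPI) StartIncident(planARN, title string, impact int, relatedURLs []string) (string, error) {
	f.calls = append(f.calls, "StartIncident")
	return "incident-1", nil
}

func (f *fakeAPI) UpdateIncidentRecord(incidentARN, summary string) error {
	f.calls = append(f.calls, "UpdateIncidentRecord")
	f.summaries[incidentARN] = summary
	return nil
}

// fakeStore maps incident ARN to creator ID in memory.
type fakeStore map[string]string

func (s fakeStore) Save(incidentARN, creatorID string) error {
	s[incidentARN] = creatorID
	return nil
}

func main() {
	api := &fakeAPI{summaries: map[string]string{}}
	db := fakeStore{}
	ticket := NewTicket{Title: "Cannot connect", Summary: "Connection refused since 9am", Impact: 1}
	arn, err := createIncident(api, db, "example-plan-arn", "alice", ticket)
	fmt.Println(arn, err, api.calls, db)
}
```

One consequence of the two-call sequence is a short window in which the incident exists without its summary, so a responder paged immediately may see the title first.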

When a user is updating a ticket, we also didn't want them to have to separately update ticket information and related deployments. So UpdateIncident handles both the UpdateIncidentRecord and UpdateRelatedItems AWS operations. Similarly, GetIncident uses both GetIncidentRecord and ListRelatedItems operations.
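One way a combined update could decide which related deployments to add or remove is to diff the currently listed items against the ones the user submitted. The sketch below is an assumption about how that step might look, not the actual Hosted Dolt code, and diffRelated is a hypothetical helper name.

```go
package main

import "fmt"

// diffRelated compares the deployment URLs currently attached to an
// incident with the URLs the user wants, and returns what to add and
// what to remove. Duplicates in either input are ignored.
func diffRelated(current, desired []string) (add, remove []string) {
	cur := make(map[string]bool, len(current))
	for _, u := range current {
		cur[u] = true
	}
	want := make(map[string]bool, len(desired))
	for _, u := range desired {
		if want[u] {
			continue // already handled this URL
		}
		want[u] = true
		if !cur[u] {
			add = append(add, u)
		}
	}
	for _, u := range current {
		if !want[u] && cur[u] {
			remove = append(remove, u)
			cur[u] = false // avoid removing a duplicated URL twice
		}
	}
	return add, remove
}

func main() {
	current := []string{"https://example.test/deployments/a", "https://example.test/deployments/b"}
	desired := []string{"https://example.test/deployments/b", "https://example.test/deployments/c"}
	add, remove := diffRelated(current, desired)
	fmt.Println("add:", add)
	fmt.Println("remove:", remove)
}
```

The resulting add and remove sets could then be passed along to UpdateRelatedItems, while the title, summary, and impact changes go through UpdateIncidentRecord.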

4. Set up the front end

Once our API endpoints are ready to go, we need to create a UI to display, view, and update support tickets. We created a support page and plugged in React components for a support ticket form (used by both create and update), a ticket list, and a detailed ticket view. Create an account and deploy a database to check it out!

Future work

As Hosted Dolt grows and more people use our support ticket system, we can add more features that can further help us support our users. Some of these include:

  • Notifying users when a ticket is updated or resolved by a team member
  • Adding an option to open a chat channel for more immediate communication
  • Add more impact types (AWS also has High/partial failure, Medium/reduced service, and No impact options)
  • Create a post-incident analysis to improve our response in the future
  • List support tickets by deployment so organization members, not just ticket creators, can view open and resolved tickets

Conclusion

If you're curious about Hosted Dolt or want to chat about incident management, join us on Discord or file an issue on GitHub. We have some exciting features in the works for Hosted Dolt, including backups and a database UI similar to DoltHub for your hosted instance.
