PagerDuty

PagerDuty Business Response

PagerDuty Business Response is an incident response tool built to communicate the status of critical incidents affecting an organization's infrastructure.

Opportunities discovered by chance

At PagerDuty, our regular contact with customers gave us unique insight into their pain points and revealed recurring patterns. One recurring comment from our customers' executive, support, and sales teams was that tracking the progress of an ongoing incident was at times difficult. This was understandable: PagerDuty was designed for developers, not for non-technical users, and it simply wasn’t tailored to their needs. Consequently, our customers in engineering, network operations, and security roles were frequently interrupted by non-technical personnel during critical moments. Executive staff would often join incident calls seeking updates, inadvertently prolonging the engineers' troubleshooting and service restoration efforts.

Digging a little deeper

We drew on data from various sources to gauge the extent of these issues. This involved writing scripts, conducting interviews, distributing surveys, querying Zendesk, consulting with account managers and sales teams, and analyzing user interactions within the platform. Through this research, we validated existing findings and unearthed fresh insights:

  • Executive and customer success teams require deeper insight into ongoing incidents.
  • Responders often face disruptive interruptions at inopportune moments from external parties.
  • Support personnel lack essential information to effectively communicate downtime occurrences to customers.
  • Non-technical personnel perceive technical services based on their business impact and functionality, not as individual services.

I don't get it. What's a service?

Each technical service within PagerDuty represents a vital piece of infrastructure for the customer's organization. When an issue arises with one of these services, it triggers an incident, prompting a coordinated response. An organization may have a handful of technical services or a great many, covering various functions, with names such as Transactions, Payments API, Checkout Service, or Authentication. In PagerDuty, each service is created and monitored independently. Our research indicated that customers in non-technical roles often mentally group these services by business function. For example, the services above might collectively be referred to as a single entity, such as Web Application. A member of the customer support team would likely prefer to be told there is a problem with the Web Application rather than with one of its individual services. Technical staff and engineers, by contrast, need to identify precisely which service is impacted.

Project goal

Create a new status communication tool to help an entire organization understand the impact of an ongoing incident.

  • Provide support teams with accurate and current information.
  • Frame services for non-technical customers as a single entity.
  • Remove distractions from responders.

A two-part project

We decided together to develop an internal status dashboard for PagerDuty customers: a tool, accessible to all members of the organization, that provides real-time updates on ongoing incidents. To cater to non-technical users, we conceived the idea of organizing technical services into groups. These groups, termed business services, would comprise multiple technical services representing a common business function. The status dashboard would relay the overall health of these business functions rather than that of individual technical services. When an incident affects any technical service within a business service, the entire business service is flagged as impacted.
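The grouping rule described above can be illustrated with a minimal sketch. This is not PagerDuty's implementation or API; the class, the function, and the "Reporting" and "Report Builder" names are hypothetical and exist only to show how one incident on a supporting technical service flags the whole business service.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessService:
    """Hypothetical model: a group of technical services sharing one business function."""
    name: str
    supporting_services: set[str] = field(default_factory=set)


def impacted_business_services(business_services, services_with_incidents):
    """Return the names of business services that have at least one
    supporting technical service with an open incident."""
    affected = set(services_with_incidents)
    return [
        bs.name
        for bs in business_services
        if bs.supporting_services & affected  # non-empty overlap means impacted
    ]


# Technical service names taken from the example in this case study.
web_app = BusinessService(
    "Web Application",
    {"Transactions", "Payments API", "Checkout Service", "Authentication"},
)
# Hypothetical second business service, for contrast.
reporting = BusinessService("Reporting", {"Report Builder"})

# An incident on "Payments API" flags the whole "Web Application".
print(impacted_business_services([web_app, reporting], ["Payments API"]))
```

In this sketch, a support agent sees only that Web Application is impacted, while the responder still works with the specific technical service.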

✨ The Design ✨

We built an experience that allows teams to construct business services by grouping related technical services.

Business services are configured by adding technical services as a supporting service of a business function.

When an incident is triggered on a technical service that the business service relies on, the status dashboard communicates the impact.

Incident history provides a log of past incidents going back six months. Later improvements added filtering by custom ranges.

Not every incident is the same

Following our beta release, customers began voicing concerns that business services were being displayed as impacted even when they weren't truly affected. Not every incident is critical, and the status dashboard's reporting of non-critical incidents was causing unnecessary alarm among users in non-technical roles. Engineers expressed frustration at being inundated with inquiries about issues that didn't demand immediate attention. Frequently, triggered incidents amounted to nothing more than warnings that engineers could address at a later time. Warnings were so commonplace that many customers found the status dashboard in a perpetual state of impact. To address this, we introduced a small feature that gives customers control over which incidents are displayed as impacting, based on severity level.

We added the ability to set thresholds for when an incident would impact a business service.
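As a rough illustration of the threshold idea, the sketch below filters incidents by severity before deciding whether a business service is shown as impacted. The case study does not say how thresholds were defined, so the `Severity` scale, the function names, and the status strings are all assumptions made for this example.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a higher value means a more severe incident."""
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


def is_impacting(incident_severity, threshold):
    """An incident counts as impacting only at or above the configured threshold."""
    return incident_severity >= threshold


def business_service_status(incident_severities, threshold):
    """Return "impacted" if any open incident meets the threshold, else "operational"."""
    if any(is_impacting(s, threshold) for s in incident_severities):
        return "impacted"
    return "operational"


# With the threshold set to ERROR, warnings alone no longer flag the service.
print(business_service_status([Severity.WARNING, Severity.WARNING], Severity.ERROR))
# A critical incident still does.
print(business_service_status([Severity.WARNING, Severity.CRITICAL], Severity.ERROR))
```

Under these assumptions, a dashboard that was permanently "impacted" because of routine warnings would return to "operational" until something at or above the chosen level occurs.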

The results

The status dashboard reached general availability through a phased rollout. During the beta phase, we used feedback from test groups to refine the user experience. Within six months of its official release, we had met our goals for new user licenses and total account adoption.