PagerDuty

PagerDuty Business Response

PagerDuty for Business Response is an incident response tool built to communicate the status of critical incidents affecting an organization's infrastructure.

The problem

PagerDuty was receiving complaints from two separate but equally important user groups. Engineering teams complained of feeling blocked and sometimes intimidated during incident response because unnecessary personnel interrupted them while they were trying to resolve issues. Often, these were staff in director or executive roles overshadowing engineering teams.

Non-responders, or what we will call non-technical staff, felt left in the dark as incidents progressed. Non-responders are staff in roles such as customer support agents, account executives, or the previously mentioned executives and directors. They didn't have the insight they needed to make decisions about the business as incidents unfolded.

Digging a little deeper

Partnering with a dedicated user researcher, we developed a research plan that I would later execute. Our goal was to understand the scale and severity of these complaints through customer interviews, surveys, and analysis of platform data.

Customer interviews

I segmented both user groups, invited equal numbers from each to remote interviews, and prepared an interview script that was tailored to the conversation but not leading.

  • 63% of those in engineering roles indicated that response interruption was a top frustration
  • 73% of those in non-technical roles indicated experiencing a lack of clarity during a response
  • Once someone indicated these types of issues were a problem, I was able to deep dive into specific details.

Customer surveys

I again segmented both user groups and distributed surveys at a much larger scale than the customer interviews. These were multiple-choice questions designed to surface the frustrations each group experiences during an ongoing incident.

  • 12% response rate, 96 respondents
  • 64% of engineers indicated personnel interruptions as their primary or secondary frustration
  • 68% of those in non-technical roles selected a response indicating they felt left in the dark during an incident

Platform analysis

A key feature of PagerDuty's incident response tool is its integrations. One of the more popular ones is Zoom. With the Zoom integration, customers can automatically start a conference bridge once an incident is triggered. I used data from this feature to identify 100 users who fell into the non-technical audience and observed that in the past six months, 51% of them had joined the conference bridge of an incident.

Key insight

Non-technical users think about services in the context of a business function, not individual pieces of infrastructure.

I don't get it. What's a service?

Each technical service within PagerDuty represents a vital piece of infrastructure for the respective customer's organization. When an issue arises with one of these services, it triggers an incident, prompting a coordinated response. Organizations may have few or numerous technical services, encompassing various functions. These services might be called something such as Transactions, Payments API, Checkout Service, or Authentication. In PagerDuty, each service is created and monitored independently. Our research indicated that customers in non-technical roles often mentally group these services based on their business function. For example, the aforementioned services might collectively be referred to as a single entity, such as Web Application. A member of the customer support team would likely prefer to receive information indicating a problem with the Web Application rather than one of its individual services. However, for technical staff and engineers, their focus lies in precisely identifying the impacted service.

The opportunity

We have an opportunity to carve out an entirely new branch of the business.

PagerDuty, no longer for just engineers

A product historically built for a highly technical user base could be expanded to a much larger audience: people who are invested in incident outcomes but are not part of the response process. Not only can we create a brand-new, revenue-generating service, we can also alleviate the pains experienced by engineers by separating the two groups on the platform.

Proposed solution

An internal status communication tool that communicates impact in terms of business areas instead of individual services.

Exploring the problem space

To communicate the impact to specific business areas, we would need a way to map individual technical services to a parent that represents the group. The idea was simple: if any one of the child services experiences an incident, that business area is considered impacted. I led a series of workshops and whiteboarding sessions with product and engineering to understand how grouping would work and how incidents would translate to a dashboard serving the new non-technical audience. Each of these was then mapped into user flows with finer detail.
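The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PagerDuty's implementation: the class names `TechnicalService` and `BusinessArea`, the `open_incidents` counter, and the `is_impacted` method are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TechnicalService:
    """One independently monitored piece of infrastructure (hypothetical model)."""
    name: str
    open_incidents: int = 0


@dataclass
class BusinessArea:
    """A parent that groups child technical services, e.g. "Web Application"."""
    name: str
    children: List[TechnicalService] = field(default_factory=list)

    def is_impacted(self) -> bool:
        # The area counts as impacted as soon as any one child has an open incident.
        return any(service.open_incidents > 0 for service in self.children)


# Example: two child services grouped under one business area.
payments = TechnicalService("Payments API")
checkout = TechnicalService("Checkout Service")
web_app = BusinessArea("Web Application", [payments, checkout])

before = web_app.is_impacted()   # no child has an incident yet
checkout.open_incidents = 1      # an incident is triggered on one child
after = web_app.is_impacted()    # the parent area is now impacted
```

Keeping the rule on the parent means a dashboard only needs to ask each business area one question, without knowing which child service is responsible.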

Solutions and testing

Designing an experience to group services and present impacted ones on a dashboard was the easy part. Where this project became complex was in understanding the terminology engineers use. The technical services that represent a business area are dependencies of the parent service grouping. In the beginning, the language we were using for these dependencies was Upstream and Downstream. What I uncovered during testing is that each of these terms can mean the exact opposite depending on the organization. After countless iterations, Supporting Services and Dependent Services tested well and were used to represent service dependencies.

PagerDuty offers a feature for incident priority. The initial thought was to use this to represent severity of an incident on the new status dashboard. We would discover during testing that customers don't always use priority in the way we expected. Priority is customizable and the way teams use it differs greatly. Customer testing informed us that non-technical groups don't necessarily care about the severity of the incident. They simply care about whether the business area is experiencing issues or if it is healthy. This led to using a simple binary representation, Impacted or Healthy.

✨ The Design ✨

We built an experience that allows teams to construct business services by grouping related technical services.

Business services are configured by adding technical services as a supporting service of a business function.

When an incident is triggered on a technical service that the business service relies on, the status dashboard communicates the impact.

Incident history provides a log of past incidents going back six months. Improvements would eventually be made to filter by custom ranges.

Not every incident is the same

Following our beta release, customers began voicing concerns regarding the display of business services as impacted, even when they weren't truly affected. Not every incident carries critical importance, and the status dashboard's communication of non-critical incidents was causing unnecessary alarm among users in non-technical roles. Engineers expressed frustration at being inundated with inquiries for issues that didn't demand immediate attention. Frequently, triggered incidents amounted to nothing more than warnings that engineers could address at a later time. Warnings were so commonplace that many customers found the status dashboard in a perpetual state of impact. To address this issue, we introduced a small feature enabling control over the display of impacting incidents, allowing for better management of severity levels.

We added the ability to set thresholds for when an incident would impact a business service.
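As a rough sketch of how such a threshold could work, the example below only marks a business service as Impacted when an incident on one of its supporting services meets a configured priority threshold. Everything here is an assumption made for illustration: the `Incident` type, the `priority_rank` field (1 meaning most severe), the `dashboard_status` function, and the choice to treat incidents with no priority as below the threshold. The actual feature may define thresholds differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; priority_rank 1 is the most severe."""
    service: str
    priority_rank: Optional[int] = None  # None means no priority was assigned


def dashboard_status(
    incidents: Iterable[Incident],
    supporting_services: Set[str],
    threshold_rank: int,
) -> str:
    """Return "Impacted" if any relevant incident meets the threshold, else "Healthy"."""
    for incident in incidents:
        if incident.service not in supporting_services:
            continue  # incident is on a service this business service does not rely on
        if incident.priority_rank is None:
            continue  # assumption: unprioritized incidents never impact the dashboard
        if incident.priority_rank <= threshold_rank:
            return "Impacted"
    return "Healthy"


# Example: a low-priority warning no longer flips the dashboard.
supporting = {"Payments API", "Checkout Service"}
warning_only = [Incident("Checkout Service", priority_rank=4)]
with_outage = warning_only + [Incident("Payments API", priority_rank=1)]

status_warning = dashboard_status(warning_only, supporting, threshold_rank=2)
status_outage = dashboard_status(with_outage, supporting, threshold_rank=2)
```

With a threshold of 2, the warning alone leaves the business service Healthy, while the rank 1 incident marks it Impacted.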

The results

Status Dashboard transitioned into a general availability feature via a phased rollout. During the beta phase, we used feedback from test groups to refine the user experience. Within six months of its official release, we had met our goals for new user licenses and total account adoption.