PagerDuty
PagerDuty was receiving complaints from two separate but equally important user groups. Engineering teams complained of feeling blocked, and sometimes intimidated, during incident response because unnecessary personnel interrupted them while they were trying to resolve issues. Often, these were staff in director or executive roles overshadowing the engineering teams.
Non-responders, or what we will call non-technical staff, felt left in the dark as incidents progressed. Non-responders are staff in roles such as customer support agents, account executives, or the previously mentioned executives and directors. They didn't have the insight they needed to make decisions about the business as incidents unfolded.
I partnered with a dedicated user researcher to develop a research plan that I would later execute. Our goal was to understand the scale and severity of these complaints through customer interviews, surveys, and analysis of platform data.
I segmented both user groups, invited equal numbers from each for remote interviews, and prepared an interview script that was tailored to the conversation but not leading.
I again segmented both user groups and distributed surveys at a much larger scale than the customer interviews. These were multiple-choice questions designed to ask about the frustrations each group experiences during an ongoing incident.
A key feature of PagerDuty's incident response tool is its integrations. One of the more popular ones is Zoom. With the Zoom integration, customers can automatically start a conference bridge once an incident is triggered. I used data from this feature to identify 100 users who fell into the non-technical audience and observed that, in the past six months, 51% of them had joined the conference bridge of an incident.
Each technical service within PagerDuty represents a vital piece of infrastructure for the respective customer's organization. When an issue arises with one of these services, it triggers an incident, prompting a coordinated response. Organizations may have few or numerous technical services, encompassing various functions. These services might have names such as Transactions, Payments API, Checkout Service, or Authentication. In PagerDuty, each service is created and monitored independently. Our research indicated that customers in non-technical roles often mentally group these services based on their business function. For example, the aforementioned services might collectively be referred to as a single entity, such as Web Application. A member of the customer support team would likely prefer to receive information indicating a problem with the Web Application rather than with one of its individual services. Technical staff and engineers, however, focus on identifying precisely which service is impacted.
A product historically built for a highly technical user base could be expanded to a much larger audience: people who are invested in incident outcomes but are not part of the response process. Not only could we create a brand-new, revenue-generating service, we could also alleviate the pains experienced by engineers by separating the two groups on the platform.
To communicate the impact to specific business areas, we would need a way to map individual technical services to a parent that represents the group. The idea was simple: if any one of the child services experiences an incident, that business area is considered impacted. I led a series of workshops and whiteboarding sessions with product and engineering to understand how grouping would work and how incidents would translate to a dashboard serving the new non-technical audience. Each of these was then mapped into user flows with finer detail.
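As a rough illustration of that grouping rule, here is a minimal Python sketch. It is not PagerDuty's implementation; the names BusinessArea, child_services, and is_impacted are hypothetical, and the service names are the examples used earlier in this case study.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessArea:
    """A parent grouping (hypothetical model) for related technical services."""
    name: str
    child_services: set[str] = field(default_factory=set)


def is_impacted(area: BusinessArea, services_with_incidents: set[str]) -> bool:
    """A business area is impacted if any one of its child services has an incident."""
    return bool(area.child_services & services_with_incidents)


web_application = BusinessArea(
    name="Web Application",
    child_services={"Transactions", "Payments API", "Checkout Service", "Authentication"},
)

# An incident on a single child service marks the whole business area as impacted.
print(is_impacted(web_application, {"Payments API"}))

# An incident on an unrelated service (hypothetical name) leaves it unaffected.
print(is_impacted(web_application, {"Reporting"}))
```

The set intersection keeps the rule to a single expression: a non-technical viewer only needs to know whether the parent is affected, while the child service names remain available for engineers.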
Designing an experience to group services and present impacted ones on a dashboard was the easy part. Where this project became complex was in understanding the terminology of engineers. The technical services that represent a business area are dependencies of the parent service grouping. In the beginning, the language we were using for these dependencies was Upstream and Downstream. What I uncovered during testing is that these terms can carry opposite meanings depending on the organization. After countless iterations, Supporting Services and Dependent Services tested well and were adopted to represent service dependencies.
PagerDuty offers a feature for incident priority. The initial thought was to use this to represent the severity of an incident on the new status dashboard. During testing, we discovered that customers don't always use priority the way we expected: it is customizable, and the way teams use it differs greatly. Customer testing also informed us that non-technical groups don't necessarily care about the severity of an incident. They simply care whether a business area is experiencing issues or is healthy. This led to a simple binary representation: Impacted or Healthy.
Following our beta release, customers began voicing concerns about business services being displayed as impacted even when they weren't truly affected. Not every incident carries critical importance, and the status dashboard's communication of non-critical incidents was causing unnecessary alarm among users in non-technical roles. Engineers expressed frustration at being inundated with inquiries about issues that didn't demand immediate attention. Frequently, triggered incidents amounted to nothing more than warnings that engineers could address at a later time. Warnings were so commonplace that many customers found the status dashboard in a perpetual state of impact. To address this issue, we introduced a small feature that lets customers control which incidents are displayed as impacting, so that low-severity warnings no longer raise unnecessary alarm.
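The case study does not spell out how this control works. One plausible shape, offered purely as an illustration, is a minimum priority that an incident must reach before it is shown as impacting. In the sketch below, the priority labels P1 to P5, the Incident class, and the threshold parameter are all assumptions rather than details of the shipped feature.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical priority labels, ordered from most to least urgent.
# Real accounts can customize priorities, so this list is an assumption.
PRIORITY_ORDER = ["P1", "P2", "P3", "P4", "P5"]


@dataclass
class Incident:
    service: str
    priority: Optional[str]  # None when no priority was assigned


def shows_as_impacting(incident: Incident, threshold: str) -> bool:
    """Return True if the incident is at least as urgent as the threshold."""
    if incident.priority is None:
        # Assumption: incidents without a priority are treated as warnings.
        return False
    return PRIORITY_ORDER.index(incident.priority) <= PRIORITY_ORDER.index(threshold)


incidents = [
    Incident(service="Payments API", priority="P1"),
    Incident(service="Checkout Service", priority="P4"),
    Incident(service="Authentication", priority=None),
]

# With a threshold of P2, only the P1 incident is displayed as impacting.
visible = [i.service for i in incidents if shows_as_impacting(i, threshold="P2")]
print(visible)
```

A threshold like this keeps the dashboard's binary Impacted or Healthy display intact while letting each customer decide which incidents are serious enough to surface.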
Status Dashboard transitioned into a general availability feature via a phased rollout. During the beta phase, we used feedback from test groups to refine the user experience. Within six months of its official release, we had met our goals for both new user licenses and total account adoption.