PagerDuty
At PagerDuty, regular contact with our customers gave us insight into their pain points and the patterns behind them. One recurring comment from our customers' executive, support, and sales teams was that tracking the progress of an ongoing incident was often difficult. This was understandable: PagerDuty was designed for developers, not for non-technical users, and it wasn't tailored to their needs. As a result, our customers in engineering, network operations, and security roles were frequently interrupted by non-technical personnel during critical moments. Executive staff would often join incident calls seeking updates, inadvertently prolonging the engineers' troubleshooting and service restoration efforts.
We drew on data from a range of sources to gauge the extent of these issues: writing scripts, conducting interviews, distributing surveys, querying Zendesk, consulting with account managers and sales teams, and analyzing user interactions within the platform. This research both validated our existing findings and surfaced new insights.
Each technical service within PagerDuty represents a vital piece of infrastructure for the customer's organization. When an issue arises with one of these services, it triggers an incident, prompting a coordinated response. An organization may have a handful of technical services or a great many, covering a variety of functions. These services might have names such as Transactions, Payments API, Checkout Service, or Authentication. In PagerDuty, each service is created and monitored independently. Our research indicated that customers in non-technical roles often mentally group these services by business function. For example, the services named above might collectively be thought of as a single entity, such as Web Application. A member of the customer support team would likely prefer to be told that there is a problem with the Web Application rather than with one of its individual services. Technical staff and engineers, by contrast, need to identify precisely which service is impacted.
We decided together to develop an internal status dashboard for PagerDuty customers: a tool that provides real-time updates on ongoing incidents and is accessible to every member of the organization. To cater to non-technical users, we came up with the idea of organizing technical services into groups. These groups, called business services, would each comprise multiple technical services representing a common business function. The status dashboard would relay the overall health of these business functions rather than that of individual technical services. When an incident affects a technical service within a business service, the entire business service is flagged as impacted.
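The grouping rule described above can be illustrated with a minimal Python sketch. This is not PagerDuty's implementation; the class names, the `has_open_incident` field, and the `is_impacted` method are hypothetical and exist only to show how a business service's status could be derived from its technical services.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalService:
    # Hypothetical model: a service is either healthy or has an open incident.
    name: str
    has_open_incident: bool = False


@dataclass
class BusinessService:
    # A named group of technical services that share a business function.
    name: str
    technical_services: list[TechnicalService] = field(default_factory=list)

    def is_impacted(self) -> bool:
        # The whole group is flagged as soon as any member has an open incident.
        return any(s.has_open_incident for s in self.technical_services)


payments = TechnicalService("Payments API")
web_application = BusinessService(
    "Web Application",
    [
        TechnicalService("Transactions"),
        payments,
        TechnicalService("Checkout Service"),
        TechnicalService("Authentication"),
    ],
)

status_before = web_application.is_impacted()
payments.has_open_incident = True  # an incident is triggered on one member
status_after = web_application.is_impacted()
```

In this sketch a support agent only needs to look at `web_application.is_impacted()`, while an engineer can still inspect the individual services to see which one is affected.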
Following our beta release, customers began voicing concerns that business services were being displayed as impacted even when they weren't truly affected. Not every incident is critical, and the status dashboard's reporting of non-critical incidents was causing unnecessary alarm among users in non-technical roles. Engineers expressed frustration at being inundated with inquiries about issues that didn't demand immediate attention. Frequently, triggered incidents amounted to nothing more than warnings that engineers could address later. Warnings were so commonplace that many customers found their status dashboard in a perpetual state of impact. To address this, we introduced a small feature that gave customers control over which incidents are displayed as impacting, based on their severity level.
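One way to picture this kind of control is a severity threshold applied before a business service is flagged. The sketch below is an assumption about how such a filter could work, not a description of the shipped feature; the `Severity` levels, the `is_impacted` function, and the default threshold are all hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; higher values are more severe.
    INFO = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


def is_impacted(open_incidents, threshold=Severity.ERROR):
    """Return True if any open incident meets or exceeds the threshold.

    open_incidents is a list of (technical_service_name, severity) pairs
    for the technical services in one business service.
    """
    return any(severity >= threshold for _, severity in open_incidents)


# A warning alone does not flag the business service at the default threshold.
warnings_only = [("Payments API", Severity.WARNING)]
quiet = is_impacted(warnings_only)

# Lowering the threshold, or adding a critical incident, flags it.
strict = is_impacted(warnings_only, threshold=Severity.WARNING)
with_critical = is_impacted(warnings_only + [("Checkout Service", Severity.CRITICAL)])
```

With a threshold like this, routine warnings stay visible to engineers in their usual tools while the dashboard remains quiet for everyone else.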
Status Dashboard moved to general availability through a phased rollout. During the beta phase, we used feedback from test groups to refine the user experience. Within six months of the official release, we had met our goals for new user licenses and total account adoption.