How We Keep Your Services Running: Inside Our Prog Lead Process

Ever wonder what happens when a data sync is delayed, a PAD tool hits some snags, or an issue is raised by a PAD user? We want to pull back the curtain a bit and share how our engineering team monitors, responds to, and resolves any issues that come up. 

Each week, one of our engineers takes on the role of Prog Lead, our internal on-call engineer responsible for keeping things running smoothly. The Prog Lead is the first point of contact for investigating alerts, handling time-sensitive requests from partners, and coordinating incident response. They are supported by the rest of our engineering team, who collectively serve as the secondary on-call. The Prog Lead tags in other engineers to pinch-hit when multiple things are going on at once or when an issue falls outside their area of expertise. Ensuring the reliability of our systems is a team effort, and having a Prog Lead gives us all the peace of mind to focus on our work, knowing that we will be tagged in when needed!

For monitoring and alerting, we use Google Cloud Monitoring to define metrics and alerting policies that tell us when something is wrong with our systems or syncs. We use Windmill (an internal developer platform) to run automated checks that confirm your data is up to date and ready. When something does go wrong, these tools integrate with Slack and PagerDuty to alert the Prog Lead.
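
For the technically curious, an automated data freshness check can be quite small. The sketch below is purely illustrative and is not the actual check that runs in Windmill: the function name `find_stale_syncs`, the sync names, and the 24-hour threshold are all hypothetical. It simply compares each sync's last successful run against a maximum allowed age and reports the ones that have fallen behind.

```python
from datetime import datetime, timedelta, timezone


def find_stale_syncs(last_synced, max_age, now=None):
    """Return the sorted names of syncs whose last successful run is older than max_age.

    last_synced: dict mapping a sync name to a timezone-aware datetime
    max_age:     timedelta describing how old a sync may be before it counts as stale
    now:         optional timezone-aware datetime, mainly useful for testing
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name for name, finished_at in last_synced.items()
        if now - finished_at > max_age
    )


if __name__ == "__main__":
    # Hypothetical example data: two syncs, one of which has fallen behind.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_synced = {
        "contacts_sync": now - timedelta(hours=30),
        "events_sync": now - timedelta(hours=2),
    }
    stale = find_stale_syncs(last_synced, max_age=timedelta(hours=24), now=now)
    if stale:
        # A real check would raise an alert here instead of printing.
        print("Stale syncs:", ", ".join(stale))
```

In a real setup, the result of a check like this would feed an alert rather than a print statement, which is where the Slack and PagerDuty integrations come in.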

Most issues are minor and resolved quickly, often with a simple restart. But when something more serious comes up, the Prog Lead kicks into action by calling an incident and making sure the team knows what's going on. They start by moving the conversation to a dedicated incident Slack channel and tagging in people from our Partnerships and Engineering teams to help notify affected parties and diagnose the issue. The team updates our Statuspage so you are notified of any issues that may affect you, then works together to resolve the issue. An incident isn’t over once a fix is in place, though! We schedule a blameless retro to figure out how we can prevent similar incidents in the future.

Our Prog Lead process ensures someone is always at the wheel, so you can trust your tools are monitored, maintained, and backed by a team that cares. When an issue affects you, the Prog Lead will post updates on our Statuspage. This is your go-to source for real-time information on any issues with PAD. We recommend signing up for automatic notifications for the systems and tools you use so you’re always in the know.

And whenever you have any questions or need some help, reach out to us at help@techallies.org and our team will make sure your questions are answered and problems are resolved!
