For agile software teams, taking full responsibility not only for development and test but also for operations is a challenge. On top of that, supporting their services 24/7 may seem like more than one can demand.
This post offers some thoughts to consider when deciding how to handle 24/7 support of your software applications and services.
First: many have seen the value of clear ownership of the code that builds your application. By code I mean application code, tests, database scripts, and configuration, and also the delivery pipeline definition and infrastructure code. It is important that each application or service has one (and only one) owning team.
At the very start, the team that chooses to create a repository (e.g. a Git repo in GitLab) and store some code in it is the owner of that code’s lifecycle. From the first day the code hits production, the team that owns it must also take care of its 24/7 operational support. If the team does not, the team is not (at least by my definition) autonomous. A team is usually 3 to 8 people.
Consider the aspects below before deciding whether to involve an external team or partner to support the operation of your code in production:
– Which team do you think would handle support better than your own? Your team knows its code best. Agree?
– Do you believe you can write solution templates for an external person to follow when things start to fail at night? And if you do write “disaster instructions”, will the external person who follows them do more good than harm? For all types of things that can go wrong?
– What about offloading the heavy lifting of, for example, a database server to a cloud provider? With a managed service such as AWS RDS, your team doesn’t need to patch servers or build high availability itself. The cloud provider manages that for you. Of course, you need to invest in understanding what the cloud provider does and does not do for you. With serverless services, your team probably does not need to do any capacity planning at all.
– Zero-downtime deployments: when your services can be updated without end-user impact, your team members will deploy during office hours. The QA mindset increases significantly when the engineers who write the code also push the change to production the same day. And when the change is pushed, they are awake and ready to deal with the end-user impact, good or bad.
– If your services are very business critical, consider introducing a pager system, that is, an on-call schedule for all team members. If the system goes down at night, the engineer(s) who hold the pager will be notified. But for that to work you really need reliable monitoring of your apps. False alarms at night are not very popular. This is unusual in Europe but more common in the US.
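To make the pager idea a bit more concrete, here is a minimal Python sketch of two pieces it relies on: a rotation that decides who holds the pager on a given day, and a simple rule that pages only after several consecutive failed health checks, which is one way to cut down on false alarms at night. The function names, the weekly shift length, and the threshold of three failures are all assumptions made for illustration, not a description of any particular paging product.

```python
from datetime import date


def on_call_engineer(team, day, rotation_start, shift_days=7):
    """Return the team member holding the pager on `day`.

    Hypothetical helper: assumes a simple round-robin rotation where
    each person is on call for `shift_days` days in a row.
    """
    if not team:
        raise ValueError("team must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    return team[shifts_elapsed % len(team)]


def should_page(check_results, failures_needed=3):
    """Page only when the most recent checks have all failed.

    `check_results` is a list of booleans, oldest first, where True
    means the health check passed. Requiring several failures in a row
    (an assumed threshold of 3) avoids waking someone up for one blip.
    """
    if len(check_results) < failures_needed:
        return False
    return not any(check_results[-failures_needed:])


# Example usage with made-up names and dates.
team = ["alice", "bob", "carol"]
start = date(2024, 1, 1)
print(on_call_engineer(team, date(2024, 1, 10), start))  # bob
print(should_page([True, False, False, False]))          # True
print(should_page([False, True, False, False]))          # False
```

In practice the threshold and shift length are trade-offs the team has to pick for itself: a higher threshold means fewer false alarms but a slower reaction to a real outage.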
Finally: if you are seriously considering external support, are you doing it to increase customer satisfaction? Or just to have someone else to blame when your services are not available?