DevOps and a Culture of Trust

2021-06-25 opinions

Modern DevOps is about culture more than tooling and tech. That culture is a blend of a “trust but verify” approach to what engineers deliver and a sustained effort to increase that trust by de-risking work and eliminating the invisible. That effort should create a virtuous circle that encourages generalist engineers to do more in more areas of the stack.

Trust is often the dividing line between devs and the operational functions, and I have seen firsthand cases where a whole ops team felt that their tools were a little too sharp for the engineers working on the software side of things. There is a managerial and organisational aspect to DevOps, which inevitably ends up feeding culture downstream. It is rare to see the concept of “good culture” without the idea of “trust”.

But trust probably should not be given without caveats, especially when the penalties for bad events can be so severe. For that, we can thank something that engineers in our industry share with runway models: our work is often associated with events, and that’s a good thing.

Those events represent units of work. They come from an engineer’s current BAU work (commits), from project-related activity (release tags or merged pull requests), or from the system itself (e.g. alerts from disasters, or simply status reports).

The number of applications for these events is enormous: they can feed an array of systems and applications, and they are the source of the persistent association of DevOps with its tools.

Another way that DevOps practices facilitate trust is via the doctrine that “for something to happen, it has to be spelt out”. In other words, every command should be expressed as text. To enable that, many of our tools take their instructions from configuration, which is to say, the result of parsed text.

Indeed, most changes should be visible at the press of a “compare” button and in a fairly readable format. The goal is to weed out unrecorded and dated practices, like finishing the configuration by manually SSH’ing into a server or casting some other unrecorded spell from a UI dashboard.
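
As a rough illustration of that “compare” button, here is a minimal Python sketch that diffs two versions of a text configuration using the standard library. The file name `deploy.yaml` and the configuration keys are made up for the example; any text-based config would behave the same way.

```python
import difflib


def config_diff(old: str, new: str, name: str = "deploy.yaml") -> str:
    """Return a unified diff between two versions of a text config.

    `name` is a hypothetical file name, used only to label the diff.
    """
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
    )
    return "".join(lines)


# Two hypothetical versions of the same configuration file.
OLD = "replicas: 2\nimage: app:1.4.0\n"
NEW = "replicas: 3\nimage: app:1.4.0\n"

diff = config_diff(OLD, NEW)
print(diff)
```

The change from two replicas to three shows up as one removed line and one added line, which is precisely the kind of record that a manual tweak over SSH never leaves behind.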

This allows chains that were nonexistent a couple of years ago to exist, and those chains let us trace things backwards during an outage following a bad deployment: the event that triggered it, the code change that created the event, the commit hash of the code change, the author of the change and, hopefully, the associated task.
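
To make that chain concrete, here is a small Python sketch that walks from a deployment event back to the task. The `DeployEvent` and `Commit` types, their fields and the sample identifiers are all hypothetical; in practice this data would come from your CI system, your version control host and your issue tracker.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    sha: str
    author: str
    task_id: Optional[str]  # None when the commit was never linked to a task


@dataclass(frozen=True)
class DeployEvent:
    event_id: str
    commit_sha: str  # the commit whose change created this event


def trace(event: DeployEvent, commits: Dict[str, Commit]) -> List[str]:
    """Walk backwards from a deployment event to the originating task."""
    commit = commits[event.commit_sha]
    chain = [
        f"event {event.event_id}",
        f"commit {commit.sha}",
        f"author {commit.author}",
    ]
    if commit.task_id is not None:
        chain.append(f"task {commit.task_id}")
    return chain


# Hypothetical sample data.
commits = {"3f9c2ab": Commit(sha="3f9c2ab", author="alice", task_id="OPS-42")}
bad_deploy = DeployEvent(event_id="deploy-1017", commit_sha="3f9c2ab")

print(" -> ".join(trace(bad_deploy, commits)))
```

The “hopefully” is modelled by the optional `task_id`: the chain is only as long as the records people actually kept.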

These days, deployment should rarely be a human’s job. If all the tasks are recorded, we now have a playbook to repeat at will, and configuration enables a team to code its perfect “teammate” from its shared wisdom. One that we can wake up at the touch of an event to run a job whose details and minutiae it knows in full, rather than risk humans going through a long deployment protocol. One that engineering management and the engineers themselves trust more than anyone to always get it right.