#43 Applying Resilience Engineering Practices to Scale Data Sharing – Interview w/ Tim Tischler

Provided as a free resource by DataStax AstraDB


In this episode, Scott interviewed Tim Tischler, Principal Engineer at Wayfair. Prior to Wayfair, Tim worked as a Site Reliability Champion at New Relic and is well known in the “human factors” and resilience engineering space.

Per Tim, our current work culture is overly action-item driven – every meeting must generate a set of action items. This prevents people from having learning-focused meetings designed exclusively for context sharing. Our brains work differently in learning mode versus fixing mode, and we ask totally different questions in each. To scale our knowledge sharing, we need to make space for learning-focused meetings.

A good way to center learning-focused meetings, be they "show and tell" or event storming sessions, is via sharing stories – human communication has been founded on story sharing for millennia. Tim's "show and tell" and event storming sessions at Wayfair have received extremely positive reviews so far.

Tim sees ticket-based interactions – just throwing requirements onto someone's Jira backlog or similar – as fundamentally flawed. If Team A hands Team B requirements via a ticket, Team B just looks to close the ticket, versus getting both sides in a room to exchange context and negotiate. Tim prefers two modes of interaction over ticket systems: #1 – no-touch, fully automated interactions, e.g. via an API; and #2 – high-touch, high-context-sharing interactions.

For resilience engineering specifically, you should apply learnings both to each data product AND to the mesh as a whole. Part of that is broadly accepting that you are in a highly dynamic, constantly changing org – there will be changes! A few resilience engineering anti-patterns that apply to data mesh are: 1) a hub-and-spoke relationship model where one person is the key glue – this is bad at a human level and even worse at a technical level :); 2) business leaders pushing for metrics without sharing the specific context behind them – the metrics you end up tracking are completely empty and useless; and 3) not embedding the people building platforms into the teams they are building the platforms for – they must really understand those teams' workflows.

Books/posts/papers mentioned:

Blameless PostMortems and a Just Culture by John Allspaw – Link

The Theory of Graceful Extensibility: Basic rules that govern adaptive systems by David D Woods – Link

The Field Guide to Understanding ‘Human Error’ by Sidney Dekker – Link

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used this episode created by Lesfm (intro includes slight edits by Scott Hirleman): https://pixabay.com/users/lesfm-22579021/

Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB
