#72 Reliability in Data Mesh: Why SLAs and SLOs are Crucial – Interview w/ Emily Gorcenski

Data Mesh Radio Patreon – get access to interviews well before they are released

Episode list and link to all available episode transcripts (most interviews from #32 on) here

Provided as a free resource by DataStax AstraDB

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here

In this episode, Scott interviewed Emily Gorcenski, Head of Data and AI at Thoughtworks Germany. Emily has put out some great content relative to data mesh.

As a data scientist by training, Emily has a data consumer bent in her views on data mesh. She is therefore often focused on how data mesh can help "me" (her) as a data consumer.

SLAs and SLOs come right out of the site reliability engineering playbook from Google. Overall, systems reliability engineering practices are crucial – Emily asked why we don't bring the rigor of other engineering disciplines to software engineering.

So, what is an SLA and an SLO? Per Emily, an SLA is a contract between two parties – hence why agreement is in the name. This agreement should be written around an SLO, with the SLO serving as a specific target. That can be uptime or latency in the microservices realm, but with data, SLOs can get a little – or a lot – trickier.
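To make the distinction concrete, here is a minimal sketch of an SLO as a measurable target that an SLA could be written around. The function name, latency values, and thresholds are all illustrative, not from the episode.

```python
# Illustrative sketch: an SLO is a measurable target (e.g. "99% of
# requests complete within 200ms"); an SLA is the agreement written
# around one or more such targets.

def meets_latency_slo(latencies_ms, target_ms=200.0, percentile=0.99):
    """True if at least `percentile` of requests finish within target_ms."""
    if not latencies_ms:
        return True  # no traffic observed, so no violation
    within = sum(1 for t in latencies_ms if t <= target_ms)
    return within / len(latencies_ms) >= percentile

# One slow request out of four blows a 99% target:
print(meets_latency_slo([120, 150, 180, 900]))  # False
```

The SLA layer on top of this would then specify what the two parties agreed happens when the target is missed.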

The theory around developing an SLO is for it to directly connect to business value. Emily believes that when we think about SLOs and data, we shouldn’t apply SLOs directly to the data but should shift those SLOs to the left and have SLOs in the software engineering practice that apply to data.

Emily mentioned another antipattern for SLAs in general, which is not connecting them to SLOs. But when it comes to data, most teams don't even have SLAs, connected to an SLO or not. As an industry, software engineering has figured out how to offer great SLAs to external parties, but many organizations still struggle to offer good SLAs internally.

For Emily, software-focused SLAs can even result in worse outcomes for data. If an SLA is about uptime, it might result in pushing bad data into a system so a service can maintain its SLA.

When developing SLAs, Emily recommends starting with conversations and negotiations between both parties. If 5 9s of uptime is not valuable to your consumers, why build to ensure 5 9s? Dig into actual user needs and what will actually drive user value. And start to differentiate between infrastructure-focused SLAs – like is the data product available – and data SLAs – like is the data updated and does it meet quality thresholds.
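One way to picture that infrastructure-versus-data split is as two separate checks. This is a sketch with hypothetical thresholds (24-hour freshness, 1% null tolerance, 99.9% availability) – the actual targets are exactly what Emily says should come out of negotiation with consumers.

```python
from datetime import datetime, timedelta, timezone

def infra_slo_ok(uptime_ratio, target=0.999):
    """Infrastructure SLO: is the data product endpoint available?"""
    return uptime_ratio >= target

def data_slo_ok(last_updated, null_ratio,
                max_staleness=timedelta(hours=24), max_nulls=0.01):
    """Data SLOs: is the data fresh, and does it meet a quality threshold?"""
    fresh = datetime.now(timezone.utc) - last_updated <= max_staleness
    clean = null_ratio <= max_nulls
    return fresh and clean

# A data product can be fully "up" while serving stale or dirty data:
two_days_ago = datetime.now(timezone.utc) - timedelta(days=2)
print(infra_slo_ok(0.9995), data_slo_ok(two_days_ago, 0.0))  # True False
```

Keeping the two kinds of check separate is what lets a team see that the uptime SLA is being met while the data SLA is being violated – the failure mode Emily warns about.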

Emily then started to talk about some of the fun, very specific SLAs around data and what data availability actually means. These SLAs can get complicated, but they can start to really drive towards what is actually valued by the consumers – the actual value of the data – so you can then negotiate towards a high return on investment. Again, we can avoid pre-optimizing for facets that consumers don't care about.

Per Emily, good SLOs will tell you what you should improve. We should make sure our SLOs are decomposable so we can, again, get quite specific when useful and/or necessary. This is much more difficult to do in data than in general software engineering – we can't think about data in a binary way, such as accurate or not; it is much more of a continuous spectrum. Emily recommends looking at the error budget concept and thinking about how we can apply it to data.

Emily believes SLOs can help you to avoid building unnecessary complexity – if your users don’t need real time results, don’t build a real-time system. It’s the conversations and negotiations that take you from the state of what’s possible to what’s valuable. We should use SLOs to align closely to the use case – there is definitely such a thing as good enough. And don’t create Franken-Data-Products – monstrosities that try to solve every single need. It’s fine to have two similar data products to serve two distinct needs.

For Emily, data consumers keep complaining to a centralized data engineering team. People on that centralized team are the unfortunate middle-people with little power to change what consumers are getting. We should use SLOs and move responsibility for them to the software development teams – the domains – much like we do with data ownership in data mesh.

Once an organization learns to do SLOs well, Emily recommends extending that practice to SLOs around the data platform itself. But be careful not to conflate the infrastructure SLOs and SLAs with the data product ones, as mentioned earlier.

Emily believes the governance team also has a responsibility to drive standardization around SLOs. This includes sensible defaults.

What should we learn in the data space from DevOps? For Emily, the philosophy of resilience is crucial. Repeatability and safety through continuous integration / continuous delivery – or CI/CD – is a major driver of value in software engineering. How can we apply it to data?

In data, we all too often use a systems-oriented approach, so we don't properly attribute value, per Emily. How can we measure the value of being able to do ad hoc analysis? Not the value of the analysis itself but almost the inverse of opportunity cost – what is the opportunity value? If we remove the abstractions, can we get to a specific value measurement?

Emily believes we need to get much more serious about creating good data about our data practices. It takes a fair bit of effort to get to a place where we can repeatedly get good, usable data on our data initiatives at scale. We also need to give people more slack in their work time to chase down additional information. Serendipity can only strike if people have the room to create it and then react to it.

Emily wrapped up her thoughts on a few points. First, the pace of change of business has accelerated significantly, and it requires us to philosophically reorient how we think about data. There needs to be more space for people to do the new, necessary work that drives high incremental value. But because everyone is so overloaded already, that isn't happening in most organizations. And second, start from the consumers and their needs and work backwards. It's okay to not create every piece of potentially useful data in a usable fashion upfront. Figure out the needs you know about and build towards those – additional use cases will emerge.

Emily’s LinkedIn: https://www.linkedin.com/in/emily-gorcenski-0a3830200/

Emily’s Twitter: @EmilyGorcenski / https://twitter.com/EmilyGorcenski

Emily’s Polywork profile: https://www.polywork.com/emilygorcenski

Emily’s website: https://www.emilygorcenski.com/

Alex Hidalgo’s Implementing Service Level Objectives book as mentioned: https://www.alex-hidalgo.com/the-slo-book

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, and/or nevesf

Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB
