#263 Panel: Applying Site Reliability Engineering Practices to Data – Led by Emily Gorcenski w/ Amy Tobey and Alex Hidalgo

Please Rate and Review us on your podcast app of choice!

Get involved with Data Mesh Understanding’s free community roundtables and introductions: https://landing.datameshunderstanding.com/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Emily’s LinkedIn: https://www.linkedin.com/in/emily-gorcenski-0a3830200/

Amy’s LinkedIn: https://www.linkedin.com/in/amytobey/

Alex’s LinkedIn: https://www.linkedin.com/in/alex-hidalgo-6823971b7/

Alex’s Book Implementing Service Level Objectives: https://www.alex-hidalgo.com/the-slo-book

In this episode, guest host Emily Gorcenski, Head of Data and AI for Thoughtworks Europe (guest of episode #72), facilitated a discussion with Amy Tobey, Senior Principal Engineer at Equinix, and Alex Hidalgo, Principal Reliability Advocate at Nobl9. As per usual, all guests were only reflecting their own views.

The topic for this panel was applying reliability engineering practices to data. This is different from engineering for data reliability, which focuses specifically on data quality.

The overall concept is taking what we’ve learned from reliability engineering across disciplines – but mostly in software, especially SRE/site reliability engineering – and bringing those learnings to data to make data – especially data production and serving – more reliable and scalable. Scott note: this is probably one of the most frustrating topics in data for me because it feels like basic foundational work, yet most organizations aren’t tackling it well yet, if at all. The best starting point for an organization is simple awareness and starting to have reliability engineering conversations around data. And you will probably feel like you’re behind after listening to this. Everyone is behind on this 😅 Even most orgs aren’t doing SRE well, so it’s no surprise data is behind too.

Scott note: I wanted to share my takeaways rather than trying to reflect the nuance of the panelists’ views individually.

Scott’s Top Takeaways:

  1. Reliability engineering is one of the majorly overlooked aspects of what we need to bring to data – whether you are talking data mesh or not. People are starting to really move on product thinking and to a lesser degree bringing microservices and CI/CD approaches to data. But reliability engineering is still a somewhat distant afterthought if at all.
  2. At the end of the day, reliability engineering comes down to observability first. If you can’t observe what’s happening with your systems, you can’t really start to identify issues and work on fixes.
  3. Observability in data CANNOT only be about the data quality. That is observing what is happening with the data versus the data applications/systems/pipelines/etc. The data itself may seem fine – no quality issues – but if the systems storing, moving, and transforming your data are degrading, you are missing the forest for the trees. A pipe that delivers only 5% of the water is still broken, even if the water that arrives passes quality checks.
  4. Relatedly and somewhat in contrast – the measures used in operational systems reliability engineering, like availability and latency, are potentially the wrong areas to focus on for reliability engineering in data. Don’t copy-paste those key metrics; think about what matters when it comes to the data applications/systems as well as the actual information you are sharing.
  5. As Amy mentioned, there is too much binary thinking about data – whether it is quality, reliability, availability, etc. The idea of criticality is often missing from data discussions – and trying to measure by who screams the loudest is a really poor approach πŸ˜… As an overall organization, we need to map what data is critical and why.
  6. The best way to determine your SLOs for data-related work is to have conversations. Just because you can, doesn’t mean you should. Not all needs are created equal. Don’t take consumers at their first word; dig in and find out what will serve their needs but limit the work to something reasonable. Remember that return on investment is what matters, not simply return. Most data work has value, but does the value justify the costs?
  7. Keep things simple. As Alex said, focus on “is this doing what my users need it to do?” Look to measure around what your systems need to do to serve customer needs. It’s very easy to get caught up measuring the wrong things.
  8. Currently, in many organizations, reliability engineering work is much more like archaeology than engineering as Emily noted. We need to focus on providing the context to those trying to ensure reliability and that takes forethought on the systems/architecture side as well as the organizational side. Observability – data or otherwise – isn’t a switch you can easily flip, it’s a practice you build and get better at over time.
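To make the SLO discussion above concrete, here is a minimal sketch of measuring a data-freshness SLI against an SLO target. The 99% target, the two-hour freshness limit, and the toy run data are all assumptions for illustration, not anything discussed on the panel – the point is that the metric should reflect what consumers actually need from the data, per takeaways 4–6.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLO: 99% of pipeline runs deliver data less than 2 hours old.
SLO_TARGET = 0.99
FRESHNESS_LIMIT = timedelta(hours=2)

def freshness_sli(run_completions, newest_record_times):
    """Fraction of runs whose delivered data was within the freshness limit."""
    good = sum(
        1 for done, newest in zip(run_completions, newest_record_times)
        if done - newest <= FRESHNESS_LIMIT
    )
    return good / len(run_completions)

# Toy data: three runs, one of which delivered stale (3-hour-old) data.
now = datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc)
completions = [now, now, now]
newest_rows = [
    now - timedelta(hours=1),
    now - timedelta(hours=3),
    now - timedelta(minutes=30),
]

sli = freshness_sli(completions, newest_rows)
print(f"SLI: {sli:.2%}, SLO met: {sli >= SLO_TARGET}")
```

The useful part of this exercise isn’t the arithmetic – it’s deciding, with your consumers, what “fresh enough” actually means before you pick the numbers.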

Other Important Takeaways (many touch on similar points from different aspects):

  1. You can make progress on reliability engineering in data just by making sure you have an incident management process. If you do, then look to improve it. You can iterate towards good, everyone starts somewhere πŸ™‚
  2. As Emily pointed out, can all your SLOs in data really be covered by SLAs? How do you have an SLA around semantic meaning, for instance? Do we need to extend the SLO concept to really make it applicable to data?
  3. Just starting the reliability concept discussions with people in data is often difficult if people don’t have operational systems/plane experience. In data, far too often there is an ‘it’s either right or wrong’ approach instead of asking how acceptable something is. Sometimes imperfect data is acceptable, and the same goes for the reliability of your systems.
  4. You can iteratively improve your observability and reliability practices. You can start to show value as you build more capability. Much like with anything, unless you have budget and full high-level support, start small.
  5. Take Alex’s simple definition of reliability: “Is this doing what it’s supposed to be doing?” Then work to find ways to observe if your systems are actually doing what they are supposed to be doing. And only focusing on where you think things could go wrong is monitoring; focus on observing – the difference between those two is a rabbit hole worth going down yourselves.
  6. Relatedly, you need to really consider expected behavior when building systems. This is of course around data quality but also setting your SLOs – what are you trying to achieve by building the data application/product?
  7. As Alex said, “SLAs really exist for lawyers and financial departments, and SLOs exist for engineers actually trying to keep things reliable.” Data contracts are often focusing on the SLAs instead of the SLOs but both are important when thinking about data. SLAs engender trust as a guarantee, SLOs are about making sure you’re actually thinking about consumer needs.
  8. ?Controversial?: Don’t start with formal SLAs – and maybe SLOs – unless your organization is mature enough to handle them. People under 18 – at least in many countries – aren’t able to enter many legal agreements for a reason. Your maturity level matters. Potentially look to informal agreements or ‘best faith effort’ as you are starting to think about SLAs and SLOs.
  9. People often think about data freshness and quality metrics but there are other important metrics like durability and availability. If you have perfect data but no one can access it or it gets lost, what value is it?
  10. Does managing cost fall under reliability engineering? Efficient usage of resources often falls on the SRE team to some degree – how do we think about wrapping in cost management? As Emily noted, many companies have their data management on a different cloud from their applications, and that creates a LOT of cost inefficiencies.
  11. Cloud economics actually prevent good data usage and practices. Because cloud providers make so much of their money on data movement – especially egress – the economics of using best-in-class tools becomes untenable, especially if those are on a separate cloud.
  12. Relatedly, much – most? – cloud data tooling seems to focus on driving usage and revenue instead of driving value in the greater picture. Instead of focusing on how to play together to create value, each is creating lock-in. That creates additional reliability and observability challenges.
  13. In data, many organizations are still at the pre-crawl phase relative to observability. Something as basic as ‘if you create an alert, have a runbook for when the alert goes off’ is missing. We need to bring maturity in – but at a reasonable pace, not try to fix overnight – so organizations can get to actually reliable systems and data.
  14. A lot of organizations still treat data work as a cost center – even in data mesh to some degree. If we can’t prove out the value of the data work in general, it will be MUCH harder to get budget for something like reliability engineering work in data. But reliability is really crucial if you want to become data driven rather than simply say you’re data driven.
  15. At the end of the day, the lack of reliability engineering practices in data comes down to two things: first, lack of awareness; second, organizational commitment. There simply needs to be focus and investment if we believe reliability engineering in data will pay off. Or maybe it’s not worth it for most organizations. Time will tell, but you can bet that the most advanced organizations that really are ‘data-driven’ aren’t skipping it.
  16. Once we tackle the above, hopefully tooling will follow but data tooling in general just isn’t in a good spot relative to reliability engineering – it’s too hard, too manual, and too expensive in general to do reliability and observability well around data systems. Scott note: this is why I’m an advisor to Masthead Data because they are doing some really cool reliability engineering stuff for data.
  17. One positive note is that industry challenges such as GDPR are driving organizations to realize they need to change. It may not be happening that quickly but the cost of keeping up with regulation in legacy systems is becoming greater by the day. We might see a bigger tipping point soon.
  18. A key point of SRE is to understand the inherent complexity and interconnectedness of our systems. There’s some understanding of that which exists in data but we generally don’t look to solve for it by learning about how things are interconnected beyond lineage.
  19. Good observability and reliability engineering practices have change as an inbuilt assumption. This is where monitoring often falls short – you have to constantly update your very explicit assumptions versus your understanding. Focus on understanding the needs of your users/customers and observe if the system is actually serving those needs appropriately.
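Takeaway 13 above – ‘if you create an alert, have a runbook for when the alert goes off’ – is simple enough to enforce mechanically. Here is a minimal sketch of such a check; the alert names and wiki URLs are hypothetical, and in practice the alert definitions would come from your monitoring config rather than a hard-coded list.

```python
# Hypothetical alert definitions; names and runbook URLs are for illustration only.
alerts = [
    {"name": "pipeline_lag_high", "runbook": "https://wiki.example.com/runbooks/pipeline-lag"},
    {"name": "schema_drift_detected", "runbook": None},
    {"name": "table_row_count_drop", "runbook": "https://wiki.example.com/runbooks/row-count"},
]

def alerts_missing_runbooks(alert_defs):
    """Return the names of alerts defined without an attached runbook."""
    return [a["name"] for a in alert_defs if not a.get("runbook")]

missing = alerts_missing_runbooks(alerts)
if missing:
    print(f"Alerts without runbooks: {missing}")
```

A check like this could run in CI against your alerting config, so the ‘pre-crawl’ gap gets caught before an alert ever fires at 3am.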

Learn more about Data Mesh Understanding: https://datameshunderstanding.com/about

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/


All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
