#238 Bringing Software Testing Best Practices to Data – Interview w/ Sofia Tania

Sign up for Data Mesh Understanding’s free roundtable and introduction programs here: https://landing.datameshunderstanding.com/

Please Rate and Review us on your podcast app of choice!

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Tania’s LinkedIn: https://www.linkedin.com/in/sofia-tania/

Presentation: “Data Mesh testing: An opinionated view of what good looks like”: https://www.youtube.com/watch?v=stNZQESndAA

In this episode, Scott interviewed Sofia Tania (she goes by Tania), Tech Principal at Thoughtworks. To be clear, she was only representing her own views on the episode. Scott asked her to be on especially because of a presentation she did on applying testing – especially important for data contracts – in data mesh.

Scott note: I was apparently getting extremely sick throughout this call so if I ramble a bit, I apologize. Tania’s dog also _really_ wanted to be part of the conversation so you might hear us both chuckling a bit about her antics. And Tania has some really great insights so I probably asked her the hardest questions of any guest to date. She did a great job answering them though! A lot of the takeaways are about whether we are actually ready to do a lot of the necessary testing to ensure quality around data, which I don’t think has a clear answer yet :)

Some key takeaways/thoughts from Tania’s point of view:

  1. We have to bring software best practices to data but we should do it smartly and not make the same mistakes we made in software; let’s start from a leveled up position. Zhamak has said the same. The question becomes how, but looking at how practices evolved in software should bring us a lot of learnings.
  2. Just pushing ownership of data to the domains won’t suddenly solve data quality challenges. The new owners – the domains – have to really understand what ownership means and what quality means for use cases leveraging their data.
  3. A reasonably good way to measure if your data product is ‘good enough’ regarding data quality is to look at your SLOs (service level objectives) and SLIs (service level indicators). If your SLIs are consistently hitting your SLOs, you can probably focus more on new features. If not, you need to improve your quality.
  4. !Controversial!: Consider almost a zero-trust approach to testing for data. Test as data is flowing through the systems, as data lands. Test in development as to what changes might impact data. And then consumers should be writing tests against source data to prevent issues. Scott note: that’s a lot of tests but how important is certain data to your org?
  5. In a decentralized ownership model, many data consumers are less likely to trust data – at least at first – so you need to show them why they can trust at the data product level. Leveraging proper testing and data contract strategies is crucial to being able to prove out data quality.
  6. You should look to build out a robust testing and observability framework as part of your platform. Data product owners/domains shouldn’t have to build it out manually themselves.
  7. If you only have detection of data quality issues once data hits production, once something or someone is potentially already using it, that’s an issue. Look to create ways to test data at the data product development stage, as part of the CI/CD. We can’t rely only on lagging quality indicators if we want to up our data quality game.
  8. Data for analytics and AI is even more complicated than on the operational plane. Generally, data hasn’t been transformed multiple times on the operational plane so if there is an issue, it’s either in the source application or an issue with the API call that was made. In data, we have to develop smarter tests as data flows through your pipelines.
  9. Data producers need to define quality data in terms of what consumers actually want/need. Instead of arbitrarily setting quality levels, what do the consumers want?
  10. Consumer-driven testing in data sounds wonderful. But it’s hard to see teams being willing to do it :) We need better tooling and ways of working to make this easier.
  11. Data quality surveys of data consumers are important for a number of reasons but are lagging indicators. They should be used to help develop appropriate SLAs/SLIs for data products and monitor if data products are generally meeting customer needs.
  12. ?Controversial?: Can a data producer really develop a custom test for their data product for each consumer, or does the consumer owe it to the producer to develop tests to ensure the data product continues to serve their use case well? Scott note: this could start a LinkedIn war but it’s an important question to ask!
  13. If you push for consumer driven testing, don’t be surprised at a lot of pushback. That happens still even in the API world where it’s been more accepted for years :)
  14. Are consumers ready and able to programmatically define what good data quality means for them for each use case? There are some tools that can help but practices and tooling are still mostly nascent.
  15. ?Controversial?: Many consumers still have the ‘give me all you have and I’ll sort through it’ mindset. Trying to get them to lock into what they are consuming will be hard.
  16. There can be a real chicken and egg scenario around data products, especially testing. Consumers don’t know precisely what they want and what will best suit them until they see the data/data product. But building out a data product and having to change it a lot to customer feedback is also tough – producers want to build it once instead of 10 iterations. Just be prepared for this to be an ongoing issue in data and for it to lengthen times to data product release.
  17. ?Controversial?: Having your transformations handled by low-code/no-code solutions can easily hurt you more than it will help you. Be very wary. Scott note: this is coming up in A LOT of conversations recently and was featured in the Thoughtworks radar released in early May
  18. In software development – including development for data – abstractions are crucial but can get you in trouble. Really think deeply about your abstractions because it’s easy to lose sight of what underlies the abstraction. And abstractions of abstractions of abstractions just compound the issue :)

Tania started with a bit of her background, especially related to data mesh. She worked on one of the clients that was an inspiration for Zhamak’s original data mesh blog post and spent 2+ years as the lead on the technical side of a data mesh implementation at another client. Her background as a developer and tech generalist has shaped her thoughts around bringing good software practices to data.

For Tania, the reason she originally put together the presentation on software testing practices in data was a client question (paraphrased): ‘if we already have data quality issues in the centralized setup with clear ownership and people who really know data, how the heck are we going to _improve_ data quality by pushing data ownership into the domains?’ It’s a very fair question – just pushing ownership without the capability and buy-in to own the data is possibly (likely?) going to lead to worse quality. So we need tests that work and can be shown to consumers to help ensure quality and trust in that quality. Showing people your kind of data quality certification goes a long way towards trust.

In Tania’s view, much of the existing data observability tooling and practices, while valuable, only really alert when there is already a problem that’s hit production. Is there a way we can shift testing left, not just in ownership but testing earlier in the flow of data? Earlier in the development timeline of a data product? So that is 3 potential ways to shift left, to test earlier. Think about detect versus protect – can we prevent data quality issues instead of only identifying and resolving them better?
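One of those three shifts, testing earlier in the development timeline, might look something like the sketch below: a transformation is exercised against a tiny hand-written fixture as part of CI, so a breaking change fails the build instead of surfacing in production. The function `normalize_orders`, the column names, and the fixture are all hypothetical and invented for illustration; nothing here comes from the episode or from a specific tool.

```python
from decimal import Decimal


def normalize_orders(rows):
    """Hypothetical transformation: drop keyless rows, convert cents to a decimal amount."""
    out = []
    for row in rows:
        if row.get("order_id") is None:
            continue  # rows without a key cannot be joined downstream
        out.append({
            "order_id": str(row["order_id"]),
            "amount": Decimal(row["amount_cents"]) / Decimal(100),
            "currency": row.get("currency", "USD").upper(),
        })
    return out


def check_normalize_orders():
    """Development-time test: runs in CI against a fixture, not against production data."""
    fixture = [
        {"order_id": 1, "amount_cents": 1999, "currency": "usd"},
        {"order_id": None, "amount_cents": 500},   # should be dropped
        {"order_id": 2, "amount_cents": 0},        # currency defaults to USD
    ]
    result = normalize_orders(fixture)

    # Protect, not just detect: these fail the build before any consumer sees the data.
    assert [r["order_id"] for r in result] == ["1", "2"]
    assert result[0]["amount"] == Decimal("19.99")
    assert all(r["currency"] == "USD" for r in result)
    assert len({r["order_id"] for r in result}) == len(result)  # keys are unique
    return result


check_normalize_orders()
```

The same assertions could also run against data as it lands, which would cover the "earlier in the flow of data" shift; the point of the sketch is only that the check exists before release.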

Tania talked about how data product producers need to start to shift their thinking around data quality. What specifically do my consumers want – and why? Quality is inherently subjective so extract from them what their needs are and look to serve those. And we should look to stop using _only_ lagging quality indicators like surveys. They are valuable for reshaping what SLAs should be and for checking whether a data product is meeting needs and expectations, but they are certainly not designed to quickly detect issues. But do consumers actually know what would make data ‘high quality’ for them?

Consumer-driven data quality testing is a good idea for many reasons in Tania’s view. When we think about a single data product having 5 known, regular data consumers, does the data producer need to develop 5 different sets of tests to specifically protect against breaking changes or issues specific to each use case? Do they have to define quality metrics differently for each use case of the same data product? Do they have to be so familiar with each use case that they evolve their tests as use cases evolve? How much can we reasonably ask the data product consumers to do in the testing space to ensure quality?

But Tania admitted she hasn’t led a client in doing consumer-driven testing for data. It’s really hard to get data testing right in general, so are people really ready for doing consumer-driven data testing? We don’t really have the tooling or the general best practices to do it well yet. And there is also just philosophical pushback – being forced to programmatically say what good quality means instead of saying ‘the data quality isn’t good enough’ is a tough pill to swallow for consumers. Do consumers really know precisely what they want? Tools like Great Expectations or Soda Core are a good start here but we need more. And many consumers are still in the ‘give me all the data you have’ kind of mindset so reducing the possible scope of data they get is not an easy mindset shift.
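To make "programmatically say what good quality means" a bit more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations or Soda Core APIs; the consumer names, columns, and helper functions (`not_null`, `values_in`, `run_consumer_checks`) are assumptions made up for illustration. Each consumer declares only the expectations its own use case depends on, and the producer runs all of them against a candidate batch before release.

```python
def not_null(column):
    """Build a check that every row has a non-null value in `column`."""
    def check(rows):
        return all(row.get(column) is not None for row in rows)
    check.__name__ = f"not_null({column})"
    return check


def values_in(column, allowed):
    """Build a check that every value in `column` is one of `allowed`."""
    def check(rows):
        return all(row.get(column) in allowed for row in rows)
    check.__name__ = f"values_in({column})"
    return check


# Hypothetical consumers: each one publishes only what its use case needs.
CONSUMER_EXPECTATIONS = {
    "finance_reporting": [not_null("amount"), values_in("currency", {"USD", "EUR"})],
    "marketing_dashboard": [not_null("customer_id")],
}


def run_consumer_checks(rows, expectations=CONSUMER_EXPECTATIONS):
    """Return {consumer: [names of failed checks]} for a candidate batch."""
    failures = {}
    for consumer, checks in expectations.items():
        failed = [c.__name__ for c in checks if not c(rows)]
        if failed:
            failures[consumer] = failed
    return failures


candidate = [
    {"customer_id": "c1", "amount": 10.0, "currency": "USD"},
    {"customer_id": None, "amount": 5.0, "currency": "GBP"},
]
failures = run_consumer_checks(candidate)
print(failures)
```

One design consequence worth noting: because the expectations are keyed by consumer, the producer can see exactly whose use case a change would break, rather than maintaining five private sets of tests on the consumers' behalf.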

Tania also pointed to a persistent challenge in data that is a chicken and egg problem: data producers can’t build exactly what consumers want until they get feedback from the consumers. But the consumers don’t know exactly what they want until they’ve seen an early iteration of the data product. So the time from conception to release lengthens because both sides need more from the other to move forward but can’t until they get at least some information. A good way to press on consumers might be to ask them about bad-case scenarios – what has to be there and why? That will _possibly_ prevent kitchen sink feature requests.

As the conversation transitioned into low-code/no-code tooling, Tania lamented the difference between ease of use and simplicity. While low-code/no-code tools can be very easy to use at the start, as scale/complexity of use cases increases, they often become extremely difficult to manage. They are focused on ease of use; their architecture isn’t about keeping the solution simple to manage as it scales. As you add more and more views, you might actually have 30-40 joins across many data products and performance comes to a halt. This was also mentioned in the Thoughtworks Radar that was released in early May 2023 (Tania contributed to that).

In wrapping up, Tania shared what she believes is a good way to measure if you are doing well enough with your data product, especially with regard to data quality. Look at your SLOs (service level objectives) and SLIs (service level indicators) – are you hitting those regularly? Then maybe you can focus more on new feature development. But if not, you might need more/better monitoring/observability.
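As a rough illustration of that check, the sketch below measures two SLIs for a batch and compares them to SLOs. The specific indicators (completeness and freshness), the thresholds, and the field names are assumptions picked for the example, not values discussed in the episode.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical objectives (SLOs) for one data product.
SLOS = {
    "completeness": 0.99,     # minimum share of rows with every required field present
    "freshness_hours": 24.0,  # maximum age of the newest record
}


def measure_slis(rows, required, now):
    """Compute the indicators (SLIs) for a non-empty batch of rows."""
    complete = sum(all(r.get(c) is not None for c in required) for r in rows)
    newest = max(r["loaded_at"] for r in rows)
    return {
        "completeness": complete / len(rows),
        "freshness_hours": (now - newest).total_seconds() / 3600,
    }


def slo_report(slis, slos=SLOS):
    """Return True per objective when the measured SLI meets the SLO."""
    return {
        "completeness": slis["completeness"] >= slos["completeness"],
        "freshness_hours": slis["freshness_hours"] <= slos["freshness_hours"],
    }


now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "amount": 5, "loaded_at": now - timedelta(hours=2)},
    {"id": 2, "amount": None, "loaded_at": now - timedelta(hours=30)},
]
slis = measure_slis(rows, ["id", "amount"], now)
report = slo_report(slis)
print(slis, report)
```

In this made-up batch the freshness objective is met but completeness is not, which by the rule of thumb above would point toward quality work before new features.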

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
