#65 What’s a Data Contract Between Friends – Setting Expectations with Data Contracts – Interview w/ Abe Gong

Provided as a free resource by DataStax AstraDB

Data Mesh Radio Patreon – get access to interviews well before they are released

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here

In this episode, Scott interviewed Abe Gong, the co-creator of Great Expectations (an open source data quality / monitoring / observability tool) and co-founder/CEO of Superconductive.

One caveat before jumping in is that Abe is passionate about the topic and has created tooling to help address it. So try to view Abe’s discussion of Great Expectations as an approach rather than a commercial for the project/product.

To start the conversation, Abe shared some of his background living the pain of unexpected upstream data changes – the data chaos and the significant work required to recover and adapt. Part of what something like data contracts should get us to is removing the need to recover at all, replacing unplanned firefighting with controlled/expected adaptation. Abe believes that the best framing for data contracts is to think about them as a set of expectations.

To define expectations here, this would include not just schema but also the content of the data, such as value ranges, types, distributions, and relationships across tables. For instance, a column may be a one-to-five ranking scale and then the application team changes it to one-to-ten. The schema may not be broken – the column is still passing whole numbers – but the new range is outside expectations, so the contract is broken.
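To make the ranking example concrete, here is a minimal plain-Python sketch of a range expectation (a hypothetical helper for illustration, not Great Expectations' actual API): the schema check still passes after the upstream change because the values are still whole numbers, but the content check catches that the range widened from one-to-five to one-to-ten.

```python
# Hypothetical sketch of a content-level "expectation" check.
# Schema validation alone (are these whole numbers?) would not
# catch the upstream change from a 1-5 scale to a 1-10 scale.

def expect_values_between(values, min_value, max_value):
    """Return (success, unexpected_values) for a simple range expectation."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return (len(unexpected) == 0, unexpected)

# Original producer behavior: rankings are 1-5.
old_batch = [1, 3, 5, 2, 4]
# After the upstream change: still whole numbers, so schema holds...
new_batch = [2, 7, 10, 4, 9]

schema_ok = all(isinstance(v, int) for v in new_batch)  # schema still passes
ok, bad = expect_values_between(new_batch, 1, 5)        # ...but the contract is broken

print(schema_ok)  # True
print(ok, bad)    # False [7, 10, 9]
```

In a real deployment this kind of check would be declared once (in Great Expectations, as a named expectation in a suite) and run against every incoming batch, so the violation surfaces before the bad data propagates downstream.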

Currently, Abe sees the best way to avoid breaking social expectations is to get consumers and producers into a meeting to discuss upcoming changes and prepare, such as with versioning. But as tooling improves, Abe sees a world where we won't even need many of those meetings going forward – either because data pipelines can be “self-healing” and automatically adapt to upstream changes, or because metadata and tools for context-sharing will reduce the need for meetings.

Abe sees two distinct use cases in general for data contracts – or, more specifically, for how people are using Great Expectations to implement data contracts. The first is purely defensive: put some validation on the data you are ingesting to prevent data that doesn't match expectations from blowing up your own work. The second is when the consuming team shares their expectations with the producers and there is a more formal agreement – or contract – with a shared set of expectations. The first often leads to the second, via an agreement conversation that happens after an upstream breaking change.

Abe also mentioned there is a third constituent in the room for data contracts: the data itself. Sometimes the consumers and producers may agree on what they expect, but if that differs from what's actually in the data, then it's hard – or dangerous – to move forward. The data has a veto.

There was an interesting discussion on the push versus pull of data contracts – should the producer team create an all-encompassing contract or should we have consumer-driven contracts? Would producer-driven contracts be too restrictive, preventing the serendipity insights data mesh aims to produce? Would consumer-driven contracts mean multiple contracts for each data product that the producer agrees to? Is that sustainable?

So, to sum it up, the idea of a set of explicit expectations around a data product that are the result of collaboration between producers and consumers sounds like where we should all head if possible. If the expectation set is only coming from the producer side, it might be overly restrictive and miss a lot of the nuance necessary to actually create consumer trust. And exclusively consumer-driven contracts don’t sound sustainable or scalable.

Abe’s Twitter: @AbeGong / https://twitter.com/AbeGong

Abe’s LinkedIn: https://www.linkedin.com/in/abe-gong-8a77034/

Great Expectations Community Page: https://greatexpectations.io/community

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used this episode created by Lesfm (intro includes slight edits by Scott Hirleman): https://pixabay.com/users/lesfm-22579021/

Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB
