Data Mesh Radio Patreon – get access to interviews well before they are released
Episode list and links to all available episode transcripts (most interviews from #32 on) here
Provided as a free resource by DataStax AstraDB; George Trujillo’s contact info: email (email@example.com) and LinkedIn
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Ananth’s LinkedIn: https://www.linkedin.com/in/ananthdurai/
Data Engineering Weekly newsletter: https://www.dataengineeringweekly.com/
In this episode, Scott interviewed Ananth Packkildurai, author of Data Engineering Weekly and the creator of Schemata.
Scott note: we discuss Schemata quite a bit in this episode; it’s an open source offering that I think can fill in some of the major gaps in our tooling and even in our ways of working collaboratively around data.
Some key takeaways/thoughts from Ananth’s point of view:
- !Important!: Collaboration around data is crucial. The best way to get people bought in on collaboration around data is to integrate into their workflow, not to create yet another one-off tool in yet another pane of glass.
- ?Controversial?: There is so much friction between initial data producers – the domain developers – and data consumers because they are constantly speaking past each other. The data consumers have to learn too much about the domain and the data producers rarely really understand the context of most analytical asks.
- Data creation is a human-in-the-loop problem. Autonomous data creation is not likely to create significant value because the systems can’t understand the context well enough right now.
- As Zhamak has also pointed out, there is far too much tool fragmentation. It made sense when VC money was readily available and we were still figuring out how to approach things in the cloud, but now we need holistic approaches, not spot approaches, to things like data quality, observability, lineage, etc.
- !Open Source Product!: Schemata was created to enforce certain rules in a cooperative platform around data sharing specific to data schemas to help alleviate much of the above friction.
- The data world needs to take a lot of learning from the platform engineering for microservices space. Those platforms make it easy for teams to test new services or changes, deploy, etc. In data, we are asking domains to own their data without giving them the tooling to easily do so. It’s too much of an ask. Scott note: PREACH!
- ?Controversial?: In general, we need better ways to share what data is already available and what data we expect. This is where data contracts as a platform, instead of as tooling only, become important.
- !Scott Controversial Note!: Many get data contracts woefully wrong. Data contracts aren’t about _only_ the contract. They signify a relationship that has contractual terms. Think about a vendor – is your only interaction, communication, or set of expectations the contract itself, or is the contract just one part of a relationship with strong guarantees?
- Consumer-driven data contract testing is important. It is defensive in a way – if my upstream changes, I don’t necessarily want to consume from it. But consumer-driven testing can also be a great part of the conversation around how consumers are actually using the data – it’s a programmatic way to describe usage to producers. Scott note: if we can make consumer-driven testing easy, great. But we need to reduce the burden on producer and consumer to ensure data contract compliance.
- We need to be able to take consumer requests and translate those to producers, as well as give producers guidance about effective – cost or otherwise – ways of meeting those requests. E.g. including a user ID might seem easy to a consumer but it could be very expensive for producers given their standard way of retrieving user IDs. How can we put prescriptive guidance in front of data producers to make it easy to meet requests?
- ?Controversial?: Pull requests > requirements gathering. A consumer can show exactly what they want and a producer can either approve or deny but it generates a better conversation.
- Teams need to figure out coordination and communication in a decentralized data modeling world. That is the federated aspect of data mesh – if it’s all decentralized, nothing works with each other and you end up with garbage data at the organization level despite domains having well modeled data for themselves.
- !Controversial!: Schemata believes there needs to be a core data domain for data that links most other domains. Scott note: While not rare in data mesh, a core domain can become a bottleneck and may not give you the flexibility required. Adevinta (episode #40) discusses leveraging a core domain in depth.
- It’s very, very valuable to provide automated feedback via your platform to people considering creating a new data asset – whether that will become a data product or not – including how well it fits in the organization’s data landscape. Are you creating something that can actually be leveraged for other use cases? Does it integrate well with existing data assets/products?
- It’s crucial to have something that gets all the parameters of a data contract on paper. Think about negotiating an agreement with a vendor – is it all just verbal or are you starting from something concrete and working from there? Have your platform provide the basic parameters that people can adjust.
- Far too often, the first conversation a data producer has with their consumer is once something breaks for the consumer. These silent or stealth data consumers create expectations without ever telling the producer. That causes many, many issues.
- Schemas should be immutable – the only way to change a schema is by creating a new version.
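The immutability takeaway above can be sketched as a toy registry that refuses in-place edits and only accepts new versions. All names here are hypothetical for illustration; this is not Schemata’s actual API.

```python
# Minimal sketch of an immutable schema registry: an existing version can
# never be overwritten; any change must be published as a new version.
class SchemaRegistry:
    def __init__(self):
        # (subject, version) -> schema definition
        self._schemas = {}

    def publish(self, subject, version, definition):
        key = (subject, version)
        if key in self._schemas:
            raise ValueError(
                f"{subject} v{version} is immutable; publish a new version instead"
            )
        self._schemas[key] = definition

    def get(self, subject, version):
        return self._schemas[(subject, version)]


registry = SchemaRegistry()
registry.publish("orders", 1, {"order_id": "string", "amount": "float"})
# Changing v1 in place would raise ValueError -- the producer publishes v2.
registry.publish(
    "orders", 2,
    {"order_id": "string", "amount": "float", "currency": "string"},
)
```

Because old versions never change, downstream consumers can keep reading v1 while they migrate to v2 on their own schedule.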
Ananth started by sharing a bit about his background. Despite writing the Data Engineering Weekly newsletter, he sees his experience as somewhat between a data engineer and a data analyst. That gave him the ability to see the full end-to-end journey of how data was handled at many different organizations. He consistently saw that analytical data outside of the application scope was an afterthought because developers were focused singularly on their application, not how it fit into the greater scheme, especially on the analytics side.
For Ananth, the data marketplace is a useful concept for many organizations when thinking about data contracts. It might be a bit more of a data bazaar than like Amazon in certain ways as there can be a bit of collaborative negotiation – ‘oh, you have XYZ to offer, what about ABC, could you do that?’ We need standardized ways to discuss/document data to make it far easier to share data, or at least start the conversation off from an informed standpoint when collaborating to get the most useful data created and shared. We need programmatic ways for producers to share what data they have available including their expectations like SLAs and consumers to request data they want with their expectations.
Scott note: It’s crucial to understand that data contracts are less about the actual contractual terms and more about the establishment of a relationship that is covered through the contract terms. There are expectations but the contract isn’t the entire relationship between the data producer and the data consumer. Essentially, the relationship includes the contract but just having SLAs will not resolve many of the issues people have around data contracts/sharing.
Similar to something Chris Riccomini mentioned in episode #51, Schemata is looking to provide feedback to producers about what broke downstream when they made a change – or, more valuably, what will break before a commit is deployed. Data producers haven’t had much of this feedback historically – e.g. “if you make this change, it will break your data contract expectations on the schema front because of…”. But Schemata is also designed for producers to see how well what they are offering fits with what other domains offer – how well does my domain, or a potential new data product, integrate into the overall organizational data sharing landscape?
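A minimal version of that pre-commit feedback could look like the sketch below (illustrative only, not Schemata’s real interface): diff the old and new schemas and report anything backward-incompatible before the change ships.

```python
def breaking_changes(old_schema, new_schema):
    """Report why new_schema would break consumers of old_schema.

    Removing a field or changing its type breaks backward compatibility;
    adding a new field does not.
    """
    reasons = []
    for field, ftype in old_schema.items():
        if field not in new_schema:
            reasons.append(f"field '{field}' was removed")
        elif new_schema[field] != ftype:
            reasons.append(
                f"field '{field}' changed type {ftype} -> {new_schema[field]}"
            )
    return reasons


old = {"user_id": "string", "amount": "float"}
new = {"user_id": "string", "amount": "int", "note": "string"}
for reason in breaking_changes(old, new):
    print(f"if you make this change, it will break your data contract: {reason}")
```

Wired into CI on the producer’s repo, a non-empty result becomes exactly the kind of feedback Ananth describes, delivered before the commit is deployed rather than after a consumer breaks.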
On consumer-driven testing in data contracts/agreements, Ananth thinks there are two aspects: structural and behavioral. Structural is what you’d expect and what most people discuss in data contracts – mainly schema validation, is it backward compatible, is it strongly typed, is the required metadata complete, is there a registered owner, are the SLAs defined and complete, etc. The behavioral is similar to what Abe Gong talked about in episode 65 about what are the expectations, does the data behave the way people expect such that it can actually be leveraged for their use case. A key, widespread reason we need consumer-driven testing is producers rarely really understand how data consumers will use their data or are using their data already. Thus, that behavioral testing can inform the producer – along with actual human to human conversations – about how consumers will be/are leveraging data.
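The structural/behavioral split can be sketched as two check functions. The contract shape, field names, and rules below are all hypothetical examples, not a real Schemata or Great Expectations API.

```python
def structural_check(contract, schema):
    """Structural: schema shape, ownership, and SLA metadata are complete."""
    problems = []
    for field in contract["required_fields"]:
        if field not in schema["fields"]:
            problems.append(f"missing required field '{field}'")
    if not schema.get("owner"):
        problems.append("no registered owner")
    if "sla_hours" not in schema:
        problems.append("SLA not defined")
    return problems


def behavioral_check(contract, rows):
    """Behavioral: does the data behave the way this consumer expects?"""
    problems = []
    for row in rows:
        for field, rule in contract["expectations"].items():
            if not rule(row.get(field)):
                problems.append(f"row {row} violates expectation on '{field}'")
    return problems


contract = {
    "required_fields": ["order_id", "amount"],
    # consumer expectation: amounts are present and positive
    "expectations": {"amount": lambda v: v is not None and v > 0},
}
schema = {
    "fields": {"order_id": "string", "amount": "float"},
    "owner": "orders-team",
    "sla_hours": 24,
}
rows = [{"order_id": "o1", "amount": 10.0}, {"order_id": "o2", "amount": -5.0}]
```

The behavioral rules double as the programmatic usage description mentioned above: a producer reading this consumer’s expectations learns that someone downstream depends on `amount` being positive.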
One general issue many teams have, according to Ananth, is the consumer doesn’t really understand the cost or complexity of doing something around data creation. E.g. the producer of one domain might not store the user ID, so getting every user ID is an expensive database call. A consumer creating a pull request, instead of a demand/request for data, means you can start from a deeper conversation about what the data will be used for and why it’s structured the way the PR proposes. This also fits the domain developer’s existing workflow of using git. It’s all far less vague, even if the initial proposal is infeasible – the producer has far more information about how the data might be used and can start iterating towards a workable solution.
According to Ananth, many people looking at Schemata have seen the need for years but there hasn’t been a great way to implement what Scott calls “making the implicit explicit” around data sharing/data contracts. And this isn’t a typical problem at a small company but once you get to a certain scale, the need for decentralized data modeling starts to become very evident. But with decentralized data modeling, it’s pretty easy to put yourself in a bad spot because there is no collaboration layer so you create data silos / things that just don’t interoperate well. Much like thinking federated governance versus decentralized governance in data mesh.
Schemata has a concept of a core domain: for every incremental entity or event you model, it automatically assesses how well the new event or entity is connected to that core domain. The theory is to quickly figure out how well what you are building will connect into the greater whole of the organization through the core domain. It gives you quick feedback on what is in process, and a producer can easily add more fields to better match the core domain if they want. It isn’t a blocker; it’s feedback for someone creating a pull request – data producer or consumer – about how well the resulting data model would fit in the organizational data landscape.
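A crude version of that connectivity feedback might look like the sketch below. The core-domain key set is invented for illustration; Schemata’s real assessment (built around schema definitions and their references) is more sophisticated than counting shared identifiers.

```python
# Hypothetical core-domain identifiers for an example organization.
CORE_DOMAIN_KEYS = {"user_id", "account_id", "order_id"}


def connectivity_score(proposed_fields):
    """Fraction of core-domain keys the proposed entity links to (0..1).

    Meant to be surfaced as feedback on a pull request, not as a blocker.
    """
    linked = CORE_DOMAIN_KEYS & set(proposed_fields)
    return len(linked) / len(CORE_DOMAIN_KEYS)


# A proposed click event carrying user_id and order_id links to 2 of the
# 3 core entities; adding account_id would improve the fit.
print(connectivity_score(["user_id", "order_id", "clicked_at"]))
```

Even a score this simple gives the PR author something concrete: which core entities their new data would join against, and which it currently cannot.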
Ananth discussed how data creation is really a human-in-the-loop challenge – autonomous data creation is just not very valuable now and might never be. We need a collaborative platform to create data that is truly valuable and understandable but especially usable. The crucial aspect is to make a tool that integrates into people’s workflows instead of yet another screen that further fractures the data management experience. Schemata is trying to be like Snyk – automatically scanning and giving people actionable advice but with little effort on their part. Where are your likely pain points? How could you address them? You can more easily set a goal of remediation/improvement and figure out how well you are doing. What are the top 2-3 things you could focus on to make the data you share that much better/more valuable?
A big thing many overlook in creating data contracts is defining the value and/or cost of something happening, according to Ananth. It’s about getting people to the table to discuss something concrete and make sure everyone is on the same page. Instead of requirements gathering, it’s a collaborative discussion. Alla Hale, in her episode #122, talked about how in every conversation you should have something to show the other party, whether a full prototype or a post-it note with a little drawing. So getting to a clear contract/agreement is far easier if you have a system that defines an owner, defines the parameters you need, makes sure the implicit aspects are explicit so both parties can fully agree, etc.
One thing Ananth – and Scott – keep running across is stealth data consumers creating one-sided data contracts. Essentially, the consumer has created their consumer-side testing and is consuming, but the data producer has no idea they are consuming their data. Or many don’t even do the testing/contract model to protect themselves at all. The first time the producer hears about their consumption is when something breaks for the consumer. With Schemata, at least there is a contract in place and stealth data consumers have to inherit existing contractual bounds. Scott note: I hate stealth anything in data – let the producer know or they will potentially make breaking changes that could have been prevented if they were just aware.
According to Ananth, we can really learn a LOT from the DevOps movement that has become more the platform engineering movement on the microservices side. If we try to push ownership to domains/data producers without the tooling to help them verify they comply with governance and that things are working okay, that’s a lot of extra work on the data producer end. It’s why we are seeing so damn much pushback from domains about not wanting to own their data – it’s just way too much of an ask. Data producers just don’t have enough information about what might be an issue when they try to make a change and it causes unnecessary friction. We need to make both the producer and consumer more productive, so that people can develop and deploy without tons of manual intervention.
Far too many teams are using tooling to solve single problems, and while each one-off tool helps address a singular issue, it creates an even more disjointed data management workflow in Ananth’s view. It’s easy to focus too much on the spot challenge instead of the overall challenge in data management – the holistic process. Tooling fragmented with the move to cloud, and that made sense as we figured out new approaches and patterns – and VCs were quite free with their money – but we need to think about the whole process as one again now. Zhamak has mentioned this multiple times as well.
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB