Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Chris’ LinkedIn: https://www.linkedin.com/in/charles-dove-b4715723/
In this episode, Scott interviewed Chris (Charles) Dove, Data Architect at Endava. To be clear, he was only representing his own views.
Some key takeaways/thoughts from Chris’ point of view:
- Data used by only one use case in one way is not how you make money from data; it’s too expensive. Set yourself up to reuse data and make sure the organization is aware of what data is available.
- ?Controversial?: Tooling around data, especially metadata, has gotten better. But is it good yet? There are still some major fundamental gaps that seem like basic blocking and tackling around sharing data.
- Data isn’t the point, it’s merely a vehicle for exchanging information.
- Far too often, different business units share an implicit understanding of a taxonomy or set of terms that is actually incorrect, which leads to misunderstandings and mismatched data being treated as the same. But it’s not easy to make all aspects of all parts of data explicit and easily understandable; we have to invest and find good ways to do that.
- !Incrementally Important!: Business people in domains often don’t understand their own data because it’s embedded in an application. So they only experience their data in a context that is already framed for them by the application. So they don’t think about someone else not understanding the data inherently when those others _aren’t_ experiencing it through the same application.
- Getting to a ‘good enough’ level of documentation is crucial to prevent misuse of data based on misunderstandings. But there isn’t a blueprint: every organization has to figure out what level of documentation is good enough and how to get there.
- A constant challenge in data is implicit assumptions by producers around ‘they work here so they know this’ instead of documenting nuance. This leads to misunderstandings based on incomplete documentation.
- Beware the TLA – three letter acronym – in data documentation. It might have a lot of embedded context for those with domain knowledge but it’s not helpful for those without the understanding. Focus on explaining the concepts at a level an outsider can understand.
- ?Controversial?: The point of data literacy isn’t to teach everyone technical skills, it’s to get to an understanding of their own data and how to share the context of their data so others can get a decent understanding of their data. If a good use case emerges, we don’t need everyone to be able to create and maintain a data product but we need an organizational understanding of what data might be available and what it means.
- Tribal knowledge is a double edged sword. It’s great your organization has that knowledge but it’s a massive risk point. Get it out of people’s heads as much as possible. Scott note: find low friction ways like interviewing people to extract their context. Like this podcast…
- Most organizations don’t have a great data documentation strategy or practice. It’s better to get going with learning how to share information about your data than try to make your data documentation or documentation strategy perfect upfront. Get something in place and recognize the technical debt; something is better than nothing as long as it’s not treated as the end state.
- Trapped metadata – where tools try to enforce a closed system instead of easily offering up critical information for ingestion by other systems – is a persistent problem that doesn’t seem to be getting better. It’s even worse when you write custom code to do transformations because most people aren’t creating the necessary metadata at all.
- We need vendors to be bought in that publishing their metadata is the right move and head towards metadata standards to make creating a more complete picture via metadata easy/feasible. Some vendors are moving in that direction, notably Atlan and data.world.
- Truly getting people to change the way they think and feel – not just do work – is an incredibly difficult challenge that most companies don’t ever really address. Make the change in ways of working a value-add to actually change hearts and minds.
- To do data mesh well, we need to figure out scalable and highly effective ways of communicating changes – just offering data product versions won’t cut it. To do that, data producers need to know how their data is used as well.
- Companies need to really see and understand the business benefit from their data before they are likely to change their ways of working around data. That can be a chicken-and-egg issue though.
Chris started with his view that while tooling is getting better in general, most tools are still very lacking in how their metadata plays into the greater organizational view of data, which means we can’t do some pretty basic things. Or at least there aren’t comprehensive tools that make sharing context easy across teams, because of the many metadata incompatibilities and challenges. So we need to get to a way to show the semantic, context-related metadata as well as the transformation metadata in one place that is also understandable by business users. A tall order to be sure, but fundamental to enabling the vision of companies being actually data-driven.
There is a very common problem in organizations that comes from an implicit taxonomy and homonym problem, according to Chris. The classic example is the definition of a customer, but it goes far deeper than that. Some bit of data often has a very specific meaning in a source system and/or domain, but then a different business unit looks at it with their own interpretation of what it means and misses the nuance, the differences. So do you have an enterprise taxonomy, or do you try to document the exact differences in meaning, or do you not let people have access to data in case they misunderstand? None of these is as easy a choice as many would like for preventing misinterpretations or the mixing of misaligned data.
An interesting and very crucial nuance Chris mentioned about data sharing: the business people in a domain are often consuming their own data through a different lens. The data is embedded into an application for them, so the interface makes much of the nuance, the meaning, explicit. But that meaning isn’t included in the data by default: the column title in the table doesn’t carry that meaning. So those business people, the data producers, often struggle to understand why people are confused or don’t get the nuance. So it’s important to make sure the data producing domain understands the interface others use to consume their data. Scott note: this is a benefit I hadn’t considered of having domains consume from their own data products. If nuances about the data aren’t explicit, if the documentation isn’t good enough, will they get confused about their own data? Will that force them to do better in building their data products?
Chris hit on a common problem many are having in data mesh – and data in general: what depth of documentation and explanation is necessary for data to be useful and not misused? People automatically assume some level of knowledge of the domain simply because others are in the organization. ‘You work here so you understand my domain’ type of attitude. So we need to make sure people can at least understand what they don’t know and give them a way to get up to speed on what they need to know about a domain. Self-service can be a recipe for disaster if people can’t understand when they are missing the necessary understanding/meaning.
While domain-specific acronyms can have a lot of embedded information in them for people with knowledge of a domain, they are often a major hindrance to those trying to learn about the domain in Chris’ experience. Instead of focusing on exactly what your team calls everything, focus on the concepts and why they matter. As Shakespeare wrote, “what’s in a name?” Don’t be enamored with sharing context via domain-specific language. Referring back to Vlad Khononov’s DDD episode, the internal domain language is the ubiquitous language but the published language is what is used to share with the rest of the organization. Focus on that published language: how can things be understood easily by those outside the domain?
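One low-effort way to act on the acronym warning is to scan documentation text for acronyms that have no plain-language entry in a shared glossary. The following is only a minimal sketch under assumptions: the glossary contents, the `unexplained_acronyms` helper, and the sample description are hypothetical and were not discussed in the episode, and the pattern treats any run of 2 to 5 capital letters as an acronym.

```python
import re

# Hypothetical "published language" glossary: internal acronym -> plain explanation.
GLOSSARY = {
    "MRR": "monthly recurring revenue: subscription revenue normalized to one month",
}

# Assumption: an acronym is a standalone run of 2 to 5 capital letters.
ACRONYM = re.compile(r"\b[A-Z]{2,5}\b")


def unexplained_acronyms(doc_text, glossary):
    """Return the acronyms used in doc_text that have no glossary entry, sorted."""
    found = set(ACRONYM.findall(doc_text))
    return sorted(found - set(glossary))


# Hypothetical column description written by a producing domain.
description = "MRR per account, excluding accounts flagged by the CRB process."
print(unexplained_acronyms(description, GLOSSARY))
```

A check like this only flags candidates; someone with domain knowledge still has to write the explanation an outsider can understand.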
For Chris, the point and meaning of data literacy isn’t what most think. It’s about getting people to understand what data they have and its general meaning/context so it can be communicated to the rest of the organization. It’s understanding how data can be used and shared, not the exact technical aspects. It’s about getting to a capability to share context and understand others’ context around data without getting overly technical. When a use case emerges, not every single person in the company needs to be able to create and maintain a data product. Basically, the concepts matter far more than everyone learning SQL.
In Chris’ view, tribal knowledge is a very dangerous place to be. You have amazing and extremely valuable knowledge but it’s trapped in people’s heads. What happens if they leave? We all know about tribal knowledge but it’s especially important in data because again the context and nuance, not just the column name, matter 🙂 So extract that valuable tribal information and get it into a consumable format for the entire organization. It frees up the time of your most knowledgeable people too as they aren’t answering questions all the time: extract once, leverage many times 🙂
Good documentation, good knowledge sharing isn’t about anticipating every challenge and writing out the fix or entirely preventing it according to Chris – that’s not feasible. It’s okay to get things into a knowledge base rather than the perfect metadata tool at first. You want to improve but if you are waiting for the perfect solution, you won’t ever move forward with your data documentation. So get something out, put it in front of others, ask for feedback, and improve. And as stated, documentation doesn’t have to answer all questions – it’s something to make sure people generally understand a certain set of data and if they have a deeper question, they have a clear question escalation path for who to ask.
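As a rough illustration of what a ‘good enough’ bar could look like in practice, the sketch below checks that a dataset has a summary, a plain-language meaning for every column, and a named escalation contact for deeper questions. Everything here is an assumption for illustration: the `DatasetDoc` and `ColumnDoc` structures, the field names, and the sample `orders` dataset are hypothetical, and each organization would set its own bar.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDoc:
    name: str
    meaning: str = ""  # plain-language meaning, not just the column title


@dataclass
class DatasetDoc:
    name: str
    summary: str
    escalation_contact: str = ""  # who to ask when the documentation runs out
    columns: list = field(default_factory=list)


def documentation_gaps(doc):
    """List the ways a dataset's documentation falls short of the (assumed) bar."""
    gaps = []
    if not doc.summary.strip():
        gaps.append("missing dataset summary")
    if not doc.escalation_contact.strip():
        gaps.append("missing escalation contact")
    for col in doc.columns:
        if not col.meaning.strip():
            gaps.append(f"column '{col.name}' has no documented meaning")
    return gaps


# Hypothetical dataset: one column is documented, one is only a cryptic name.
orders = DatasetDoc(
    name="orders",
    summary="One row per confirmed customer order.",
    escalation_contact="orders-domain-team",
    columns=[
        ColumnDoc("order_id", "Unique identifier assigned at checkout."),
        ColumnDoc("status_cd"),
    ],
)
print(documentation_gaps(orders))
```

The same structure could live in a simple knowledge base page at first; the point is having something explicit to review and improve, not the tool.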
It’s easy to get lost in data by focusing on ‘data as the point’ in Chris’ experience. Data is merely a vehicle for exchanging information. But there are lots of interesting technical challenges in dealing with data so data people often lose the plot. Without the context around the data, it’s useless too, so we have to focus on delivering it as one packaged unit. Scott note: this is what Zhamak keeps referring to as a data product container or a unit of data – that it isn’t merely the 1s and 0s but the context, the user experience, the lineage, etc. wrapped in one package so it is usable as is.
For Chris, the biggest issue right now in data, especially for something like data mesh, is the trapped metadata problem. It’s something Scott has mentioned repeatedly: most tools that touch your data in some way at best generate metadata that is trapped in that tool or it’s extremely difficult to extract and integrate that metadata into other tools. And when people write custom code to do transformations, they often don’t generate the metadata at all! So trying to get the full necessary picture of what’s happening around our data is extremely time-consuming and difficult.
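For the custom-code case, one possible approach is to have each transformation record basic lineage metadata as a side effect of running. This is a sketch under stated assumptions, not something described in the episode: the `records_lineage` decorator, the in-memory `METADATA_LOG` (a stand-in for a real catalog or metadata store), and the dataset names are all hypothetical.

```python
import functools
import json
from datetime import datetime, timezone

# Stand-in for a catalog or metadata store (assumption for illustration).
METADATA_LOG = []


def records_lineage(inputs, output):
    """Decorator that appends a lineage record each time the transformation runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            METADATA_LOG.append({
                "transformation": func.__name__,
                "description": (func.__doc__ or "").strip(),
                "inputs": list(inputs),
                "output": output,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@records_lineage(inputs=["raw.orders"], output="analytics.daily_revenue")
def daily_revenue(orders):
    """Sum order amounts per day."""
    totals = {}
    for order in orders:
        totals[order["day"]] = totals.get(order["day"], 0) + order["amount"]
    return totals


result = daily_revenue([
    {"day": "2023-01-01", "amount": 10},
    {"day": "2023-01-01", "amount": 5},
])
# The records are plain dicts, so they can be serialized for other tools to ingest.
print(json.dumps(METADATA_LOG, indent=2))
```

Keeping the record as plain, serializable data is the design choice that matters here: it avoids recreating the trapped-metadata problem inside your own code.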
Chris called out the need for specification around metadata so we can at least bring it all into one place. Only a few vendors have moved to making it possible to even extract most of the metadata they create – he noted Atlan and data.world – but hopefully more vendors are pushed – or dragged – into doing the same. OpenMetadata or other early projects may provide a good way to start developing some standards for how things are described, shared, and/or stored. But again, trapped metadata is a lock-in pain that vendors are seemingly unwilling to let go of unless their hands are forced.
To really move forward with how we all approach data, as an industry and at the organizational level, we need to change the way people think and feel about data according to Chris. But change forced upon people only _might_ change their way of working and usually doesn’t. So we have to focus on changing hearts and minds or the behavioral changes won’t actually produce the necessary changes to the ways of working and understanding we need in data. Changing hearts and minds is easier said than done, but actually changing how we work requires empathy, not mandates.
Chris finished on two points. The first is that to really change the way an organization does data, it has to understand how data fits into its overall strategy and how treating data as a product impacts the work. Something may be valuable now but that value might fade. It’s okay, and even very healthy, to end-of-life any data work that is no longer valuable. The second point is that reuse is really key to generating strong business benefit from data. The cost of getting data to a point where you can leverage it is typically high, so look to make it reusable and find valuable ways to reuse it as much as possible.
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB