Sign up for Data Mesh Understanding’s free roundtable and introduction programs here: https://landing.datameshunderstanding.com/
Please Rate and Review us on your podcast app of choice!
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Ebru’s Twitter: @ebrucucen / https://twitter.com/ebrucucen
Ebru’s LinkedIn: https://www.linkedin.com/in/ebrucucen/
In this episode, Scott interviewed Ebru Cucen, Lead Consultant at Open Credo. To be clear, Ebru was only representing her own views on the episode.
Some key takeaways/thoughts from Ebru’s point of view:
- It’s far too hard for data producers to actually reliably produce clean, trustworthy, and well-documented data. We need to give them a better ability to do that; whether that means tooling or new ways of working remains to be seen. Scott note: It’s no wonder it’s been hard for many teams to get their domains to own their own data 😉
- There is a hidden challenge in data-intensive service/application development. The versions of the data – the schema, the API, and the data itself – need to be understood and coordinated because developers no longer control their own data sources, unlike in the software development of the past. But we don’t have good ways of doing that right now on the process or tooling front – data product approaches help but fall short.
- We are lacking the tooling to easily manage data quality for producers. While there are so many data-related tools, there is a real lack of tools that make it easy to manage quality. We are getting there on observing and monitoring quality, but not on managing and maintaining it.
- Fitness functions can help you measure if you are doing well on your data quality/reliability.
- As the speed of reliably shipping changes on the application side increased – thanks to microservices and DevOps – the data warehouse, the data monolith, became that much harder to deal with. Instead of slow-changing inputs and gentle evolution, it simply became more and more of a data exhaust model that breaks the warehouse.
- Large data monoliths are just far too hard to maintain, especially as the speed of change of application and the world increases.
- However, monoliths aren’t ‘the enemy’ – even microservices advocates say a monolith is sometimes the right call. Look to figure out what’s the right solution for both now and the future. Don’t distribute or decentralize without a specific reason.
- ?Controversial?: Data really needs much better version control systems and practices. Yes, there is the versioning of the data product but the actual versioning of the data – that immutability factor, when did this data change and what was it before that – is the most important versioning for data/analytics.
- Versioning means safety – safety for consumers but especially for producers to be able to roll back. We need those better safety features so we can test much more thoroughly in data but right now, we don’t have great ways to do that.
- It’s hard to fight Conway’s Law. If we don’t fix our ways of working together, it is extremely difficult for consumers and producers to align well enough to get the most value from our data. Communication issues will be reflected in the data as well.
- The tools we have for data are so specialized that you might need 5+ tools just to properly manage a simple ingestion process – the tooling just isn’t there to support producers well enough.
- How can we observe and validate data before writing very specific tests? Testing shouldn’t be the only line of defense. We need a way to define and create our quality gateways much more easily.
- Fast feedback cycles and close collaboration around data, especially in data science, make everyone so much more productive. E.g. people aren’t building on deprecated data sources and you can get to initially testing a hypothesis in a day or days instead of weeks.
- It’s important to think of your data like a garden instead of a single project – you must tend to it and improve it further. Your garden is never “done” and weeds can creep in quite easily. Get that green thumb.
- To build good models in data science, you have to ask: what questions can we ask of the data, can we get enough data, is it of high enough quality, etc.? You need to answer whether you will be able to achieve a likely positive outcome and then iterate towards good – and then make it better – instead of making things all or nothing with a static model.
- Not all questions can be asked of the data you have and you need to measure how well what data you have can answer the questions you want to ask. Be realistic about what you are trying to do and what you actually can do based on what you have.
- How do we create psychologically safe environments where people can fail safely and learn from that – we need iterative communications, interactions, learning, and development.
- Inject more empathy into your teams and communications. We need a better way of understanding the challenges and what we are achieving together, instead of focusing on what each person’s role is. The sum of the parts is the purpose.
- ?Controversial?: As we increase the amount of data we have and the number of people attempting to leverage that data – and let’s not forget the increasing complexity of the world at large – we are likely to see it get harder to communicate about data. We have to try harder than ever to get it right, or at least to an acceptable end outcome.
- ?Controversial?: Similarly, our understanding of certain questions or sets of data will change more frequently than historically and communicating that – and why our understanding has evolved – is going to get more complex as well.
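The “fitness function” idea from the takeaways above can be sketched as a small check that scores a dataset against a target and tells you whether quality is holding up. This is a minimal illustration only – the field names, example records, and the threshold are assumptions for the sketch, not anything specific from the episode:

```python
# Hypothetical data quality "fitness function" sketch.
# Field names and the threshold are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class FitnessResult:
    name: str
    passed: bool
    score: float


def completeness_fitness(rows: list[dict], required_fields: list[str],
                         threshold: float = 0.99) -> FitnessResult:
    """Score the share of rows with all required fields populated."""
    if not rows:
        return FitnessResult("completeness", False, 0.0)
    complete = sum(
        1 for row in rows
        if all(row.get(f) not in (None, "") for f in required_fields)
    )
    score = complete / len(rows)
    return FitnessResult("completeness", score >= threshold, score)


# Example: 1 of 3 rows has every required field populated,
# so the score falls below a 0.9 threshold and the check fails.
orders = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "", "amount": 12.5},
    {"order_id": 3, "customer": "b", "amount": None},
]
result = completeness_fitness(orders, ["order_id", "customer", "amount"], 0.9)
```

The same shape works for other fitness dimensions (freshness, uniqueness, referential integrity) – each returns a named, comparable score you can track over time rather than a single pass/fail test.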
Ebru started by sharing her background: she was a software engineer and trainer – including training people on SQL – before moving into data/data science. As a software engineer, it was crucial to at least model and understand data well enough to ingest and store it for the application side. The big challenges for software engineers really came in integrating that data into the monolithic data warehouse and then keeping it well integrated as the application evolved. The monoliths were bottlenecks on the software side, and the integration into the data monolith was becoming too much of a major bottleneck for all. As the DevOps and microservices movements picked up steam, the speed of reliably changing applications significantly increased. That increased speed created more and more challenges in integrating into a monolith – the applications drifted far too quickly to easily work with the data monolith. But monoliths aren’t necessarily the wrong choice for all; it’s at scale, and especially scale of complexity, that they become a massive bottleneck.
When talking about versioning, Ebru talked about the many-copies-of-data challenge – which one is the right one to use and can I trust it? There are people doing incredibly important work on data where they can’t reliably trace it to source and know they are working on the right version. And with no clear ownership of data, nothing ever gets cleaned up, so finding a reliable, repeatable source of data is very hard. So people copy the data they do find to their work area lest the source go away, creating more copies. We’ve figured out how to do versioning relatively well on the software/microservices side with APIs, but we haven’t figured it out for data – whether that is versioning the analytical API or the data itself. It’s far too hard to make our data assets maintainable right now, thus the big push to data mesh.
For Ebru, when asked specifically what is the most important aspect of versioning in data – code, schema, API, or the data itself – she chose the data itself. This is a somewhat controversial choice but her reasoning was traceability – what actually happened to the data and when did it change? As for code versioning, she expects we’ll see more version control systems emerge, and many people already manage their data-related code in git or other systems.
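Ebru’s point about versioning the data itself – immutability, “when did this change and what was it before?” – can be sketched as an append-only store where updates never overwrite previous values. This is an illustrative toy, not any specific tool’s API:

```python
# Toy append-only versioned store: every put() keeps history, so you can
# ask "what was the value as of time T?" and safely roll back.
# (Illustrative sketch only; not a production design.)
from datetime import datetime, timezone


class VersionedStore:
    def __init__(self):
        self._log = {}  # key -> list of (timestamp, value), append-only

    def put(self, key, value, ts=None):
        """Record a new version; never overwrites earlier versions."""
        ts = ts or datetime.now(timezone.utc)
        self._log.setdefault(key, []).append((ts, value))

    def get(self, key, as_of=None):
        """Return the latest value at or before `as_of` (default: now)."""
        versions = self._log.get(key, [])
        candidates = [v for t, v in versions if as_of is None or t <= as_of]
        return candidates[-1] if candidates else None

    def history(self, key):
        """Full traceability: every (timestamp, value) pair for a key."""
        return list(self._log.get(key, []))


# Usage: two versions of the same record, queryable at any point in time.
store = VersionedStore()
t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 6, 1, tzinfo=timezone.utc)
store.put("price", 10, ts=t1)
store.put("price", 12, ts=t2)
```

Time-travel reads (`get(key, as_of=...)`) are exactly the safety property Ebru describes below: a consumer can pin to a known-good version, and a producer can roll back without losing the record of what changed.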
Another point Ebru made was that software development hasn’t really had a focus on aligning itself to what version of data it is using. When you do a production deployment, the database is the database, it’s tied to the application. But when we start to think about how we actually deploy software going forward, if it is referencing external data as part of that, the version of the data source it’s leveraging obviously matters far more and we need far more coordination to ensure the software is referencing what we need it to. There is not enough tooling out there to easily manage this coordination and it’s causing far too many issues.
Scott note: this is a really incremental thought here but VERY hard to explain. Historically, most services have been more or less wholly contained in what data they use or they access information from other services via a versioned API on the operational plane. So the coordination is less challenging. We have not really figured out well how to do that for data intensive applications – this is partly why everyone is building data products, whether data mesh or not, but it’s still challenging if you don’t think about providing a steady access mechanism and a way for a consumer to know what they are accessing hasn’t suddenly changed without their knowledge. See the episode on my rant on data contracts and how it’s not just schema and constraints.
We just can’t escape Conway’s Law, according to Ebru. While many people have applied it to the operational plane, we really need to think about how Conway’s Law applies to data. The way we exchange information can’t only be the data itself; we need to get better at how we actually communicate and collaborate internally, or gaps in how we communicate will be reflected in the data and our data integrations. Without fixing the way we work together and communicate, producers and consumers will not collaborate well enough to leverage our data to the fullest extent.
Ebru believes that right now, it’s still far too hard for producers to reliably publish clean, trustworthy, and understandable data. We haven’t developed great ways of working and the tools are definitely not there yet. So if we try to push ownership on producers too quickly, it will not go well. They have historically published what they want; we need to make it far easier for them to publish what consumers specifically want, or producers likely won’t want to participate.
Data mesh is a sociotechnical approach, but for Ebru, there is a lot of talk about the social while the technical is still lacking. There are so many tools, but they don’t work together that well natively and most only do a few very specific things – you could need 5+ tools to accomplish just the ingestion part of a use case. There is also a major challenge on the testing side – can you observe what changes would occur before writing the tests?
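The “quality gateway” idea mentioned in the takeaways – validating data at ingestion so tests aren’t the only line of defense – can be sketched as a set of declarative rules applied before records reach consumers. The rule names and example records here are assumptions for illustration:

```python
# Sketch of a declarative quality gateway run at ingestion time.
# Rules are plain named predicates, so producers can declare checks
# without stitching together several specialized tools.
# (Rule names and records are illustrative assumptions.)
def quality_gateway(records, rules):
    """Split records into accepted and rejected, tagging each rejected
    record with the names of every rule it failed."""
    accepted, rejected = [], []
    for rec in records:
        failures = [name for name, rule in rules.items() if not rule(rec)]
        if failures:
            rejected.append((rec, failures))
        else:
            accepted.append(rec)
    return accepted, rejected


rules = {
    "has_id": lambda r: r.get("id") is not None,
    "amount_positive": lambda r: (r.get("amount") or 0) > 0,
}

good, bad = quality_gateway(
    [{"id": 1, "amount": 5}, {"id": None, "amount": -1}],
    rules,
)
```

Because the checks are data, not test code, the same rule set can drive observation (count failures per rule over time) as well as enforcement (quarantine rejected records) – one declaration, both uses.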
In general, we need to change our ways of working in data to enable much faster feedback cycles in Ebru’s view. She was working on a project where everyone was in close collaboration and you could try things out and get feedback in the same day, meaning there was far less time spent building toward a solution only to find out the data wasn’t available or there were other challenges. With better data ownership, we can go from idea to ingestion to testing in a short period of time, significantly improving how productive data science team members are.
Scott note: if you listen to early data mesh presentations from Zhamak, she talks more about data science/machine learning/AI than regular old analytics. This is that data bazaar/data marketplace kind of concept in action.
Ebru believes we need to take more learnings from microservices, especially the concept of Lego pieces. In data, we haven’t built incrementally to achieve good value – it’s often been all or nothing. But cloud means we have a chance to do things differently. That iteration means we can fail faster too – if we have an idea but can’t get the right data, or even enough of the right data, we can change course instead of building for weeks. It’s important to realize you can’t ask any question of any data as well – sometimes you have a question that just can’t be answered with the data you have or can get, and that’s okay.
To do data well/better, Ebru believes we need to create psychological safety and an ability to fail safely. That means we will have to train data consumers far better on how we work with data – a 95% confidence interval doesn’t mean what most people believe it does. And our understanding of data evolves too, so consumers must learn to evolve their understanding. Human interaction is far more crucial than many want to believe in doing data well.
In data, as Zhamak has mentioned, there is a trend towards super-fractional roles. Ebru believes there is far too much focus in many organizations on what specifically is “my role” instead of what the team’s role is and how we can make sure we accomplish our objectives. This fractional thinking of course creates more friction, challenges, and handoffs – and handoffs are always a place of lost context. So work to have teams focused on accomplishing team goals instead of individual ones.
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/