Data Mesh Radio Patreon – get access to interviews well before they are released
Episode list and links to all available episode transcripts (most interviews from #32 on) here
In this episode, Scott interviewed Mahmoud Yassin, Lead Data Architect at ABN AMRO, a large bank headquartered in the Netherlands.
Some high-level takeaways/thoughts from Mahmoud’s view:
- It’s very difficult to do fully decentralized MDM, which led to some duplication of effort – that can mean increased cost and people not using the best data. ABN tackled this through their Data Integration Access Layer – similar to a service bus.
- They are using that centralized layer – called DIAL – to help teams manage integrations, both those that run continuously and those created on the fly. It also helps them spot duplicated work that could have been reuse instead.
- If Mahmoud could do it again, he’d focus on enabling easy data integration earlier in their journey to encourage more data consumption. Cross domain and cross data product consumption is highly valuable.
- The industry needs to develop more and better standards to enable easy data integration.
- Data mesh and similar decentralized data approaches cannot fully decentralize everything. Look for places to centralize offerings in a platform or platform-like approach that can be leveraged by decentralized teams.
- Most current data technology licensing models aren’t well designed for or suited to doing decentralized data – it’s easy to pay a lot if you aren’t careful – or even if you are careful!
- A tough but necessary mentality shift is not thinking about being “done” once data is delivered. That’s data projects, not data as a product.
- Try to keep as much work as possible within the domain boundary when doing data work. Of course, cross-domain communication is key but try to limit the actual work dependencies on other domains if possible.
- A data marketplace enables organizations to more easily create a standardized experience across data products and make data discovery much easier. You don’t necessarily have to tie your cost allocation models to the marketplace concept.
- Sharing what analytical queries/data integration “recipes” people are using has been important for ABN. It drives insights across boundaries and also creates a lower bar to interesting tangential insight creation/development.
- You should consider not allowing integrations across multiple data products by default. Producers should be able to stop integrations – for compliance purposes or because the integration doesn’t actually provide good/valuable/correct insights.
- Traditional ETL development is about translating the business needs to code. But centralized IT usually can’t deeply understand the business context and needs so they deliver substandard solutions. If you consider that business needs evolve, it gets even worse.
Mahmoud started his career in data as an ETL developer, so he saw the ever-increasing issues with the traditional enterprise data warehouse approach in large organizations. Then he moved on to working with the common way people have approached data lakes – managed by a centralized team – and to him the issues seemed pretty similar to those with data warehouses. So he was glad to start working with ABN AMRO on decentralizing their approach to data starting about 4 years ago.
ETL development is about translating the business needs to code. But Mahmoud saw the same problem many organizations are having – it is very hard for IT to really understand the real business context and needs relative to data. They try, but it often is only on the second or third attempt – if at all – that IT would really understand and get it right. They simply cannot get enough context to serve needs well.
So to kick off the discussion, Mahmoud made it clear: there is no perfect data architecture. Not one that fits all organizations and not even one that will fit your organization throughout time as it evolves or potentially across all needs in your organization, so look to what fits your organization at the moment. And it’s okay to take pieces from multiple approaches and try them to see if they fit as a cohesive strategy. But do make sure to not just pick the easiest or most fun parts from multiple strategies – cohesion is crucial.
For Mahmoud, a key mentality shift in doing decentralized data, especially data mesh, has been around what “done” looks like. When you think about a physical goods-type product, it’s not done once it goes to production. With IT-run data, it was typically that project mentality and “done” was when you delivered the data and moved on. It is crucial to learn how to do actual product management, not just software product management, to understand how to do data as a product right.
A key learning from Mahmoud and team, which echoes something Jesse Anderson mentioned, is trying to keep as much of the data work as possible inside the domain. There is obviously a need for communication and collaboration across domain boundaries, but there is a big cost to crossing those boundaries when doing data work.
While some people think we should decentralize everything we can in data (Scott calls those people simply “wrong”), Mahmoud and team found there to be a significant cost to decentralizing the wrong things. They have a centralized governance layer to make things easier on data product producers and consumers. And trying to do fully decentralized MDM (master data management) can quickly lead to duplication of effort and data. Omar Khawaja mentioned similar issues early in Roche’s journey.
So, how did ABN tackle these data duplication challenges? Per Mahmoud, they created DIAL, their Data Integration Access Layer. This is similar to a service bus on the operational plane, with the DIAL layer handling data quality checking, business metadata, technical metadata, checking against an interoperable data format, etc. It is another instance of a centrally managed service that is leveraged by decentralized teams.
Similar to a number of other organizations, Mahmoud discussed how ABN AMRO is creating an internal data marketplace as the mechanism for centralized data discovery and consumption. This way, there is a standard user experience when looking for and trying to understand what data is available. A standardized experience is crucial to really drive data consumption. The marketplace requirements also lead to a very transparent way to share data.
Per Mahmoud, ABN is also working on making the data integration experience standardized in a few different ways. The previously mentioned DIAL layer is a centralized way to do integrations, whether that is creating new data sets that are reused across multiple downstream data products or integrating in more of a virtualized, on-the-fly way. If you aren’t careful, it is pretty easy – especially if there are domains that might naturally touch on similar concepts – to duplicate work, which can cost A LOT, especially because most data tool licensing isn’t designed for doing decentralized data.
As part of or similar to the marketplace concept, Mahmoud talked about how ABN is creating integration recipes. So while recipes may not be a data product, these repeated integrations may be similar to a downstream data product in how they present to data consumers. And other consumers can leverage the same recipe or clone it and adapt it to their needs. It has been very important to share what recipes others are using to drive insight sharing across domains.
To help manage compliance/governance and also to make sure data consumers understand what they are actually consuming, the DIAL layer prevents people from doing data integration without consent from data producers. Ust Oldfield mentioned something similar regarding how self-serve without understanding by data consumers can cause major issues.
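As a rough illustration of the consent idea above, here is a minimal Python sketch of a deny-by-default integration check. It is not ABN AMRO's implementation: the `DataProduct` class, the `approved_consumers` field, and the `request_integration` function are all hypothetical names made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical record of a data product and who may integrate it."""
    name: str
    owner_domain: str
    # Consumers the producer has explicitly approved; empty means nobody.
    approved_consumers: set[str] = field(default_factory=set)


class IntegrationDenied(Exception):
    """Raised when at least one producer has not consented."""


def request_integration(consumer: str, products: list[DataProduct]) -> list[str]:
    """Allow an integration only if every producer has approved the consumer."""
    missing = [p.name for p in products if consumer not in p.approved_consumers]
    if missing:
        raise IntegrationDenied(
            f"{consumer} lacks producer consent for: {', '.join(missing)}"
        )
    return [p.name for p in products]


# Example: the producer of "payments" has approved risk-team, "customers" has not.
payments = DataProduct("payments", "payments-domain", {"risk-team"})
customers = DataProduct("customers", "crm-domain")
```

In this sketch, `request_integration("risk-team", [payments])` succeeds, while adding `customers` to the list raises `IntegrationDenied` until that producer grants consent. The point is only that integration across products is blocked unless each producer has said yes.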
Mahmoud and Scott discussed how different simply creating data products is from data as a product thinking. If you are really thinking of your data as a product, versioning and the actual data product interface are crucial. And with versioning, it’s important to know who will be impacted by a change when assessing if and how a change should happen.
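To make the "who will be impacted" question concrete, here is a small Python sketch. It assumes, purely for illustration, that data products use semantic-style version strings where a major version bump signals a breaking change, and that consumer subscriptions are recorded somewhere. The episode does not describe how ABN tracks this, and `Subscription` and `impacted_consumers` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """Hypothetical record of a consumer pinned to a product's major version."""
    consumer: str
    product: str
    major_version: int


def impacted_consumers(
    subs: list[Subscription], product: str, old_version: str, new_version: str
) -> list[str]:
    """List consumers affected when a product moves between versions.

    Assumption: only a change in the major version is treated as breaking.
    """
    old_major = int(old_version.split(".")[0])
    new_major = int(new_version.split(".")[0])
    if new_major == old_major:
        return []  # non-breaking change under this assumption
    return sorted(
        {s.consumer for s in subs
         if s.product == product and s.major_version == old_major}
    )


subs = [
    Subscription("risk-team", "payments", 1),
    Subscription("marketing", "payments", 1),
    Subscription("finance", "customers", 2),
]
```

Calling `impacted_consumers(subs, "payments", "1.4.0", "2.0.0")` returns the consumers still on major version 1 of `payments`, which is the list a producer would want in hand before deciding if and how to ship the change.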
One thing Mahmoud would do differently is focus more on encouraging/enabling data consumption earlier in the journey. While consumption is picking up, it is still below desired levels and lags behind their maturity at getting data onto the platform to share. Part of the reason for lower than desired consumption came from leaving the focus on data integration until later in their journey. They are trying to find – or if not find, then develop – better standards to make data integration easier. While there are some standards for metadata like OpenMetadata, it’s still early days.
Lastly, Mahmoud mentioned how their metadata was just getting to be in too many places so they are building out a metadata lake – a tool-agnostic lake for their metadata. It remains to be seen if this is a common pattern in data mesh but it may address one of Scott’s big concerns – the “trapped metadata” problem.
Mahmoud’s LinkedIn: https://www.linkedin.com/in/mahmoudyassin007/
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB