Provided as a free resource by DataStax AstraDB
In this episode, Scott interviewed Jay Sen, Data Platforms & Domain Expert/Builder and OSS Committer. While Jay currently works at PayPal, he was only representing his own viewpoints.
Some key takeaways/thoughts from Jay’s view:
- When you get to a certain scale, any central team should focus on, as Jay said, “Empower people, don’t try to do their jobs.” That’s how you build towards scale and maintain flexibility – your centralized team likely won’t become a bottleneck if they aren’t making decisions on behalf of other teams.
- To actually empower other teams, dig into the actual business need and work backwards to a solution that meets it. If there is a solution already in place that isn’t working any more, look for ways to augment it rather than trying to replace it or reinvent the wheel.
- Self-service is a slippery slope – it often solves the immediate problem of time to market but also creates next-level challenges. A big issue is that when you remove the friction to data access, you are throwing the challenge of finding the right data onto the consumer’s plate.
- Data contracts are great when everybody aligns on a single contract and there are enough tools to support the contracts. But they also create a proliferation of data to enforce the contracts required by multiple consumers – thus, they often don’t survive the real world.
- The data catalog space is finally getting some needed attention. But there are still a myriad of issues that need solving. Whether those will be solved by technology or by leveraging a “data concierge” remains to be seen.
- It’s insanely easy to overspend in the cloud. Everyone is vaguely aware of that, but cost should be part of every important architectural discussion. Driving business value is not enough on its own; you must also focus on cost, as return on investment is far more important than simply return.
Jay took a few lessons from working on a central services team in a company of ~200 people. Having a centralized team was doable at first but as the org scaled, it quickly got complicated. As a centralized team, it’s very easy to become a bottleneck but Jay learned a lesson that has continued to help in his subsequent roles: “empower people, don’t do their jobs.” Focus on reducing the friction to others doing their work instead of doing it for them.
That is easier said than done, so how do you empower people? Per Jay, you must understand the business aspect and what the requestor actually needs. That usually won’t get communicated well in a ticket, so you should have a high-context information exchange to take what they need and convert it into a workable solution. And often, there is already a solution in place but it’s just not handling the job anymore. So you want to consider whether you can solve the same issues in a better way by building on what exists. It’s much easier to do a greenfield deployment but brownfield is an inevitable facet of enterprise data work.
Per Jay, a few good things to remember: 1) Frameworks and technology come and go but the concepts are the things that stick around. Focus on solving issues by leveraging technology and frameworks, not relying on them. 2) When working in data, you can’t favor data producers over consumers or vice versa. It is easy for many to align with consumers but all stakeholders need assistance. 3) Beware the cool and trendy tech or approaches. Yes, it’s funny to say that on a data mesh podcast but often, engineers just want to play with cool things and take on gnarly challenges. Stay focused on the business issues and value. 4) Trust in data takes a long time to build and seconds to break in Jay’s experience.
For Jay, self-service can be a very slippery slope. It often creates more issues than it solves. But it feels like it solves something and that is the allure. Part of the issue with self-serve is that it removes necessary governance in order to reduce friction to accessing data. So you can get to data but don’t have the understanding necessary to actually use it properly. That makes it crucial to embed governance into data access if you are going for something like self-serve.
There is often a high cost to adopting any technology in Jay’s view. Not just initial adoption but tool stewardship. But it’s typically much higher for the latest technology. Make sure if you are looking at an immature solution that you’re focused on solving the right problems.
Jay shared that data contracts are evolving in a good direction. But they are still not addressing all the challenges people want data contracts to handle. They can be really good at making sure people understand their responsibilities. As Emily Gorcenski noted in her episode as well, you must drive to meaningful conversations between producers and consumers to define quality, as well as other SLOs, very specifically. In Jay’s view, contracts work in an ideal world of one-to-one communication between domains but often, there are multiple parties on each side that view things slightly differently. So the contracts rarely fully cover all use cases and can be, at best, a good conversation point for negotiations.
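To make the multi-consumer problem concrete, here is a minimal sketch in Python. It is not Jay's approach or any particular data contract tool; the `Contract` shape, the `orders` dataset, and the `finance` and `marketing` consumers are all hypothetical, chosen only to show how one producer output can satisfy one consumer's contract while failing another's.

```python
from dataclasses import dataclass

Row = dict[str, object]


@dataclass(frozen=True)
class Contract:
    """One consumer's expectations of a shared dataset (hypothetical shape)."""
    consumer: str
    required_fields: frozenset[str]
    max_null_rate: float  # highest tolerated share of rows missing a required field


def null_rate(rows: list[Row], fields: frozenset[str]) -> float:
    """Share of rows where at least one of the given fields is missing or None."""
    if not rows:
        return 0.0
    bad = sum(1 for row in rows if any(row.get(f) is None for f in fields))
    return bad / len(rows)


def check(rows: list[Row], contract: Contract) -> bool:
    """True if the dataset meets this consumer's contract."""
    return null_rate(rows, contract.required_fields) <= contract.max_null_rate


# Illustrative producer output; one row is missing its country.
orders: list[Row] = [
    {"order_id": 1, "amount": 10.0, "country": "US"},
    {"order_id": 2, "amount": 25.5, "country": None},
    {"order_id": 3, "amount": 7.25, "country": "DE"},
    {"order_id": 4, "amount": 12.0, "country": "FR"},
]

# Two consumers who care about different fields and tolerances.
finance = Contract("finance", frozenset({"order_id", "amount"}), 0.0)
marketing = Contract("marketing", frozenset({"order_id", "country"}), 0.1)

results = {c.consumer: check(orders, c) for c in (finance, marketing)}
print(results)
```

In this sketch the same dataset passes for finance and fails for marketing, so a single "is the contract met?" answer does not exist once more than one consumer is involved. That is the point at which the contract becomes a starting point for negotiation rather than a settled agreement.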
Jay is excited that the data catalog space is getting some very necessary attention. Every company of any size is now dealing with petabyte levels of data, so organizing it is becoming a major necessity. There are quite a few challenges left to tackle, including: 1) automated dataset discovery and definitions; 2) naming conventions still aren’t standardized; 3) over-reliance on auto-documentation when human input is required; 4) how do we build trust? 5) how can we empower the data applications? 6) how do we deal with the trapped metadata problem? Jay believes that the catalog must understand not just what the dataset is but also why it exists.
When asked about whether he thought systems or people should be the focus in enabling data discovery, Jay said to focus on systems to make the onboarding experience the best it can be. That will make it easiest for people as you scale. Scott disagrees and believes a “Data Concierge” role will serve organizations well – but that exceedingly few organizations will actually create and leverage such a role.
Jay then shared his thoughts on understanding, containing, and preventing unnecessary costs in data management in the cloud. It’s quite important as it is VERY easy to spend a lot when you move to the cloud. Scott agreed as he previously managed AWS costs for a public company and saw that firsthand. Jay pointed out that it’s often a bad idea to do a one-to-one mapping of what you were doing on-prem to the cloud. The cost structure is often very different and it can cost you a fortune. But rearchitecting also has a cost. Evaluating cost should play a role in every part of data work if you really want to drive good business outcomes.
In wrapping up, Jay reiterated that technology needs to solve real business problems, not just be cool. Really consider the long term costs of adopting a new solution.
Jay’s LinkedIn: https://www.linkedin.com/in/jaysen2/
Posts by Jay:
Next-Gen Data Movement Platform at PayPal: https://medium.com/paypal-tech/next-gen-data-movement-platform-at-paypal-100f70a7a6b
How PayPal Moves Secure and Encrypted Data Across Security Zones: https://medium.com/paypal-tech/how-paypal-moves-secure-and-encrypted-data-across-security-zones-10010c1788ce
The Evolution of Data-Movement Systems: https://jaysen99.medium.com/evolution-of-data-movement-f12614d6e9de
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/