Data Mesh Radio Patreon – get access to interviews well before they are released
Episode list and links to all available episode transcripts (most interviews from #32 on) here
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Data Mesh at PayPal blog post: https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522
JGP’s All Things Open talk (free virtual registration): https://2022.allthingsopen.org/sessions/building-a-data-mesh-with-open-source-technologies/
JGP’s LinkedIn: https://www.linkedin.com/in/jgperrin/
JGP’s Twitter: @jgperrin / https://twitter.com/jgperrin
JGP’s YouTube: https://www.youtube.com/c/JeanGeorgesPerrin
JGP’s Website: https://jgp.ai/
In this episode, Scott interviewed Jean-Georges Perrin AKA JGP, Intelligence Platform Lead at PayPal. JGP is probably the first guest to lean into using “data quantum” instead of “data product”. JGP did want to emphasize that, as of now, he was only discussing the implementation for his team, the GCSC IA (Global Credit Risk, Seller Risk, Collections Intelligence Automation), within PayPal.
Some key takeaways/thoughts from JGP’s point of view:
- Data mesh as it’s been laid out by Zhamak obviously leaves a lot of room for innovation. For some, that’s great. For others, they want the blueprint. And it’s okay to wait for the blueprint. But JGP and team are excited to innovate!
- PayPal’s 3 main initial target outcomes from data mesh: A) faster and easier data discovery, B) easier use of data in a governed way, and C) increased data consumer trust in data.
- PayPal’s initial data consumers are data scientists so their platform and data quanta are built to serve that audience first.
- Really consider what you want to prove out in your MVP. Is that minimum viable A) data quantum, B) data platform, C) data mesh, or D) something else? Only doing a data quantum probably sets you up for trouble and a platform only won’t be tested until it has data quanta on it.
- Data contracts are crucial to making trustability actually measurable and agreed upon. Otherwise, it’s far too easy to have miscommunication between data producers and consumers, which leads to lack/loss of trust.
- Producers, don’t set your data contract terms too strictly when first launching a data quantum. There’s no need to over-engineer – despite how interesting that can sometimes be…
- For too long, we have tried to keep software engineering and data engineering overly separate. They are both just engineering with a slightly different focus, and data mesh really leans into that.
- We’ve also tried to keep operational and analytical far too separate. We should look to build out tooling where data can live that serves both operational and analytical workload needs. But we aren’t there yet.
- According to JGP, analytical APIs, at least as far as we’ve seen them to date, are just not going to do what we need for accessing data from data products/quanta.
- Standardizing metadata access APIs across data quanta has made it very simple for data consumers to begin using new data quanta as they are introduced to the mesh. PayPal has observability, discovery, and control APIs.
- Domain is an overloaded word. It can mean a very large high-level domain like Sales, Finance, or Marketing with hundreds or thousands of people in it or it can mean a smaller, ‘two pizza team’ level scale.
- PayPal is doing only one data quantum per domain but that domain is really at the two pizza team scale – they aren’t trying to have a single data quantum for Marketing.
- It’s crucial to understand that data quanta and the use cases they power both have life cycles, so really applying product thinking matters.
- Most data engineering teams do work in a waterfall approach and that just doesn’t scale well. However, moving to data mesh can mean additional cognitive load as it really requires an Agile mindset to do right and that shift in the ways of working is not trivial.
- It’s good to have smaller delivery requirements so you get faster feedback on what you are creating – a core tenet of Agile. Don’t try to deliver everything all at once. Get it in users’ hands early to get feedback.
JGP started the conversation talking about how, in his team, he’s really leaning into the idea that software engineering and data engineering are not that different. Zhamak has discussed this too. We should focus on sharing practices so we all create better software and infrastructure. For JGP, data engineering work in most organizations has followed a very waterfall approach. However, his team has been mostly working in an Agile manner. Therefore, starting data mesh wasn’t a huge switch in their ways of working, like it is at many organizations. And luckily, there was already an appetite for changing the way they were tackling data challenges.
In the spirit of being agile and capital A Agile as well, PayPal set out on their data mesh journey. They wanted to do an MVP but what was the P? Minimum Viable Data Product/Quantum? Minimum Viable Platform? Both? Minimum Viable Mesh? JGP recommends looking at what you want to deliver as a minimum unit of value. PayPal already had extensive data platform expertise, so they were able to focus on delivering data products/quanta (quanta being the plural of quantum), but they still built out their initial data quantum and mesh capabilities in parallel. As many guests have noted, it’s dangerous to only do a minimum viable data product/quantum.
PayPal has been building data platforms for a long time. As mentioned by JGP, they were one of the pioneers of the self-service data platform concept. But data mesh offered a path to faster and easier data discovery, to making it easier to use data in a governed way, and to increased trust in data by the data consumers – their first consumers being data scientists. A big benefit of addressing those needs is those data scientists are able to better tell if the data they access is the right data for their use case.
One thing JGP emphasized that’s significantly helping PayPal move forward is standardizing APIs across data quanta. Those are not data access – or analytical – APIs, as JGP thinks those will just never work all that well. Instead, as their audience is data scientists only to start, everything anyone needs other than the actual 1s and 0s of the data is accessible via Python APIs: the metadata, the observability/trust data, etc. Then, the data scientists use notebooks to work with the data. Standard APIs mean data consumers only have to learn one interface. This is similar in concept to what many are doing with data marketplaces – one standardized way to interact with the information about the data quanta.
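To make the "one interface for every data quantum" idea concrete, here is a minimal Python sketch. It is not PayPal's actual API: the `DataQuantum` protocol, the `InMemoryQuantum` class, the method names, and the sample metadata are all hypothetical, chosen only to mirror the discovery, observability, and control categories mentioned in the episode.

```python
from dataclasses import dataclass, field
from typing import Protocol


class DataQuantum(Protocol):
    """Hypothetical common interface; every data quantum exposes the same calls."""

    def discover(self) -> dict: ...
    def observe(self) -> dict: ...
    def control(self, action: str) -> str: ...


@dataclass
class InMemoryQuantum:
    """Toy data quantum that serves metadata only, never the data itself."""

    name: str
    owner: str
    description: str
    metrics: dict = field(default_factory=dict)

    def discover(self) -> dict:
        # Discovery: descriptive metadata a consumer needs to find the quantum.
        return {"name": self.name, "owner": self.owner, "description": self.description}

    def observe(self) -> dict:
        # Observability: trust-related metrics (the values here are made up).
        return dict(self.metrics)

    def control(self, action: str) -> str:
        # Control: a small, fixed set of management actions (assumed for illustration).
        if action not in {"refresh", "pause"}:
            raise ValueError(f"unsupported action: {action}")
        return f"{self.name}: {action} requested"


def summarize(quantum: DataQuantum) -> str:
    """Works for any object that implements the shared interface."""
    meta = quantum.discover()
    obs = quantum.observe()
    return f"{meta['name']} (owner: {meta['owner']}), completeness={obs.get('completeness')}"


if __name__ == "__main__":
    q = InMemoryQuantum(
        name="seller_risk",
        owner="risk-team",
        description="Illustrative seller risk data quantum",
        metrics={"completeness": 0.99},
    )
    print(summarize(q))
    print(q.control("refresh"))
```

Because `summarize` depends only on the shared interface, a consumer who has learned it once can point the same notebook code at any new data quantum added to the mesh.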
PayPal is using the terms data product and data quantum as two separate things. A data product is simply a product powered by data and analytics. Those have been around for quite some time. But PayPal is looking at data quanta like side cars, used specifically to power more and more of their data products going forward.
PayPal has invested heavily in making data contracts work well, per JGP and earlier PayPal guest Jay Sen. They’ve been building APIs to make it far easier to consume data contracts as people learn more about a data quantum. And as mentioned before, consumers can get observability metrics via API as well. When asked how they are setting their actual contractual terms, JGP said the data producers initially put out some contractual terms and then may adjust those terms as data consumers request. It’s important for data producers to not set their data contract obligations too strictly unless there is a user-based need.
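As a rough illustration of how a data contract makes trust measurable, the sketch below compares observed metrics against contract terms and reports violations. The `ContractTerm` class, the metric names, and the thresholds are assumptions made for this example; the episode does not describe the format of PayPal's contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractTerm:
    """One hypothetical contract obligation: a metric must stay at or above a minimum."""

    metric: str
    minimum: float


def check_contract(terms: list[ContractTerm], observed: dict[str, float]) -> list[str]:
    """Return a human-readable violation for each term the observations fail."""
    violations = []
    for term in terms:
        value = observed.get(term.metric)
        if value is None:
            # A promised metric with no observation cannot be trusted either.
            violations.append(f"{term.metric}: no observation")
        elif value < term.minimum:
            violations.append(f"{term.metric}: {value} < {term.minimum}")
    return violations


if __name__ == "__main__":
    # Loose initial terms; a producer could tighten them later if consumers ask.
    terms = [ContractTerm("completeness", 0.95), ContractTerm("freshness_score", 0.9)]
    observed = {"completeness": 0.97, "freshness_score": 0.8}
    for violation in check_contract(terms, observed):
        print(violation)
```

Starting with loose minimums and raising them only when a consumer needs it matches the advice above: the check stays the same and only the terms change.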
JGP made a good and often unspoken point: the term domain has lost a lot of its meaning. It can mean a very high-level domain like Marketing, Finance, Sales, or HR. Even in software companies, a domain could be Product. But at PayPal, they are being quite strict about what they mean by domain in data mesh: it is a small-scale sub-domain – think two pizza team size – and they are enforcing a strict 1:1 relationship of one data quantum per domain, with no data quanta sourced across domains either. That way, each small domain can focus on creating a great data quantum instead of worrying too much about how big each data quantum should be. The scope should never get that huge at a two pizza team size.
Back to APIs, PayPal is implementing an API-first approach. APIs for the data quantum control plane, observability APIs, and data discovery APIs. It’s the preferred way of working for their initial consumers – data scientists. However, as mentioned previously, JGP does not believe analytical APIs – that is APIs designed to do things like filtering and returning many hundreds to thousands or more results – are really feasible. Definitely not now and possibly ever. So APIs are great for getting at the metadata but not the data for analytical use in his view.
JGP wrapped up by sharing how our tooling must evolve so we don’t have to think about such a hard wall between analytical and operational. There will always be analytical and operational workloads, but our systems can evolve to support both. We aren’t there yet though.
If you are just delivering data, the 1s and 0s, you are not delivering the necessary trust.
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB