In this episode, Scott interviewed Justin Cunningham, who worked as a tech lead and data architect on data platforms at Netflix, Yelp, and Atlassian over the last 8.5 years. In that time, Justin was involved in initiatives to push data ownership to developers/domains.
To sum up a point Justin touched on repeatedly: he recommends creating a pool of low-effort data, which will inherently be of low quality. Use that for initial research into what might be useful. Focus on maximizing accessibility – you can still have governance, using things like join restrictions or giving consumers the ability to self-certify that they are using the data responsibly. Once you find the use cases, then go build the high-quality data mesh data products.
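To make that pattern concrete, here is a minimal sketch of what a self-certified access gate over a low-effort data pool might look like. The catalog entries, dataset name, and function are hypothetical illustrations, not anything Justin described implementing:

```python
# Hypothetical access gate: data in the low-effort pool is tagged low quality,
# and consumers self-certify responsible use instead of waiting for a review.
CATALOG = {
    "clicks_raw": {"quality": "low", "joins_allowed": False},
}

def request_access(dataset: str, self_certified: bool) -> dict:
    meta = CATALOG[dataset]
    if not self_certified:
        raise PermissionError("consumer must self-certify responsible use")
    # Access is granted immediately, but the quality tag travels with the grant.
    return {"dataset": dataset, **meta}

grant = request_access("clicks_raw", self_certified=True)
print(grant["quality"])  # low
```

The point of the sketch is that governance (join restrictions, quality tags) rides along with the data rather than blocking access to it.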
Justin saw a lot of success at Yelp by focusing on data availability – getting data to a place where it could be found and played with was a bigger driver of success than focusing initially on data quality. Once people discovered what data was available and how they might use it, the organization was able to work towards getting that data to an acceptable quality level.
Another point Justin made: figure out what you want to optimize for in general – getting things right upfront, or testing and changing. He believes in optimizing for change. Create an adaptive process and optimize for learning. Keep it simple and focus on value delivery – it will set you up to make more tractable bets.
At Yelp, they were trying to ETL a huge amount of data into their data warehouse to build reports for the C-suite. But they were never going to ingest enough data to meet their goals: it was taking two weeks to create each new set of ETLs – and that was just creation, not maintenance. It was looking like they'd need 5x the number of people.
What Justin found most useful at Yelp was focusing on getting as much "usable" data as possible in an automated way. They achieved this initially through the data mesh anti-pattern of copying directly from the underlying operational data stores and building business logic on top. But getting that data into the hands of the data team meant there could be an initial value assessment – once they proved there could be value in the data, it was much easier to get developers to care about providing clean and reliable data.
Justin mentioned the same thing Wannes Rosiers mentioned in his episode: there are operational and analytical workloads, but there should absolutely not be that separation when it comes to the data itself. Data from operational systems is useful for analytics and vice versa. One thing that really helped developers understand how to share data was thinking of data sets as being similar to public APIs.
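One way to make the "data set as a public API" idea concrete is to publish a versioned schema alongside the data and validate records against it before sharing them, the way an API rejects malformed requests. A minimal sketch, assuming a hypothetical "reviews" data set (the fields and version number are made up for illustration):

```python
from dataclasses import dataclass, fields

# Hypothetical published contract for a "reviews" data set, versioned like an API.
SCHEMA_VERSION = "1.2.0"

@dataclass(frozen=True)
class ReviewRecord:
    review_id: str
    business_id: str
    rating: int
    created_at: str  # ISO-8601 timestamp

def validate(record: dict) -> ReviewRecord:
    """Reject records missing contract fields, the way a public API
    rejects a malformed request; extra fields are simply ignored."""
    expected = [f.name for f in fields(ReviewRecord)]
    missing = set(expected) - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return ReviewRecord(**{k: record[k] for k in expected})

ok = validate({"review_id": "r1", "business_id": "b9",
               "rating": 4, "created_at": "2024-01-01T00:00:00Z"})
print(ok.rating)  # 4
```

Publishing the schema version with the data gives producers the same evolution discipline an API has: breaking changes mean a new version, not a silent mutation.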
At Netflix, there were simply too many bespoke data sets, which made it very hard to manage quality. What worked was a data certification program: creating tooling to prove a data set was complete and accurate. That, plus a significantly increased focus on data set reuse, helped them combat the data sprawl.
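Certification tooling of this kind could look something like automated completeness and accuracy checks that a data set must pass before it earns a "certified" label. A hedged sketch – the thresholds and check definitions below are illustrative, not Netflix's actual tooling:

```python
# Illustrative certification checks: a data set is "certified" only if every
# check passes. Thresholds and check definitions are made up for illustration.

def check_completeness(rows: list, source_count: int) -> bool:
    # Completeness: did we land (nearly) every row the source produced?
    return source_count > 0 and len(rows) / source_count >= 0.999

def check_accuracy(rows: list, required: list) -> bool:
    # Accuracy proxy: every required field is present and non-null in every row.
    return all(all(r.get(f) is not None for f in required) for r in rows)

def certify(rows: list, source_count: int, required: list) -> str:
    passed = (check_completeness(rows, source_count)
              and check_accuracy(rows, required))
    return "certified" if passed else "uncertified"

rows = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
print(certify(rows, source_count=2, required=["id", "value"]))  # certified
```

The value is less in any single check than in making "certified" a machine-verifiable label consumers can filter on, which also nudges them towards reusing certified sets instead of spinning up bespoke ones.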
Back to data accessibility and availability versus quality: Justin believes data analysts and data scientists initially care far more about getting access to data, as you can work to improve the data quality later – especially if there is a clear owner. I discussed this in a mesh musings episode about speculative data products, but a key hack for them was being able to mark data as low quality.
On driving buy-in from data-producing teams, Justin again pointed to proving there was value in the data before the producers were bought in. Asking them to serve their data upfront, without a clear, specific use case, was very tough – the return on investment (ROI) was very squishy. So they got low-quality data out initially, then came back to producing teams to improve quality and reliability once they had proved certain data was valuable. This is somewhat similar to the emerging data mesh pattern of creating your data products for a consumer-focused use case. It might be a source-aligned data product, but it should still initially serve a specific purpose with a targeted outcome. It can grow from there.
Justin also shared his thoughts on how the way we do data lineage is broken – we should capture lineage declaratively instead of just as a reference. The declaration should flow through both the schema registry and the data catalog: what is the data movement supposed to be? This would let us much more easily test data flows and alert downstream users of upcoming changes.
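Declarative lineage might mean stating the intended data movement up front and then checking the observed flow against it, rather than only recording lineage after the fact. A rough sketch of that idea – the declaration format, edge names, and diff function here are all hypothetical:

```python
# Declared lineage: what the data movement is *supposed* to be,
# expressed as (source, destination) edges. Names are hypothetical.
DECLARED = {
    ("orders_db", "orders_staging"),
    ("orders_staging", "orders_warehouse"),
}

def diff_lineage(declared: set, observed: set) -> dict:
    """Compare intent against reality, so flows can be tested and
    downstream users warned about changes before they happen."""
    return {
        "missing": declared - observed,     # declared edges we never saw run
        "unexpected": observed - declared,  # flows nobody declared
    }

observed = {("orders_db", "orders_staging")}
result = diff_lineage(DECLARED, observed)
print(sorted(result["missing"]))  # [('orders_staging', 'orders_warehouse')]
```

With a declaration like this living in the schema registry and catalog, a broken or changed flow shows up as a diff against intent, instead of being discovered by the downstream consumer.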
Justin’s LinkedIn: https://www.linkedin.com/in/justincinmd/