#136 Building Your Data Platform for Change and Reusability via Modularity – Interview w/ Alireza Sohofi

Data Mesh Radio Patreon – get access to interviews well before they are released

Episode list and links to all available episode transcripts (most interviews from #32 on) here

Provided as a free resource by DataStax AstraDB; George Trujillo’s contact info: email (george.trujillo@datastax.com) and LinkedIn

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Squirrel (OSS data platform) GitHub: https://github.com/merantix-momentum/squirrel-core

Alireza’s LinkedIn: https://www.linkedin.com/in/alireza-sohofi/

In this episode, Scott interviewed Alireza Sohofi, a Data Scientist focused on building the data platform at Merantix Momentum.

Some key takeaways/thoughts from Alireza’s point of view – some written directly by Alireza himself:

  1. Where possible, look to build your platform in a loosely coupled way. That makes it easier to extend and evolve, and domains can replace pieces, mix and match components, or even extend the functionality when it makes sense.
  2. It’s easy to fall into the trap of building a platform that is hard to evolve and support. Be very conscious about what you want to include – and not include – in your platform. Don’t try to solve every challenge with a point solution.
  3. To effectively share data – and the information it represents – software engineers / domains need to really understand their own data, including data modeling. That can’t be easily outsourced. A platform team’s job is to build the tooling so those domains only need to deal with the data, not the data engineering.
  4. If you want a scalable platform – in many senses of the word scalable – your platform should be relatively generic. It must also be easy to extend and augment. Focus on providing flexibility and ease of customization. One size definitely won’t fit all.
  5. Packages and templates are both useful but templates are typically more user friendly and easier to customize – start with templates when possible.
  6. If there is a need to customize or extend a package or template, it’s better to first build the change within a domain (with the help of the platform team if necessary). The generalized version of the new feature is then contributed to the platform. This leads to tighter domain-platform integration, a more robust first release of new features in the platform, and knowledge sharing, and it avoids the bottlenecks that may arise when relying only on the central platform team.
  7. Platform teams need to A) dog food the platform – you will learn far more by using it; B) provide good methods of communication for domains to give feedback and requests; and C) find better ways to exchange context with your domains regularly, e.g. pair work and scheduled informal chats.
  8. The platform consists of several tools that should not only work well together, but should also work well with a wider ecosystem of open source tools. Solutions that try to offer end-to-end coverage usually fall short when it comes to flexibility and to changing requirements and business environments. Composable components that can work together are the way to go.
  9. Tools should be opinionated, i.e. encode the best practices, but at the same time hackable to the very core. A layered design, where domain teams can choose the abstraction level that is appropriate for them, is a good choice.
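
As a rough illustration of the layered design in takeaway 9, here is a minimal Python sketch. All names (`iter_lines`, `parse_records`, `load_dataset`) are hypothetical and are not taken from Squirrel; the JSON-lines default is simply an assumed "opinionated" choice. The top layer encodes the default, while the lower layers stay available to teams that want to swap a piece.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


# Layer 1 (lowest): a raw building block with no opinions about format.
def iter_lines(text: str) -> Iterator[str]:
    """Yield the non-empty lines of a text blob."""
    for line in text.splitlines():
        if line.strip():
            yield line


# Layer 2: a composable step whose parser can be replaced by the caller.
def parse_records(
    lines: Iterable[str],
    parser: Callable[[str], Record] = json.loads,
) -> Iterator[Record]:
    """Turn lines into records; defaults to one JSON object per line."""
    return (parser(line) for line in lines)


# Layer 3 (highest): the opinionated one-liner for the common case.
def load_dataset(text: str) -> List[Record]:
    """Load JSON-lines text using the default choices of the layers below."""
    return list(parse_records(iter_lines(text)))


# Most teams stay at the top layer.
default_records = load_dataset('{"x": 1}\n\n{"x": 2}\n')

# A team with a different format drops down one layer and swaps the parser.
custom_records = list(
    parse_records(iter_lines("a=1\nb=2"), parser=lambda l: dict([l.split("=")]))
)
```

The point of the sketch is only that the default path and the customized path share the same lower layers, so customizing does not mean leaving the platform.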

Alireza started by sharing a bit about how Merantix works with clients – often, their clients are not that deep into machine learning and want to outsource that. So Alireza and team are essentially building a data platform that is use-case agnostic across many different data maturity levels and modalities, both on the production and consumption sides, that is scalable and cost-efficient. Sound like a familiar challenge? While their platform is specifically for machine learning, it’s a good approach to dig into, partly because they recently open sourced their platform so others can dig deep into the implementation aspects.

The most difficult challenge of the platform and of working with customers, per Alireza, is data ingestion. Clients use a vast array of source systems and formats, so the team had to focus on each customer’s specific challenge. Unfortunately, this means there is a custom-built driver for ingesting each use case/dataset. However, they have created a number of templates for ingestion and transformation that make the custom development – whether initial development or incorporating changes – relatively lightweight; the customization – and subsequent coupling – is typically limited to the business logic of the use case. So, while it’s not ideal, it’s a scalable approach that is serving them well.
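
To make the idea of an ingestion template concrete, here is a minimal sketch in Python. It is not Squirrel's actual driver API; `IngestionTemplate`, `CsvOrdersDriver`, and the field names are hypothetical and exist only for illustration. The shared flow (read, transform, validate) lives in the template, and the per-customer driver supplies only the parts tied to its source format and business logic.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


class IngestionTemplate(ABC):
    """Hypothetical template: the steps every ingestion driver shares."""

    def run(self, source: str) -> Iterator[Record]:
        """Fixed flow: read raw rows, transform them, drop invalid records."""
        for raw in self.read(source):
            record = self.transform(raw)
            if self.is_valid(record):
                yield record

    @abstractmethod
    def read(self, source: str) -> Iterable[Record]:
        """Source-specific: how to get raw rows out of the customer system."""

    def transform(self, raw: Record) -> Record:
        """Use-case-specific business logic; the default passes rows through."""
        return raw

    def is_valid(self, record: Record) -> bool:
        """Default rule: reject records with missing or empty values."""
        return all(value not in (None, "") for value in record.values())


class CsvOrdersDriver(IngestionTemplate):
    """Custom driver for one (made-up) customer dataset delivered as CSV."""

    def read(self, source: str) -> Iterable[Record]:
        return csv.DictReader(io.StringIO(source))

    def transform(self, raw: Record) -> Record:
        # The only coupling to the use case: renaming and typing two fields.
        return {"order_id": raw["id"].strip(), "total_eur": float(raw["total"])}


# The second row has no id, so the template's validation step drops it.
records = list(CsvOrdersDriver().run("id,total\nA1,10.5\n,3.0\n"))
```

In this sketch a change in the customer's format touches only `read`, and a change in business rules touches only `transform`, which is roughly the kind of lightweight customization described above.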

While customers are outsourcing their machine learning, that doesn’t mean they are not data literate, according to Alireza. To be able to leverage the Squirrel platform / service and get good information out for their applications, those domain teams still need to really understand their own data. Unfortunately, Alireza does not have a silver bullet for training generalist software developers to really handle their data – they must be able to model the data properly themselves. The platform team’s job is to make tools so domains can deal with the data but not the data engineering; a centralized team handling the data modeling can quickly become a bottleneck.

When talking about scalability – both pure throughput scale and scaling across many use cases – Alireza believes you must build a platform that is generic enough that it isn’t tied to use cases. It must be relatively easy to extend or augment the platform as well. You need to provide people the flexibility and ease of customization if they want to own the complexity themselves. Easier said than done but still important to repeat. Blanca and Pablo at Plain Concepts said similar things in their episode.

According to Alireza, it’s important to think about templates versus packages. Use templates where possible for simpler things because packages, by default, have some choices embedded in them. Templates are starting points that more often point directly to the choices people can make, with defaults, whereas with packages people have to discover the choices that were made for them. But both can be useful. And when a template or package needs to be extended, the work should be done by the domain team – otherwise you have centralized work that can become a bottleneck. But do have the domains contribute those extensions back to the platform as well.

Because Squirrel is open source, Alireza and team had to really think about how to make things loosely coupled, even within the platform. So the drivers, computation, storage, etc. can all be extended or even swapped out. This means each domain team can replace things if they truly have reason to and still get good leverage from what is already built.
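
One common way to get this kind of swappability is to have platform code depend on a small interface rather than on a concrete component. The sketch below assumes that approach; it does not show how Squirrel itself is structured, and `Store`, `InMemoryStore`, `CompressedStore`, `register_store`, and `archive` are all hypothetical names.

```python
import zlib
from typing import Callable, Dict, Protocol


class Store(Protocol):
    """Hypothetical storage interface; platform code depends only on this."""

    def write(self, key: str, value: bytes) -> None: ...

    def read(self, key: str) -> bytes: ...


class InMemoryStore:
    """The platform default in this sketch: keeps raw bytes in a dict."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def write(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def read(self, key: str) -> bytes:
        return self._data[key]


class CompressedStore:
    """A domain-supplied replacement: same interface, zlib-compressed values."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def write(self, key: str, value: bytes) -> None:
        self._data[key] = zlib.compress(value)

    def read(self, key: str) -> bytes:
        return zlib.decompress(self._data[key])


# Registry of available components, keyed by name.
STORES: Dict[str, Callable[[], Store]] = {"memory": InMemoryStore}


def register_store(name: str, factory: Callable[[], Store]) -> None:
    """Let a domain team plug in its own component without forking the platform."""
    STORES[name] = factory


def archive(store: Store, items: Dict[str, bytes]) -> int:
    """Platform code: writes every item and returns how many were written."""
    for key, value in items.items():
        store.write(key, value)
    return len(items)


# A domain swaps the storage component; archive() is reused unchanged.
register_store("compressed", CompressedStore)
store = STORES["compressed"]()
archive(store, {"greeting": b"hello"})
print(store.read("greeting"))
```

Because `archive` never names a concrete store, replacing storage is a registration call rather than a change to platform code, which is the leverage the paragraph above describes.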

Alireza had some direct advice: when developing your platform, be very conscious about what you want to include. You don’t want to do one-offs. What can you reliably abstract because it’s a repetitive need? The first time you see a new pattern, don’t rush to build support for it into the platform. Otherwise, it’s very easy to build a platform that is hard to evolve and support. Analyze a diverse range of challenges to find your patterns and then abstract.

According to Alireza, it’s crucial for platform teams to develop good communication and ways of working with their domain teams – or whoever their users are – and really learn how those users work with the platform. You should provide good ways for them to give feedback and make requests but also more informal ways of teaming up and exchanging context, e.g. pair programming and scheduled informal chats. That way, every communication isn’t an ask – think water cooler chat. Your platform team should be users of the platform too – the best information and feedback often come from being a user yourself.

Alireza wrapped up with a major challenge that has yet to be well addressed: how can we embed the semantics into the data? We need to figure out how to “solve ontologies that don’t align across domains”.

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf

Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB
