#90 Sharing Data Reliably in Hyperscale Mode – Interview w/ Björn Smedman

Data Mesh Radio Patreon – get access to interviews well before they are released

Episode list and links to all available episode transcripts (most interviews from #32 on) here

Provided as a free resource by DataStax AstraDB

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here

In this episode, Scott interviewed Björn Smedman, Engineering Manager at Communication Platform-as-a-Service (CPaaS) company Sinch.

Some interesting thoughts or takeaways:

  1. A good indicator for when decentralizing your data team might make sense is the cognitive load of a centralized data team. How many systems – including a measure of how complex – are they managing? How much of their time is spent in meetings, especially trying to understand context/requests? Is there starting to be combative prioritization from multiple domains?
  2. It can be very beneficial and scalable to apply data mesh principles to non-analytical use cases, especially sharing data for application purposes.
  3. It is still often difficult to prioritize creating a data product for machine learning without knowing the business value of the ML model. But the ML team needs the data first before they can figure out the business value of the ML model. You have to make speculative bets.
  4. If you see the data platform team start to dig into the semantics of a use case, that’s a red flag that people are trying to leverage them as a data team. And while you want a centralized data platform team, you probably don’t want them to become a centralized data team.

Since December 2020, Sinch has raised nearly $2 billion USD. With this funding, it has made a number of sizeable acquisitions, and the company grew from 500 employees to over 3,000 in about a year. This has led to some interesting challenges in sharing data in a hyper-scaling environment.

Per Björn, data is a key part of Sinch’s plans for growth. Sinch’s operational systems are often highly transactional: some product lines can process tens of thousands of monetary transactions a second. As a result, data that other companies would typically share on the operational plane is shared on the data plane at Sinch, so that the operational data stores don’t have to deal with billions of events. That makes the data challenges even more complex than for most organizations. Then add in the regulatory requirements of telecom.

Björn helped lead the move to decentralize the data team. When Björn joined, the central data organization consisted of 4 teams and 25 people. The data function was centralized and was becoming a bottleneck, even for the legacy business. Once the company had acquired a number of other sizeable companies, that central data team setup clearly wasn’t going to scale. The company reorganized around business units and started to build data and analytics teams inside each BU.

For Björn, who started in December 2021 just as Sinch started acquiring new businesses, the central data team was clearly not going to be able to meet the needs of this new organization that was about 8x larger than a year earlier. There was too much cognitive load on the team, especially trying to understand the product lines of five distinct business units, many of which were entirely new to the company.

Björn gave a few good indicators of what to look for when considering whether you should decentralize your data team. A big one is team cognitive load. Cognitive overload can take many forms: how many systems – especially complex ones – are your teams managing? Do they really deeply understand the systems? Do they need to deeply understand them to work with them? How many competencies do the team and each individual on the team need to do their day-to-day work? What percentage of time is spent in meetings, especially follow-up meetings? One indicator that wasn’t mentioned is lengthening request turnaround time.

Sinch had a strong signal that they should move towards a decentralized data team approach, per Björn. The business announced in early February 2022 that it was organizing itself into five distinct business units. The business units started to build data and analytics capabilities internally, but there would be a very distinct need for teams to share data with each other, so a common self-serve data platform was necessary. Without a common platform, each business unit would need to build custom integrations with the other BUs, which could mean four custom integration points per BU. And each one probably would not be bi-directional, so they might need eight per BU: four for sharing data out and four for ingesting data. Obviously, that would be a pretty bad situation.
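To make the integration arithmetic concrete, here is a minimal Python sketch. It is not from the episode: the function names are hypothetical, and the platform model (each BU builds one publish and one consume connection to the shared platform) is an assumption for illustration. It reproduces the four and eight integrations per BU mentioned above and compares them with the assumed platform model.

```python
def point_to_point_integrations(units: int, bidirectional: bool = False) -> tuple[int, int]:
    """Return (per_unit, total) custom integrations when every business
    unit integrates directly with every other business unit."""
    peers = units - 1
    if bidirectional:
        per_unit = peers              # one shared link per peer
        total = units * peers // 2    # each link is shared by two units
    else:
        per_unit = 2 * peers          # one outbound plus one inbound per peer
        total = units * peers         # each directed link counted once
    return per_unit, total


def platform_integrations(units: int) -> tuple[int, int]:
    """Return (per_unit, total) connections under the ASSUMED platform model:
    each unit builds one publish and one consume connection to the platform."""
    per_unit = 2
    return per_unit, units * per_unit


if __name__ == "__main__":
    for n in (5, 10):
        print(n, "units, point-to-point (one-way):", point_to_point_integrations(n))
        print(n, "units, shared platform (assumed):", platform_integrations(n))
```

Under these assumptions, the per-BU count for direct integrations grows with every business unit added, while the per-BU count for the platform stays constant.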

Regarding machine learning, Björn mentioned how difficult it can be to prioritize, as there is a chicken-and-egg issue: before data-producing teams are willing to do the work to create a data product that will feed an ML model, they want to know how valuable the model will be or what it will produce. But the ML teams need access to the data first to determine how valuable it is. Thus there is a need for speculative bets, and those are hard to prioritize.

Björn worked with the central data platform team to build out a common data platform with a data lake, data warehouse, and streaming capabilities. They are using as many open standards as possible, as this prevents lock-in and often means more integrations are available. The goal of the platform is to make it easy for every business unit to do the necessary data engineering work. But Björn mentioned it is important to prevent the data platform team from becoming another data team – if you see your data platform team start to dig into the semantics of a use case, that’s a red flag.

Björn’s LinkedIn: https://www.linkedin.com/in/bjornsmedman/

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used in this episode was found on Pixabay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, and/or nevesf

Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB
