#145 From Failwhale to Massive Scale and Beyond: Learnings on Fixing Data Team Bottlenecks – Interview w/ Dmitriy Ryaboy

Data Mesh Radio Patreon – get access to interviews well before they are released

Episode list and links to all available episode transcripts (most interviews from #32 on) here

Provided as a free resource by DataStax AstraDB; George Trujillo’s contact info: email (george.trujillo@datastax.com) and LinkedIn

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Dmitriy’s Twitter: @squarecog / https://twitter.com/squarecog

The Missing README book: https://themissingreadme.com/

Building Evolutionary Architectures book: https://www.oreilly.com/library/view/building-evolutionary-architectures/9781491986356/

In this episode, Scott interviewed Dmitriy Ryaboy, CTO at Zymergen and co-author of the book The Missing README.

Some key takeaways/thoughts from Dmitriy’s point of view:

  1. Organizational design and change management is “like a knife fight” – you are going to get cut but if you do it well, you can choose where you get cut. There is no perfect org, and there will be pain somewhere but you can influence what will hurt, and make it not life-threatening.
  2. There is too much separation between data engineering and software engineering. Data engineering is just a type of software engineering with a focus on dealing with data. We have to stop treating them like completely different practices.
  3. When communicating internally, always focus on telling people the why before you get to the how. If they don’t get why you are doing it, they are far less likely to be motivated to address the issue or opportunity. This applies to getting teams to take ownership of the data they produce, but also to everything else. There is often a rush to use tech over talk. Conversation is a powerful tool and will set you up so your tools can help you address the challenges once people are aligned. Paving over challenges with tech will not go well.
  4. Build your data platform such that the central data platform team is unnecessary in conversations between data producers and data consumers. That way, your team won’t become a bottleneck. Try to reduce cognitive load on the users – they shouldn’t have to deeply understand the platform and its inner workings just to leverage it.
  5. “Data debt is debt forever.” You can certainly ‘pay it down’ but data debt typically has a much longer life than even the initial source system that supplied the data. Take it on consciously.
  6. Looking to hire or grow full-stack engineers for an ever-growing definition of the stack (backend, frontend, security, ops, QA, UX, data…) is probably not a great idea – we can’t keep piling new domains on people and expect them to be good at all of them. Instead, look to build full-stack teams, and tools that look and feel sufficiently similar that, e.g., “data engineering” becomes very close to “backend engineering”.
  7. Look for needless delays in work as a sign your organization isn’t well aligned with what you are trying to accomplish. The cost of coordination should not be a huge bottleneck, especially at a smaller size organization.
  8. Any agreement like a data contract or API needs to be agreed to by both parties. Consumers can’t just expect things to not change, especially if they don’t let producers know what, why, and how they are consuming their data.

Dmitriy shared some of his experience leading the team building and scaling Twitter’s initial data platform. They had to transform from a relatively simple model that just wasn’t scaling to one that was far more scalable but obviously more complex. For example, in their MySQL setup, they couldn’t add any more columns, so they treated certain columns almost like a CSV inside a single cell. You can imagine how difficult that was for analytics… They also had many hidden dependencies, so when changes were made, they would break all kinds of other systems. Part of his role was helping the team untangle that mess.

One thing they got right, according to Dmitriy, was moving to structured logging so there were at least standard columns that made it possible to combine data across applications. Dmitriy didn’t have specific advice on exactly how to standardize but rather on the discussions around standardizing – always share what you are trying to accomplish, and focus on the why much more than the how. The how follows from decisions made once you’ve understood the why. This advice came up throughout the conversation.
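To make the structured-logging idea concrete, here is a minimal Python sketch of the general technique – every application emits log lines as JSON with a fixed set of columns, so downstream analytics can join across services. The field names (`service`, `event`, `user_id`) are illustrative assumptions; the episode doesn’t say which columns Twitter actually standardized on.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object with a fixed set of columns.

    The column names here are hypothetical, not Twitter's actual standard.
    """
    def format(self, record):
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "service": getattr(record, "service", "unknown"),
            "event": record.getMessage(),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every application logs the same shape, so a warehouse can combine the
# output of many services on the shared columns.
logger.info("tweet_created", extra={"service": "timeline", "user_id": 42})
```

The point isn’t the specific fields – it’s that once every producer agrees on a shared envelope, combining data across applications stops requiring bespoke parsing per source.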

Twitter had a small central data team that had to either try to scale massively and still not be able to know all the business context necessary or – more sanely – build their data platform to keep themselves from being a bottleneck. They chose the second option. So when building their data platform, there was a focus on building tools to make the central data team unnecessary to most data conversations, getting the data producers and consumers to talk to each other instead of playing interpreter and doing the work themselves.

So, to sum up some advice from what Dmitriy learned from Twitter: 1) go for structured logging, 2) make dependencies clear and known, and also limit where possible, and 3) build your data platform tools so your data team doesn’t have to be part of the conversations between data producers and consumers to enable scaling.

According to Dmitriy, when focusing on change management, again you have to focus on explaining to people the why of what you are doing. When you give people the reason, the pain you are trying to solve, most people then just want to know what you need from them in his experience. And be prepared to reiterate the why a lot. It’s better to state it too often than not often enough.

When asked what we can take from software engineering to apply to data mesh and data engineering, Dmitriy is very passionate that data engineering and software engineering really shouldn’t be overly separate. Data engineering is just a type of software engineering and we need to create the tooling to make software engineers capable of doing data engineering tasks. We can take a lot of learnings from software engineering and apply them to data but it shouldn’t be seen as very different.

When thinking about data mesh especially, we are asking domains to pick up additional responsibilities. Dmitriy believes – and Scott has said it multiple times too – that you can’t give a team more responsibility without giving them more resources. That can be people – especially people who bring new capabilities – or something like a platform that reduces the cognitive load and workload of the new – and existing – responsibilities. Give people a platform they can leverage without having to be experts in what it’s doing under the hood. And full-stack engineers, especially if we add dealing with data, are just going to be overloaded. Look instead to full-stack teams where people within the team have some specialty areas.

Dmitriy shared some thoughts about how Zymergen’s data capabilities and team have evolved. It started as a central team with ownership over everything, which hampered the teams from talking to each other as much as needed. That introduced needless delays because the coordination cost was so high. So they reorganized and moved a number of the data engineers into the domains. Those data engineers were better able to leverage the data platform because they had built it, and they could teach other domain team members how to leverage it better too.

A lot of issues can be handled with just a conversation rather than a technical solution, according to Dmitriy. This might seem obvious, but many people try to avoid going down that road. Technology is obviously also important – think about when to use tech and when to use talk. Your team should build out the platform to make it easy to have better conversations because the tech is handled.

“Data debt is debt forever,” according to Dmitriy. Meaning that data has a habit of sticking around for a very long time, even well past when the initial source system has been replaced. You can pay down that data debt, but you have to address it intentionally – a simple refactor usually won’t do it. Right now it’s far easier to update an API; we need to get to the same place with data, where things like versioning aren’t such a manual, pain-in-the-butt task.
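One way to think about making data versioning less painful – tying back to takeaway #8 about contracts both parties agree to – is to treat a dataset’s schema as an explicit, versioned contract. The sketch below is a toy illustration of that idea (not anything described in the episode): adding a column is backward compatible, while dropping one is a breaking change that forces a negotiated new version.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    """A hypothetical, minimal data contract: a named, versioned set of
    columns the producer promises to keep stable for consumers."""
    name: str
    version: int
    fields: frozenset  # column names guaranteed to exist in this version

    def is_backward_compatible(self, new_fields) -> bool:
        # Adding columns is safe for existing consumers;
        # dropping or renaming promised columns is not.
        return self.fields <= frozenset(new_fields)

orders_v1 = DataContract("orders", 1, frozenset({"order_id", "amount", "created_at"}))

# Producer wants to add a "currency" column: no contract break.
assert orders_v1.is_backward_compatible(
    {"order_id", "amount", "created_at", "currency"})

# Producer wants to drop "amount": breaking change – talk to consumers
# and publish a new major version rather than mutating v1 in place.
assert not orders_v1.is_backward_compatible({"order_id", "created_at"})
```

The mechanism is deliberately simple; the point is that the compatibility check – and the conversation it triggers – happens before the change ships, not after downstream pipelines break.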

Dmitriy ended with a few things: 1) indirection is your friend but don’t use indirection until you need it; 2) a good place to learn about building your architecture to be able to evolve is the book Building Evolutionary Architectures (kind of well titled, huh?); and 3) organizational design and change management is like a knife fight – you are going to get cut but if you do it well, if you are a pro, you will choose where you get cut.

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf

Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under “add payment”): AstraDB
