Please Rate and Review us on your podcast app of choice!
Get involved with Data Mesh Understanding’s free community roundtables and introductions: https://landing.datameshunderstanding.com/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Madhav’s LinkedIn: https://www.linkedin.com/in/madhavsrinath/
In this episode, Scott interviewed Madhav Srinath, CEO at Nexusleap.
Overall, we are super early in the Generative AI cycle and hype is huge. This discussion is one of early impressions, not fully formed answers. It’s far too early for that.
Also, FYI, there were some technical difficulties in this episode where the recording kept shutting down and had to be restarted. So thanks to Madhav for sticking through and hopefully it isn’t too noticeable. Generative AI will mostly be shortened to GenAI throughout these notes. LLM stands for large language model, the type of model that powers GenAI.
Some key takeaways/thoughts from Madhav’s point of view:
- ?Controversial?: An emerging best practice seems to be layering LLMs – you ask one model the complicated questions and a second model, trained specifically for the job, vets the answers for correctness and governance concerns.
- The cost of running many models in production is typically quite low, at least infrastructure-wise. Instead of an always-on architecture, most organizations are using a serverless architecture – or calling APIs from providers that host the models – so they essentially only pay a few cents per query.
- ?Controversial?: Use GenAI as a “scalpel, not a broadsword”. Many are trying to use it in overly broad ways and getting poor results.
- The ability to take a mountain of data and get something out of it in a structured way isn’t a new concept. We’ve been trying to do that with data mining for years. It’s just that it is finally maturing into something more widely useful/usable with GenAI.
- People are generally still only trying to solve pretty shallow problems with GenAI, e.g. writing an article. Scott note: That’s probably good because most aren’t ready to do the work necessary to make GenAI usable for much deeper use cases.
- ?Controversial?: We may need human handlers for LLMs to do GenAI well. If we aren’t sure of the quality of the answers and we need high quality answers, there need to be guardrails and probably a human in the loop. Scott note: whether this proves better than simply having the human do the analysis remains to be seen.
- If you have the right guardrails in place, there isn’t really any harm in starting to work with GenAI. But, you have to understand it’s early days and there are definite sharp corners for you to cut yourself on – that human in the loop is important for a myriad of reasons and you have to be careful around things like privacy.
- ?Controversial?: It’s better to start at domain level questions and focus on domain-specific problems right now with GenAI. That way, you can more easily control the inputs you feed it and it can help with more specific, targeted questions.
- Look at machine learning use cases. Creating a narrow focus for each model has proven to be a far better strategy than creating one overarching model to try to solve many problems. Why not try the same with LLMs, creating models specific to topics?
- Relatedly, you can add more focus areas to an LLM as you train it. Trying to get it to understand everything at the start will likely overwhelm the LLM to a point where the quality of your answers will fall.
- LLMs can be used to infer relationships between domains or data products. You still have to point them at high quality data and you need someone to check their work but they could be used to more easily find out where data products already are or should be interoperable.
- A potentially good use case is to have one GenAI model focused on finding those potential relationships and then use a second GenAI model that’s more targeted at finding information based on those relationships.
- ?Controversial?: You shouldn’t be training GenAI models from scratch. Start with one of the many open source models available and train it on your specifics. Leverage work that others have already done for you. You can train them by having your business specialists share information with the LLMs.
- Since everyone essentially has access to the same models, companies will differentiate on the information – and especially the quality of the information – they feed their LLMs.
- ?Controversial?: GenAI may be more useful for data producers than data consumers. They still need to focus on the fundamentals but GenAI can really make them more productive. Scott note: imagine being able to get 5 sample data models or be able to ask an LLM to figure out the best way to make your data fit with other data products to be interoperable.
For Madhav, a lot of what Generative AI has become is the concept of data mining with a personable interface. We’ve been trying to create a way to dig into data that is unstructured and get some insights or information – something that is structured – for a while. The concept isn’t new but we’ve finally found something that might actually be able to do it well and make the outputs easy to consume.
Right now, in Madhav’s view most of the emerging GenAI use cases have been pretty shallow, such as to help write an article. It probably can get far deeper but it’s quite early days. However, we probably need to put a human in the loop to actually make sure the answers LLMs are giving are correct. That might be the best option in his view, something like a guide plus guardrails driven by a human to make sure these LLMs don’t hallucinate. In a way, this isn’t all that different from other machine learning work – black boxes tend to have unexpected consequences.
Madhav’s view is that it’s totally okay to start working with GenAI at your organization as long as you understand GenAI/LLMs have quality issues right now and are really only at the MVP stage in many senses – especially if you are using them internally on your own data. There will probably be good ways to put a wrapper around them to prevent improper data usage/leakage and prevent hallucinations as well.
Starting with domain-specific questions/problems is where Madhav thinks people should focus their GenAI work. If you feed an LLM a ton of information from many sources, you can’t really be sure of the logic it uses to generate answers. Keeping the inputs tighter and asking about more specific business areas makes that easier to control. It does mean having many LLMs across the organization, each focused on a specific topic. Once a model is performing well on one topic, you can then attempt to add additional focus areas to it.
While LLMs aren’t magic, Madhav is seeing an emerging use case where people point an LLM at maybe two data products and ask it to infer relationships between them. It might discover something people haven’t thought of before. You still need a human in the loop or you end up with something like the Pastafarian ‘belief’ that global temperature rise since the 1700s is caused by the decreasing number of pirates globally – correlation doesn’t equal causation. It’s not a magic wand but it could help people find more places where data is already interoperable or should be. Once those relationships are discovered, he’s then seeing a differently tuned GenAI model used to actually infer some information based on them. Again, specialized models.
Madhav doesn’t believe most organizations should be training their GenAI models from scratch. Instead, go and find the open source models and add your necessary information to the training – basically, why start at zero to train it to one when you can start at 0.7? Leverage the work others are doing so you don’t need super expensive LLM training focused engineers. By starting with an existing base model, you can tune it based on your own answers instead of trying to feed it very specific data and supervise the base-level initial training. Leverage your business subject matter experts and get them to share their tribal knowledge with the LLMs as well.
Circling back to the idea of layered LLMs, Madhav talked about how some organizations have one model specifically tuned to answer questions about data and a secondary LLM focused on checking the first one’s answers for sanity/correctness as well as governance, e.g. security and privacy concerns. Again, that separation of the work. And it’s not nearly as expensive as you might expect when it’s all done in a serverless way – if your LLMs are becoming cost prohibitive, you are probably not running them in a cost effective way.
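To make the layered idea a bit more concrete, here is a minimal Python sketch. It is not something Madhav shared in the episode; everything in it is an assumption for illustration. The `Model` type, the `layered_answer` function, the APPROVE/REJECT reply convention and the two stub models are all hypothetical, and the stubs return canned text instead of calling a real LLM. In practice each model would be a call to a hosted API or a serverless endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical type: any function that takes a prompt and returns text.
Model = Callable[[str], str]


@dataclass
class Review:
    approved: bool
    reason: str


def layered_answer(
    question: str, answer_model: Model, review_model: Model
) -> Tuple[Optional[str], Review]:
    """Ask the first model, then have the second model vet its draft."""
    draft = answer_model(question)
    # Assumed convention: the reviewer replies "APPROVE" or "REJECT: <reason>".
    verdict = review_model(
        "Check the draft for correctness and governance issues. "
        "Reply APPROVE or REJECT: <reason>.\n"
        f"QUESTION: {question}\n"
        f"DRAFT: {draft}"
    ).strip()
    approved = verdict.upper().startswith("APPROVE")
    # A rejected draft is withheld; a human could pick it up from here.
    return (draft if approved else None), Review(approved, verdict)


# Stub models with made-up, canned behavior so the sketch runs offline.
def stub_answer_model(prompt: str) -> str:
    return "The orders data product has 3 tables. Owner: jane@example.com"


def stub_review_model(prompt: str) -> str:
    draft = prompt.split("DRAFT:", 1)[1]
    if "@" in draft:  # crude stand-in for a real privacy check
        return "REJECT: draft appears to contain an email address"
    return "APPROVE"


if __name__ == "__main__":
    answer, review = layered_answer(
        "Describe the orders data product.", stub_answer_model, stub_review_model
    )
    print(answer, review)
```

The point of the separation is that the reviewing model can be tuned, prompted and swapped independently of the answering model, and a rejected draft gives a natural place to route the question to a human.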
Madhav has a few strong feelings around what organizations should be doing with LLMs. The first is that most should not be trying to train their own LLMs – the time and cost just don’t make that much sense when open source models are advancing at frankly remarkable speeds. The second is that GenAI is probably more helpful for data producers than data consumers. It really can make producers far more productive, e.g. letting them generate insights on their own data or helping them to find good interoperability points with their data and other data products.
Learn more about Data Mesh Understanding: https://datameshunderstanding.com/about
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/