Conversation with Brian Raymond, CEO of Unstructured on Streamlining Enterprise Data for LLMs and GenAI
Advancing GenAI in the Enterprise: Traditional NLP vs LLMs, RAG vs Large Context Windows, Agentic LLMs, LLM Stack, LLM Tooling, Enterprise vs Government Requirements, E/Acc vs EA.
In this episode of the Uncharted Algorithm, I sit down with Brian Raymond, Founder & CEO of Unstructured. Unstructured is a Silicon Valley startup that provides tools to prepare enterprise data for use with Large Language Models (LLMs), aiming to streamline the process of transforming raw data into a format suitable for AI applications.
We discuss his unique journey from the intelligence community to the forefront of AI technology in Silicon Valley. Brian shares his experiences from his time at the CIA and the White House, where he tackled significant national security challenges, to his transition into the tech world with Primer AI and eventually founding Unstructured.
We dive into the evolution of natural language processing and the shift towards generative AI (GenAI), exploring the challenges and opportunities this technology presents. Brian elaborates on the pivotal moments and decisions that led to his focus on improving data accessibility for AI models, a critical but often overlooked aspect of AI development.
Throughout our conversation, Brian offers insights into the current state of AI, particularly in how companies can effectively pivot to generative AI to enhance their operations. He discusses the importance of data transformation and integration in leveraging large language models (LLMs) and the role of retrieval-augmented generation (RAG) in production environments.
Additionally, Brian touches on the broader implications of AI in business and society, including the potential for AI to significantly alter various industries. He provides advice for AI startups and entrepreneurs, emphasising the need to address real-world problems and the importance of data handling in building effective AI solutions.
Introduction
Aditya Kaul: Hi Brian, welcome to the Uncharted Algorithm. Thanks for joining me today.
Brian Raymond: Thanks for having me. Excited to be here.
Aditya Kaul: Wonderful. So I wanted to start off with your background, which is very different from most startup CEOs in Silicon Valley: it includes service in the U.S. intelligence community in the Middle East, a stint at the CIA, and time at the White House, I believe, during the Obama administration. So it's very unusual. If you could walk us through how you ended up in Silicon Valley, that would be great.
Brian Raymond: Yeah, sure. So I've had a nonlinear career path. Let's call it that. I left a PhD program to join the CIA. This was more than 15 years ago at this point. And I joined as an intelligence officer. The opportunity to write classified assessments for the president every day was too good to pass up. And so I spent a number of years there, as an intel analyst both at Langley as well as in the Middle East. Then I moved from CIA down to the White House to run our foreign policy for Iraq and ISIS. This is during the time when ISIS was on the ascent in Syria and Iraq, and spent more than a year there working with President Obama at the time and Vice President Biden.
My wife and I, we wanted to move back to California, where both of us are from, and I landed in San Francisco and had the opportunity to join an early stage startup focusing on natural language processing called Primer AI. This is right around the time that transformer models began to emerge, and I was employee number 20 or something like that, very early in a newly minted series A company, and just had an incredible opportunity to immerse myself in the world of language models for the next four years at Primer, as we tried to build knowledge graphs and workflows for enterprise customers on top of that. And it was really through those lived experiences at Primer on the model side that I saw a lot of the challenges on the data side, which inspired me to found Unstructured.
Spy Shows
Aditya Kaul: Very interesting. So, just as a curveball question, what's your favourite spy TV show that you watch? I was listening to one of your podcasts or interviews and you said you love reading spy novels. I was curious what your favourite spy TV show is?
Brian Raymond: Oh man, on the TV show... I mean, the Jack Ryan series has been great. I think probably my favourite is The Americans. Nothing's more realistic than The Americans. They did an amazing job with that series.
Aditya Kaul: Yeah. Do you watch any anime by any chance?
Brian Raymond: You know what? A little bit, but I'm not deep on it, but would love to hear more.
Aditya Kaul: So my son, he watches this show called Spy x Family. It's a really good anime series from Japan. Yeah. So definitely highly recommend it.
Brian Raymond: I got three small kids. So we got to put that in the queue. That's a great recommendation.
Aditya Kaul: Yeah. Although I wouldn't recommend it for really small kids, maybe 11 plus.
Pivot from NLP to GenAI
Aditya Kaul: It's quite clear that there are lots of companies in the NLP space that have been around for a while now. Many NLP companies I've noticed haven't successfully pivoted to GenAI or language models. Initially, every little task in NLP had a separate model, right? And now you have these large language models that can generalise across multiple NLP tasks. So I'm curious, as you have successfully pivoted to GenAI and large language models: what was behind that decision? What was the thinking behind making that pivot? Had you already seen that at Primer?
Brian Raymond: Yeah, look, rewinding the clock three or four years ago, you had two big economic and practical challenges in NLP. First, as you mentioned, you needed lots of small models to extract triples and do different classification tasks and pull that all together into a knowledge graph.
And that's why you saw an abundance of data labelling startups, and model hosting and model serving startups. Hugging Face exploded as a library for these models. That was all to grapple with the challenges associated with that. The second challenge was on the data side. Even today, you typically need chunked JSON or chunked Markdown to feed into the context window for these models. Context windows used to be tiny; now we're talking about millions of tokens.
You couldn't walk more than ten feet in San Francisco without seeing a data labelling or a model serving startup during that time. And that's evolved.
“However, if you go to a large investment bank or a CPG organisation like Walmart, they'd say the data they want to use is in PowerPoints, Google Docs, Slack messages, and audio recordings, and there's absolutely nothing to help data scientists or solutions architects get that into nice chunked JSON to feed to those models.”
From 2022 through today, things on the model side of the equation have evolved very rapidly, as you articulated. But there's been almost nothing on the data side to actually help in practical terms, and that's what we're building at Unstructured.
Now there are some other folks in our ecosystem beginning to emerge. Initially, our vision was that the Hugging Face community is just exploding; there's so much unmet demand, and they're starting to meet that demand on the model side. But the bottleneck is on the data side as well. Let's go solve that and serve the Hugging Face community. That's who we were building for initially. We still love that community and we're still doing it, but it's just gotten even bigger over the last year and a half, two years. So that was the initial thesis: let's go after those users and help them get the data to those models.
Core Value Proposition
Aditya Kaul: You've just recently launched a platform which is more enterprise-friendly. Can you talk about your core value proposition? I know you have different products, but it would be good to just hear what those are.
Brian Raymond: Yeah. So, we started with open source and we still support our open source projects. It's straightforward: a raw file comes in, and JSON with metadata comes back, which you can then chunk. We have a commercial API that improves on this with better table extraction, better form extraction, and better element classification, among other things.
However, if you're an organisation that wants to continually hydrate a RAG architecture, or needs to prepare a bunch of data for fine-tuning or pre-training tasks, simple file transformation isn't enough. We're talking about connectors, like what Fivetran and Airbyte and others have done, but for these datasets you need file transformation, chunking, summarisation, vectorisation, vector syncing, and then writing the results to a downstream location on a schedule, with various options.
Developers and data scientists also have to worry about how they're indexing and retrieving the data, along with tuning their LLM parameters.
Our platform vision is to simplify this process: from raw to RAG-ready data, or raw to fine-tuning-ready data, without the user having to worry about it. You set the parameters you want, and it just works.
You show up, and your data is parked in your Pinecone index, Weaviate index, or in an AWS S3 or MongoDB Atlas database, ready to go.
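As an editorial aside, the raw-to-RAG-ready flow Brian describes (partition a file into elements, chunk, embed, then sync to a vector store) can be sketched in miniature. This is a toy illustration, not Unstructured's actual API: `partition`, `embed`, and the in-memory "vector store" here are all stand-ins for real components.

```python
import hashlib

def partition(raw_text: str) -> list[dict]:
    """Stand-in for document partitioning: turn a raw file into
    JSON-like elements with metadata (real tools also classify
    titles, tables, lists, and so on)."""
    return [
        {"type": "NarrativeText", "text": p.strip(), "metadata": {"order": i}}
        for i, p in enumerate(raw_text.split("\n\n"))
        if p.strip()
    ]

def chunk(elements: list[dict], max_chars: int = 200) -> list[str]:
    """Greedily pack element texts into chunks under a size budget."""
    chunks: list[str] = []
    current = ""
    for el in elements:
        if current and len(current) + len(el["text"]) + 1 > max_chars:
            chunks.append(current)
            current = ""
        current = (current + " " + el["text"]).strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic 'embedding'; a real pipeline calls an embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def run_pipeline(raw_text: str) -> list[dict]:
    """Raw text in, vector-store-ready records out."""
    return [{"vector": embed(c), "text": c} for c in chunk(partition(raw_text))]

doc = "First paragraph about revenue.\n\nSecond paragraph about costs."
index = run_pipeline(doc)
print(len(index), len(index[0]["vector"]))  # 1 8
```

In a production pipeline each step would additionally run on a schedule and sync incrementally, which is exactly the orchestration burden the platform is meant to absorb.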
LLM Stack
Aditya Kaul: Talking about the LLM stack, which is evolving—it's still early days—but you've mentioned integration with entities like Hugging Face, LangChain, and vector databases. We also see major cloud AI providers like Microsoft, Google, AWS in this space. How do you see this evolving, and where do you see it going as it becomes more advanced?
Brian Raymond: It's a very interesting question. There are some things that aren't changing, which I find as intriguing as the changes themselves. For instance, despite some debates, large context windows haven't eliminated the need for RAG (retrieval-augmented generation).
“From everything we've seen, in terms of total cost of ownership and performance, RAG will continue to be the dominant paradigm, possibly integrated with knowledge graphs.”
What is rapidly evolving is the work on observability and the orchestration of agents to optimise LLM integration. This includes traditional retrieval techniques to deliver exactly what is needed into the context window at the right time—because accuracy without noise is crucial.
On our side, we're delving into more complex pre-processing and indexing strategies, determining which types of summaries to append, how to embed, and how to chunk data. This isn't about introducing new variables but deepening our understanding of existing ones.
“We're seeing a significant expansion in tooling which is making a substantial difference in whether these architectures succeed in moving to production. There's a virtuous cycle developing between improved model performance and enhanced observability and orchestration, which is fascinating. It's crucial to continue this trend as it increases the success rate of developers deploying GenAI solutions within their organisations.”
RAG in Production
Aditya Kaul: Great, can you walk us through a typical customer use case? I'm curious about the practical implementation of RAG in production. Do you have any examples you can share?
Brian Raymond: Certainly, I have three interesting examples.
First, search and discovery applications, where we see the highest success. Companies like Glean are doing an excellent job in this area, packaging it effectively, with models mature enough to deliver substantial business value.
The second example revolves around Q&A systems, like customer chatbots, which are now a dominant user interface. With the right data and proper setup, these systems can be effectively deployed in production.
Lastly, we're working with a client who uses our solutions to make decades of product research accessible to models. Instead of merely automating processes, we're enabling the model to generate new product ideas and strategies. These creative applications are particularly well-suited to RAG, bridging the gap between QA chatbots and more complex multi-step process automation tasks, which can be more challenging.
Agentic LLMs
Aditya Kaul: As we go into multi-agent, multi-step LLMs, how do you see this changing your value proposition? It seems early days still, but what are your thoughts?
Brian Raymond: From a broad perspective, agents are going to make it easier to retrieve the right data and bring it back to the context window, potentially reducing the pre-processing workload. Agents excel at working with more structured data, like crawling HTML or code bases. However, when dealing with tens or hundreds of billions of records, it's still challenging to quickly find exactly what you need.
If agents have to retrieve data from sources like SharePoint or native PDFs and PowerPoints, the latency could be quite high. There's a distinction between what's possible to show in a demo and what's practical for scaling across an organisation of non-technical users to actually deliver productivity gains. We're still quite a ways away from agents being able to efficiently handle raw files or pulling them through a transformation pipeline.
“Our benchmarking shows that using GPT-4 Turbo for PDFs often results in missing or out-of-order words more than 40% of the time, and it's slower than more traditional approaches. It will take quite a bit of maturing in this space if we want to start using generative models for non-generative tasks.”
Multimodal LLMs
Aditya Kaul: That's very interesting. And about multimodal capabilities, as these improve, what are your thoughts on their impact with regards to RAG and ETL (extract transform and load), especially as they get better at extracting charts and graphs?
Brian Raymond: There are different approaches here. Text still has the most information density, so extracting and rendering it in the correct reading order without missing words is essential but challenging, and generative models currently perform poorly at this. However, for engineering schematics, flow charts, and pie charts, they're fantastic. The Anthropic team, for instance, has done an incredible job with Claude 3 on this.
You want to orchestrate the best tool for the right problem. Through our platform, for example, we can identify a schematic, automatically send it to an LLM, generate a rich text description in the correct reading order, and also embed the image itself, minimising information loss as you transition from a raw file to data pre-staged in a vector database.
Currently, the best in the world at information extraction from documents is Azure Document AI, which misses fewer than 5% of the words in our benchmarking. However, the best LLMs miss more than 40% of the words and are more computationally expensive. Therefore, the trick is orchestrating the right model for the right task. Using LLMs for classification tasks on a large scale is economically unfeasible.
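Brian's "right model for the right task" orchestration can be sketched as a simple router over document elements. This is an illustrative toy, not Unstructured's platform code: `describe_with_llm` and `extract_text` are hypothetical stand-ins for a multimodal-LLM call and a traditional, cheaper extraction path.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str     # e.g. "NarrativeText", "Table", "Image"
    payload: str  # text content, or a reference to an image

def describe_with_llm(image_ref: str) -> str:
    """Stand-in for a multimodal-LLM call that turns a schematic
    or chart into a rich text description."""
    return f"[LLM description of {image_ref}]"

def extract_text(text: str) -> str:
    """Stand-in for a traditional (fast, cheap) text-extraction path."""
    return text

def route(element: Element) -> str:
    # Send images/schematics to the expensive multimodal model;
    # keep plain text on the traditional path.
    if element.kind == "Image":
        return describe_with_llm(element.payload)
    return extract_text(element.payload)

doc = [Element("NarrativeText", "Pump specs below."),
       Element("Image", "figure_3_schematic.png")]
print([route(e) for e in doc])
```

The design point is the dispatch itself: per-element routing keeps the economics workable, because only the small fraction of elements that genuinely need a generative model ever reach one.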
Large Context Windows
Aditya Kaul: I've been playing around with Gemini 1.5 Pro with its 1 million token context window, and I find it does a pretty good job, basically for extraction from large documents. You were mentioning GPT-4 was missing 40 percent of the words. And you mentioned in one of your comments that you don't think large context windows are really going to make a big difference.
I was just curious why you thought that was the case and any thoughts on the Gemini model, how that's doing?
Brian Raymond: I guess where I kind of get lost on the large context windows is when we're talking to customers about use cases.
“We rarely see a workflow where a user says, I know exactly the file that I want and what I want done. And then they go and upload it to a model and ask questions in a chat and then pull it back down and integrate it into what they're doing. That's just not how people work.”
Maybe in certain circumstances, there might be isolated cases where you do that, but I haven't seen workflows transition to that with anything that OpenAI or Anthropic introduced over the last two years. Instead, what you're seeing are copilots.
That's how people want it integrated into their workflows, much like how Gmail predicts text as you type. So, if that's going to be the dominant user experience, which is an accelerant to existing workflows, instead of ‘Find the file, upload the file, ask a question about the file, download it again’, then we have to ask, what do you really gain from needing more context?
An 8K context window, like Llama 3's, is sufficient for most scenarios if you're using the model as a co-pilot and you've pre-staged billions of records in a vector database. Then you just pull exactly what you need into that context window to help accelerate workflows. We'll see if co-pilots actually change the game or if users will still need to manually manage data uploads, which can be slow and expensive. While the 'lost in the middle' issues are improving, I still don't see a huge amount of value in large context windows.
Aditya Kaul: Yeah. Maybe there are different use cases. I use Gemini 1.5 Pro to help me read really complex philosophy books, for example, where you can upload a whole series. But in the enterprise case, as you said, typically people don't know what file they're working with. So that makes sense.
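The co-pilot pattern Brian describes, pre-staging records and then pulling only what's needed into a modest context window, amounts to packing the highest-scoring retrieved chunks into a fixed token budget. A minimal sketch, assuming the chunks have already been scored by a retriever and approximating tokens as whitespace-separated words:

```python
def fit_context(chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedily fill a fixed context window with the highest-scoring
    chunks; anything that doesn't fit the remaining budget is skipped."""
    selected: list[str] = []
    used = 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = len(text.split())  # crude word-count proxy for tokens
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected

ranked = [(0.91, "Q3 revenue grew 12 percent."),
          (0.40, "Unrelated boilerplate paragraph."),
          (0.87, "Growth was driven by new contracts.")]
print(fit_context(ranked, budget_tokens=12))
```

With this pattern, the context window size matters much less than retrieval quality: an 8K budget is plenty if the scoring surfaces the right chunks, which is Brian's argument in a nutshell.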
Synthetic Data
Aditya Kaul: Any thoughts on synthetic data? I mean lots of people talking about synthetic data and would that change anything or is that going to shift anything with pre-processing?
Brian Raymond: I think it has a disproportionate impact on model pre-training, and maybe some on fine-tuning, but you still have issues with overfitting that are really difficult and vexing to overcome if you're going to use synthetic data. And at the end of the day, the use cases here are really boring.
It's like, Barbara in logistics wants to know what's going on with this purchase order and this delivery. Synthetic data is relevant for helping fine-tune that model, using RLHF or reinforcement learning from AI feedback (RLAIF), in order to shape the outputs in a way that's helpful for Barbara.
But at the end of the day, synthetic data is not going to help Barbara figure out what's going on with a delivery that's stuck in customs or something like that. So you're still going to need to take human-generated data, like a customs form, and that workflow, and then connect those.
So we're seeing an explosion of synthetic data for training, but there are really interesting things going on. Llama 3 isn't a mixture-of-experts model; it's a single big model. And there are some really interesting conversations going on right now about convergence: as you start to encode additional information into these models, are they converging around the same sorts of behaviours? And then the limiting factor is how we're going to get more human-generated data to these models.
Challenges with Document Processing
Aditya Kaul: In terms of the types of documents you've talked about—tables, charts, using multimodality—is there something that is still kind of difficult to process? I mean, based on your own experience, what you see within enterprises or even with the open source developer community that is using it, where is the maximum effort needed to make it work for an LLM?
Brian Raymond: The listeners will probably roll their eyes at this, but things are still unbelievably difficult.
“Take tables without lines demarcating columns and rows; LLMs still perform terribly at parsing those. Most of the state-of-the-art table extraction models, like Microsoft's new table extraction model, which is great, and the one we're working on, still really struggle. So if you want to get all of the tables out of a 10-K, an SEC 10-K filing, nobody in the industry can do it reliably.”
A lot of the object detection tasks in documents are more complicated than even self-driving car tasks from an object detection standpoint, just because of the information density and the heterogeneity of document layouts. It's unbelievably difficult to infer hierarchy, infer reading order, and extract key-value pairs, or even identify key-value pairs in the first place. Models aren't great at those tasks; more traditional approaches are still far superior to LLMs, and it's still very difficult even to figure out how to label the data for those tasks.
So there are old problems where performance is still actually pretty poor, and we've got to get better at those old problems if we're going to move the ball on some of the newer ones. That's why we're spending a huge amount of time right now on labelling data, playing with different ontologies, and playing with different model architectures to try to crack that. Because if the table is completely garbled, an LLM can't piece it back together again; it's not going to work.
Aditya Kaul: Yeah, that matches my own experience. I've been working on some projects looking at environmental certification and permits; in the U.S., every state has its own permit format. Beyond a certain level of complexity the models just can't handle it, and table extraction in particular is very, very hard.
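For a sense of why borderless tables are so hard, here is a deliberately naive heuristic that splits columns on runs of whitespace. It works only when rows are cleanly aligned; merged cells, wrapped text, and ragged spacing all defeat it, which is exactly the gap dedicated table-structure models try to close. This is illustrative only, not how Unstructured's or Microsoft's models work.

```python
import re

def split_columns(rows: list[str]) -> list[list[str]]:
    """Naive column inference: treat runs of two or more spaces
    as column separators. Brittle by design, to show why layout
    heuristics fail on real-world borderless tables."""
    return [re.split(r"\s{2,}", row.strip()) for row in rows]

rows = ["Item      Q1    Q2",
        "Revenue   10    12",
        "Costs      7     8"]
print(split_columns(rows))
```

The moment a cell contains a two-space gap, a row wraps onto a second line, or a header spans two columns, this heuristic silently produces the wrong structure, and, as Brian notes, a downstream LLM cannot repair a garbled table.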
New Startup Opportunities in the LLM Stack
Aditya Kaul: Great, I wanted to kind of move on to the future of AI, and as a founder, you've recently raised a Series B round. You're laser-focused on the GenAI stack and, you talked a little bit about it in terms of the earlier discussion we were having with regards to Hugging Face, Pinecone etc, but in the next 12 months, you are going to see massive shifts in the startup space. Again, I'm kind of asking it more from a startup founder perspective. If somebody is a startup founder, where should they be working and what area of the stack should they be focused on?
Brian Raymond: Yeah, I think there's so much opportunity right now, and so many interesting things going on in terms of the stack. There are a lot of verticalised solutions beginning to emerge. For example, Harvey on legal docs, they're just crushing it. Glean is crushing it on enterprise search, and Typeface is crushing it on marketing. A lot of these verticalised solutions are easier to scale at this point, given where the tech is at, and there is enormous opportunity there. OpenAI published a really interesting piece this week about how, in the biotech and pharma space, there are GenAI GPTs tailored for different workflows.
There's also enormous opportunity in horizontal capabilities. We've seen an explosion of small companies popping up: some focus just on embeddings and vectors, others just on chunking, some just on certain types of indexing on the search side. Competition there is very intense, and you're starting to see a bit of consolidation, like Langflow being acquired by DataStax.
There is some interesting work that Datology, Cleanlab, and Nomic are doing around dataset curation, model fine-tuning, and model pre-training. I think there's still a huge amount of room to run there.
I think the work that Harrison and LangChain are doing around observability and orchestration is important, as is the work on agents from companies like Adept, who were really early in the space. I think there's a huge opportunity around agents. And then there's new UX: I don't know if chat is going to continue being the dominant paradigm from a UX standpoint. There are opportunities to really push things forward in terms of how this is actually imported into products and realised; Perplexity has done some really interesting things there.
Government vs Private Enterprise
Aditya Kaul: I was curious again, you have a background in government and the intelligence community. How is that community different from typical enterprise requirements when it comes to offering them an ETL platform for LLMs?
Brian Raymond: Well, the interesting thing is the convergence, and how quickly they've caught up over the last several years. Where we are in the stack, which is one of the reasons I love our position, is buried deep: a PowerPoint is a PowerPoint, a PDF is a PDF, HTML is HTML, so the requirements are indistinguishable. Government customers now use multi-cloud environments like GCP, AWS, Azure, and Oracle. This has radically changed from five years ago, when everything was on bare metal and there was no cloud infrastructure, making it very difficult to operate.
On the workflow side and at the application layer, it couldn't be more different. We're talking about command and control applications and intelligence operation applications, which serve a different customer with different workflows and mission requirements. Companies like Primer, where I used to work, and others like Vannevar Labs, Scale AI’s Donovan, Yurts are doing really interesting work at the application layer, which is pretty specific to those domains. But deep in the stack, where we operate, it's pure dual-use technology. We use the exact same code base for enterprise customers inside and outside of government.
E/Acc and Techno-Optimism vs Effective Altruism
Aditya Kaul: Great, in terms of a bigger picture and philosophy, there's a lot of debate happening with regards to effective altruists (EA) versus techno-optimists. With your intelligence and government background, but also as a Silicon Valley startup CEO, where does your allegiance lie? Are you e/acc, EA, or somewhere in the middle?
Brian Raymond: Look, I think that technology is neither good nor bad; it's how it's used, but it only goes in one direction. There are really interesting ethical considerations at play right now, like how hardware is utilised in situations like Ukraine, with the need for, and the difficulty of using, non-autonomous kamikaze drones, for example. We're seeing warfare move in that direction, and there will be some very difficult decisions about what will be necessary to compete with our adversaries.
The Pentagon has done a very good job of formulating a responsible framework, stress testing it, and then applying it. However, the world is evolving very quickly, more quickly than we could have imagined. Continuing to have a robust discussion and having all ideas on the table will be essential to navigate this effectively, balancing both our security, which is of paramount importance, and our ethics and values.
I don't think there is one answer to this. It's scary on a lot of these issues, but it's also scary if bad actors have this technology. You can't put a lid on it. This isn't something like a nuclear program where control was possible; even that didn't work perfectly as the technology spread rapidly. The pace of development and the need to continue to grapple with these issues urgently and seriously is not up for dispute. We need to drive consensus and have a shared view across the public, which is a challenge, but we should always strive towards that.
So, I don't fall into one camp or the other. I think I'm more of a pragmatic realist; one thing I learned at the CIA was to be able to flip arguments inside out and look at them from a 360-degree view, and when you're used to doing that, you're never really satisfied.
Bridging Cultures Between DC and Silicon Valley
Aditya Kaul: That makes sense. And I guess the last question is related to what I asked before, but generally speaking, Silicon Valley has had a difficult relationship with DC, especially regarding regulation and government. There's traditionally been some mistrust, but now, I see that changing. The large tech giants are deeply involved with regulation. On the startup side, CEOs and VCs seem to realise the need to be active and engaged in lobbying, much like what A16Z is currently doing in DC. Given your background in DC and now in Silicon Valley, do you see this as a benefit where you can straddle both places and perhaps bring a balanced perspective to help out in some way?
Brian Raymond: Absolutely. I think the real low point was after Google pulled out of Project Maven about five or six years ago, which marked the nadir of the relationship. But now, it's quite fascinating. A16Z has their American Dynamism Fund, and a number of defense tech-focused funds have invested in us, including Shield Capital, led by Mike Brown, former head of the Defense Innovation Unit. More broadly, you're seeing government leaders frequently visit Silicon Valley to meet with startups, and more startups investing in the public sector and engaging in discussions.
For instance, last fall, the Senate held multi-day hearings where startup founders were invited to discuss responsible AI frameworks. The Executive Branch and the EOB have been very inclusive of startups in their executive orders. Notables like Clem from Hugging Face and Alexander from Scale AI participated. There are more mechanisms for dialogue today than in 2018-2020. We are seeing the benefits of this engagement, with capital flowing into these companies and a lot less crosstalk than before. Overall, we're on a far healthier trajectory than what we saw during the last phase.
Aditya Kaul: Great, I think that's it. Brian, I appreciate you taking the time, and thank you so much for joining us.
Brian Raymond: Thanks for having me.