LlamaIndex on LinkedIn: QA over massive tabular data without hallucinations (Caltrain schedule… (2024)

LlamaIndex

150,196 followers

QA over massive tabular data without hallucinations (Caltrain schedule edition 🚞🕰️)

Even as LLMs get better, they still hallucinate over very complex tables and charts in documents, largely because of poor parsing.

Case in point: the Caltrain schedule (or any big train schedule), which packs in a lot of time information. With LlamaParse, we were able to spatially lay out the text in a semantically coherent manner, so that our GPT-4o-powered QA pipeline could correctly answer questions 💡

In contrast, with naive parsing (pypdf), a lot of tabular information gets lost, leading to LLM hallucinations.

Check out our brand-new notebook here: https://lnkd.in/geCGEMbs
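
To see why layout matters for a schedule, here is a minimal sketch contrasting naive extraction with a spatial layout. It is not LlamaParse's actual implementation, which the post does not describe. The word boxes, train labels, times, and the `naive_text` / `layout_text` helpers are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical word boxes as (x, y, text): x is a column offset in characters,
# y is a row index. Real PDF extractors report positions in points instead.
WORDS = [
    (0, 0, "Station"), (12, 0, "T101"), (20, 0, "T103"),
    (0, 1, "North"), (12, 1, "6:05"), (20, 1, "7:05"),
    (0, 2, "South"), (20, 2, "7:40"),  # T101 does not stop at South
]

def naive_text(words):
    """Concatenate words in reading order, discarding their positions."""
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    return " ".join(text for _, _, text in ordered)

def layout_text(words):
    """Place each word at its x offset so that columns stay aligned."""
    rows = defaultdict(list)
    for x, y, text in words:
        rows[y].append((x, text))
    lines = []
    for y in sorted(rows):
        line = ""
        for x, text in sorted(rows[y]):
            # Pad with spaces up to the word's column, then append the word.
            line = line.ljust(x) + text
        lines.append(line)
    return "\n".join(lines)

print(naive_text(WORDS))
print(layout_text(WORDS))
```

In the naive output, "7:40" simply follows "South", so nothing says whether it belongs to T101 or T103. In the laid-out output it sits in the same column as "T103", which is the cue a downstream model needs.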

7 Comments

Hunter Zhao

Founder at Petal.org & GPT-trainer.com

Are you representing the table as markdown, csv, html, json, or some other format after extraction? There has to be some kind of limit to the tokens you can fit in a single chunk, right? Since tables come in such a variety of formats, how do you ensure the data is stored according to the intended table structure?

Our team at GPT-trainer studied this problem quite a bit. Here’s our approach: https://guide.gpt-trainer.com/working-with-tables

Microsoft Research also talked about LLMs and tables: https://www.microsoft.com/en-us/research/blog/improving-llm-understanding-of-structured-data-and-exploring-advanced-prompting-methods/

Then there’s the problem of multi-page tables too, and tables with merged cells, etc., which I assume are out of scope.
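
One possible answer to the chunk-size question above is to serialize rows as markdown and repeat the header in every chunk, so each chunk stays self-describing. The sketch below is only that: it is not GPT-trainer's or LlamaParse's method, the `table_to_markdown_chunks` helper and the sample rows are hypothetical, and a real pipeline would budget by tokens rather than by row count.

```python
def table_to_markdown_chunks(header, rows, max_rows_per_chunk):
    """Split a table into markdown chunks, repeating the header in each one."""
    if max_rows_per_chunk < 1:
        raise ValueError("max_rows_per_chunk must be at least 1")
    head = "| " + " | ".join(header) + " |"
    rule = "| " + " | ".join("---" for _ in header) + " |"
    chunks = []
    for start in range(0, len(rows), max_rows_per_chunk):
        body = [
            "| " + " | ".join(str(cell) for cell in row) + " |"
            for row in rows[start:start + max_rows_per_chunk]
        ]
        chunks.append("\n".join([head, rule, *body]))
    return chunks

# Hypothetical sample data.
header = ["train", "departs", "arrives"]
rows = [
    ["T101", "6:05", "6:50"],
    ["T103", "7:05", "7:50"],
    ["T105", "8:05", "8:50"],
]
chunks = table_to_markdown_chunks(header, rows, max_rows_per_chunk=2)
for chunk in chunks:
    print(chunk, end="\n\n")
```

Repeating the header costs a few tokens per chunk, but a retrieved chunk can then be read without its neighbours. Merged cells and multi-page tables are not handled here.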

Titas Das

Applied ML Scientist | Software Engineer

How do we keep the parsing quality high while relying less on a powerful LLM (because the cost can rise exponentially as the number of docs increases)? Are there potential solutions from computer vision, such as segmentation or deep-learning-based object detection, that can be integrated here to parse not just tables but also plots and graphs while preserving the format?

Raj Kannan

AI Solutions Architect

THIS IS HOT! https://shorturl.at/oZzZY

Abdelfettah Latrache

software engineer at SYNDIKAT7

More Relevant Posts

LlamaIndex

2h
Text-to-SQL - fully local edition 🔐

The latest local LLMs are capable not only of RAG synthesis but also of querying structured databases. Diptiman Raichaudhuri has a great tutorial on how to build a fully local text-to-SQL setup, letting you query local databases without an internet connection.

Stack:
🐤 DuckDB as the database
🦙 Ollama + Mixtral-8x7B as the model
🦙 LlamaIndex as the text-to-SQL orchestration

Check out the full tutorial here: https://lnkd.in/gi_J7eNB
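
The general text-to-SQL loop behind a setup like this can be sketched without any of the stack above: show the model the schema, ask it for a query, then execute that query. This is not the tutorial's code. It assumes Python's built-in `sqlite3` in place of DuckDB and a hard-coded `fake_llm` in place of Ollama + Mixtral-8x7B, and every table and function name is hypothetical.

```python
import sqlite3

def schema_prompt(conn, question):
    """Build a prompt containing the CREATE TABLE statements and the question."""
    ddl = [row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'")]
    return "Schema:\n" + "\n".join(ddl) + f"\nQuestion: {question}\nSQL:"

def text_to_sql(conn, question, llm):
    """Ask the model for SQL, then run it. Only SELECT statements are executed."""
    sql = llm(schema_prompt(conn, question)).strip()
    if not sql.lower().startswith("select"):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")
    return conn.execute(sql).fetchall()

def fake_llm(prompt):
    """Stand-in for a local model; a real one would generate SQL from the prompt."""
    return "SELECT name FROM cities WHERE population > 1000 ORDER BY name"

# Hypothetical in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Alpha", 500), ("Beta", 2000), ("Gamma", 3000)],
)
rows = text_to_sql(conn, "Which cities have more than 1000 people?", fake_llm)
print(rows)
```

Passing the model in as a plain callable keeps the database logic testable on its own; swapping in a real local model only changes the `llm` argument.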

LlamaIndex

5h Edited
Multi-Document Agentic RAG using LlamaIndex and Mistral

This is a great article by Plaban Nayak on how to build a multi-document agent that can reason about multi-part questions over multiple documents.

It does this by modeling each document as a set of tools (summarization and vector search). Since there can be many documents (so the tools would overflow the context window), we can do tool retrieval first in order to pre-fetch the relevant tools that the agent will operate over.

The end result is an advanced agent powered by Mistral AI function calling that goes many steps beyond what naive RAG systems can do. https://lnkd.in/gcuMfJJP
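
The "tools per document, then tool retrieval" idea can be sketched in a few lines. This is an illustration under stated assumptions, not the article's code: the `Tool` class, the helper names, and the sample documents are hypothetical, and word overlap stands in for the embedding similarity a real system would use.

```python
import re
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str

def tools_for_document(doc_id, title):
    """Model one document as two tools: summarization and vector search."""
    return [
        Tool(f"summary_{doc_id}", f"Summarize the document about {title}"),
        Tool(f"vector_{doc_id}", f"Look up specific facts in the document about {title}"),
    ]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_tools(query, tools, top_k):
    """Rank tools by word overlap with the query and keep the top_k."""
    q = tokenize(query)
    ranked = sorted(
        tools,
        key=lambda t: len(q & tokenize(t.description)),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical document collection.
docs = {"a": "solar panels", "b": "wind turbines", "c": "nuclear reactors"}
all_tools = [
    tool
    for doc_id, title in docs.items()
    for tool in tools_for_document(doc_id, title)
]
selected = retrieve_tools("Summarize the findings on wind turbines", all_tools, top_k=2)
print([t.name for t in selected])
```

Only the `selected` tools would be handed to the agent, so the prompt size depends on `top_k` rather than on the number of documents.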

LlamaIndex

1d
Get 32x faster performance on your vector search at only a 4% cost in accuracy!

Building production apps is all about tradeoffs. In vector search, your data is typically encoded as vectors of 32-bit floats, which can take a lot of storage and compute to search. In this blog post from Jina AI, they show you how to get dramatically faster vector search by reducing each value to a single binary digit, at a small cost in retrieval accuracy.

They also demonstrate how to get it working in LlamaIndex: it's as simple as adding a single `encoding_queries='binary'` parameter to your embedding call!

https://lnkd.in/e4P82Dwk
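
For intuition, binary quantization can be sketched in plain Python: keep only the sign of each dimension, pack the bits, and compare vectors by Hamming distance. Each 32-bit float becomes one bit, a 32x reduction in size. This is a generic illustration, not Jina AI's or LlamaIndex's implementation; the `binarize` and `hamming` helpers and the 8-dimensional vectors are made up.

```python
def binarize(vector):
    """Pack one bit per dimension into an int: 1 if the value is positive."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")

# Hypothetical 8-dimensional embeddings.
query = [0.9, -0.2, 0.4, -0.7, 0.1, 0.3, -0.5, 0.8]
docs = {
    "near": [0.8, -0.1, 0.5, -0.6, 0.2, 0.1, -0.4, 0.9],
    "far": [-0.9, 0.3, -0.4, 0.6, -0.2, -0.3, 0.5, -0.7],
}

q_bits = binarize(query)
# Smaller Hamming distance means a closer match.
ranked = sorted(docs, key=lambda name: hamming(q_bits, binarize(docs[name])))
print(ranked)
```

The comparison is an XOR plus a bit count instead of a floating-point dot product, which is where the speedup comes from; the accuracy loss comes from discarding each value's magnitude.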

LlamaIndex

1d
Structured Image Extraction with GPT-4o 🖼️

GPT-4o is state-of-the-art in integrating image/text understanding, and we’ve created a full cookbook showing you how to use GPT-4o to extract structured JSON from images. It does this much better than GPT-4V.

We feed it detailed papercards (created by Val Andrei Fajardo) of various research papers (see diagram below), and measure quantitative metrics like failure rates and quality of extracted outputs. GPT-4o was able to extract structured output from every image (0% failure rate), and it synthesizes much higher-quality answers/insights than GPT-4V.

Check out our full cookbook here - also by Val Andrei Fajardo! https://lnkd.in/gptzdcUq
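
The post does not say how the cookbook computes its failure rate. One plausible definition, sketched below, counts an output as a failure when it is not valid JSON or does not match the expected schema. The schema fields, the helper names, and the sample outputs are all hypothetical.

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"title": str, "authors": list, "summary": str}

def parse_structured(raw):
    """Return the parsed dict, or None if the output does not fit the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data

def failure_rate(raw_outputs):
    """Fraction of model outputs that could not be parsed into the schema."""
    if not raw_outputs:
        raise ValueError("no outputs to score")
    failures = sum(1 for raw in raw_outputs if parse_structured(raw) is None)
    return failures / len(raw_outputs)

# Made-up model outputs, one per image.
outputs = [
    '{"title": "Paper A", "authors": ["X"], "summary": "..."}',
    '{"title": "Paper B", "authors": "X"}',  # wrong type and a missing field
    "Sorry, I cannot read this image.",  # not JSON at all
    '{"title": "Paper C", "authors": [], "summary": "..."}',
]
rate = failure_rate(outputs)
print(rate)
```

Under this definition, a 0% failure rate means every output parsed into the schema; it says nothing about whether the extracted values are correct, which is why the post reports output quality separately.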

LlamaIndex

2d
Announcing our first-ever meetup at our brand-new office in San Francisco! Come hang out with us and hear from us and our friends at Activeloop and Tryolabs about the latest developments in generative AI. https://lnkd.in/d3HXb3wF

LlamaIndex

2d
🔥 Introducing GPT-4o + LlamaParse 🔥

GPT-4o is the state-of-the-art model for multimodal understanding, meaning it also has state-of-the-art document parsing capabilities.

LlamaParse is the platform for enabling LLM-powered parsing - it uses LLMs to extract documents from any file type in a performant, reliable fashion, offering state-of-the-art response quality for advanced document RAG.

We’re excited to offer GPT-4o as an explicit option in LlamaParse, which will use GPT-4o for extraction per page into markdown, instead of using our default parsers/models.

Why:
- GPT-4o is very good at parsing very complex documents into well-formatted markdown. Oftentimes it outperforms our default approaches.
- This means that it can turn documents with very complex tables / charts into clean, indexable data for your RAG pipeline - higher response quality, fewer hallucinations 📈

Tradeoffs / Caveats ⚠️:
- It’s expensive 💵: Due to the cost of inference, using GPT-4o is currently $0.60 USD per page (while by default LlamaParse is $0.003 per page). This cost can spike quickly - beware!
- You can specify your OpenAI key, in which case the marginal cost per page goes down to 0.3c per page.
- This is a beta feature. Given the cost and latency, use this with caution!

If you want to give this a shot, sign up for an account and check out our UI: https://lnkd.in/gbkxQAQd

Notebook: https://lnkd.in/grwUVr-G
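
To make the cost caveat concrete, here is a small worked calculation using the per-page prices quoted above. The 500-page document size and the `parse_cost` helper are hypothetical. The own-key figure covers only the LlamaParse charge; presumably your OpenAI usage is billed separately, which the post does not spell out.

```python
from decimal import Decimal

# Per-page prices quoted in the post, in USD.
GPT4O_PER_PAGE = Decimal("0.60")
DEFAULT_PER_PAGE = Decimal("0.003")
OWN_KEY_PER_PAGE = Decimal("0.003")  # "0.3c per page" with your own OpenAI key

def parse_cost(pages, price_per_page):
    """Total charge in USD for parsing a document of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * price_per_page

pages = 500  # hypothetical document size
gpt4o_cost = parse_cost(pages, GPT4O_PER_PAGE)
default_cost = parse_cost(pages, DEFAULT_PER_PAGE)
own_key_cost = parse_cost(pages, OWN_KEY_PER_PAGE)
ratio = GPT4O_PER_PAGE / DEFAULT_PER_PAGE

print(f"GPT-4o mode: ${gpt4o_cost:.2f}")
print(f"Default:     ${default_cost:.2f}")
print(f"Own key:     ${own_key_cost:.2f}")
print(f"GPT-4o mode costs {ratio:.0f}x the default per page")
```

At these prices GPT-4o mode is 200 times the default per-page rate, so it is worth reserving for the pages whose tables or charts the default parser handles poorly.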

LlamaIndex

2d
LlamaParse 🤝 Quivr 🧠

Quivr (YC W24) (Stan Girard) is a popular open-source platform where you can create personalized AI assistants over your data.

We’re excited to partner with Quivr to introduce LlamaParse - parse any complex document (.pdf, .pptx, .md) through our advanced parsing capabilities, ensuring that you get clean data before storing it in the agent’s personalized memory. This ensures accurate retrieval and fewer hallucinations during conversation.

Check out the docs! https://lnkd.in/d7Ftjd5w

LlamaIndex

3d
GPT-4o support now available in create-llama! Get 90% of the way through building a chatbot over your data just by answering a few questions.

