LlamaIndex on LinkedIn: QA over massive tabular data without hallucinations (Caltrain schedule… (2024)

LlamaIndex

150,196 followers


QA over massive tabular data without hallucinations (Caltrain schedule edition 🚞🕰️)

Even as LLMs get better, they still hallucinate over very complex tables and charts in documents, largely because of poor parsing.

Case in point is the Caltrain schedule (or any big train schedule): there's a lot of time information in here! With LlamaParse, we were able to lay out the text spatially in a semantically coherent manner, so that our GPT-4o-powered QA pipeline could correctly answer questions 💡

In contrast, with naive parsing (pypdf), a lot of tabular information gets lost, leading to LLM hallucinations.

Check out our brand-new notebook here: https://lnkd.in/geCGEMbs
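To make the difference between naive and spatial parsing concrete, here is a minimal sketch in plain Python. It is an illustration of the general idea only, not LlamaParse's or pypdf's actual implementation; the `(row, column, text)` word format, the helper names, and the tiny timetable are all hypothetical.

```python
from collections import defaultdict

def naive_join(words):
    """Concatenate words in reading order, ignoring where they sat on the page."""
    return " ".join(text for _, _, text in words)

def spatial_layout(words, col_width=8):
    """Place each word on a fixed-width grid according to its (row, col) position."""
    rows = defaultdict(dict)
    for row, col, text in words:
        rows[row][col] = text
    n_cols = max(col for _, col, _ in words) + 1
    lines = []
    for row in sorted(rows):
        # Missing cells become blank padding, so later columns stay aligned.
        cells = [rows[row].get(c, "").ljust(col_width) for c in range(n_cols)]
        lines.append("".join(cells).rstrip())
    return "\n".join(lines)

# Hypothetical timetable fragment: train 101 skips stop B.
words = [
    (0, 0, "Stop"), (0, 1, "101"), (0, 2, "102"),
    (1, 0, "A"), (1, 1, "5:00"), (1, 2, "5:30"),
    (2, 0, "B"), (2, 2, "5:40"),
    (3, 0, "C"), (3, 1, "5:20"), (3, 2, "5:45"),
]

print(naive_join(words))      # "B 5:40" now looks like a time for train 101
print(spatial_layout(words))  # 5:40 stays under the 102 column
```

In the naive output, the empty cell disappears and the remaining time slides into the wrong column, which is exactly the kind of input that invites a wrong answer from the model.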


Hunter Zhao

Founder at Petal.org & GPT-trainer.com

1d


Are you representing the table as markdown, CSV, HTML, JSON, or some other format after extraction? There has to be some kind of limit to the tokens you can fit in a single chunk, right? Since tables come in such a variety of formats, how do you ensure the data is stored according to the intended table structure?

Our team at GPT-trainer studied this problem quite a bit. Here's our approach: https://guide.gpt-trainer.com/working-with-tables

Microsoft Research also talked about LLMs and tables: https://www.microsoft.com/en-us/research/blog/improving-llm-understanding-of-structured-data-and-exploring-advanced-prompting-methods/

Then there's the problem of multi-page tables too, and tables with merged cells, etc., which I assume are out of scope.
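One common way to handle the chunk-size concern is to serialize the table as markdown and repeat the header row in every chunk, so each chunk still carries the column structure. The sketch below assumes that approach; it is not taken from LlamaParse or GPT-trainer, the function names are hypothetical, and a character budget stands in for a real token count.

```python
def to_markdown(header, rows):
    """Render a header and rows as a markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

def chunk_table(header, rows, max_chars):
    """Split rows into chunks that fit max_chars, repeating the header in each chunk."""
    chunks, current = [], []
    for row in rows:
        candidate = current + [row]
        # Character length is a crude proxy; a real pipeline would count tokens.
        if current and len(to_markdown(header, candidate)) > max_chars:
            chunks.append(to_markdown(header, current))
            current = [row]
        else:
            current = candidate
    if current:
        chunks.append(to_markdown(header, current))
    return chunks

# Hypothetical data for illustration only.
header = ["Stop", "101", "102"]
rows = [
    ["A", "5:00", "5:30"],
    ["B", "5:10", "5:40"],
    ["C", "5:20", "5:50"],
    ["D", "5:30", "6:00"],
]
chunks = chunk_table(header, rows, max_chars=80)
```

Repeating the header costs a few extra tokens per chunk, but it means a retrieved chunk can be read on its own. Multi-page tables and merged cells would need extra handling that this sketch does not attempt.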


Titas Das

Applied ML Scientist | Software Engineer

1d


How do we keep the parsing quality high while relying less on a powerful LLM (because the cost can rise exponentially as the number of docs increases)? Are there potential solutions from computer vision, such as segmentation or deep-learning-based object detection, that can be integrated here to solve the problem of parsing not just tables but also plots and graphs while preserving the format?



Abdelfettah Latrache

software engineer at SYNDIKAT7

1d


This is great!


Muhammad Habib

Global Investment Leader in Private Equity & Real Estate | FinTech Expert | $1B+ in Transactions

9h


Muhammad Haseeb


Kabeer Singh Thockchom

AI & Data @ EY | AI Engineering + Product Management | Turning Ideas into Impact with AI: Creating Lasting Value from Ideation to Production | Azure; SaFe POPM

1d


LlamaParse for the win!


More Relevant Posts

  • LlamaIndex


    Text-to-SQL - fully local edition

    The latest local LLMs are not only capable of RAG synthesis, but also of querying structured databases. Diptiman Raichaudhuri has a great tutorial on how to build a fully local text-to-SQL setup, letting you query local databases without an internet connection.

    Stack:
    - DuckDB as the database
    - 🦙 Ollama + Mixtral-8x7B as the model
    - 🦙 LlamaIndex as the text-to-SQL orchestration

    Check out the full tutorial here: https://lnkd.in/gi_J7eNB
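The general shape of a text-to-SQL loop can be sketched with the standard library alone. Note the substitutions: this uses `sqlite3` in place of DuckDB and a hard-coded stub in place of a local model served by Ollama, and it does not use LlamaIndex at all. The table, the question, and every function name are hypothetical; the linked tutorial is the reference for the real stack.

```python
import sqlite3

def build_prompt(conn, question):
    """Show the model the table schema alongside the user's question."""
    schema = "\n".join(
        row[0]
        for row in conn.execute("SELECT sql FROM sqlite_master WHERE type='table'")
    )
    return f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"

def stub_local_llm(prompt):
    """Stand-in for a local model call; always returns the same query."""
    return "SELECT COUNT(*) FROM trips WHERE stop = 'A'"

def answer(conn, question, llm=stub_local_llm):
    """Ask the model for SQL, then run it against the local database."""
    sql = llm(build_prompt(conn, question))
    return conn.execute(sql).fetchall()

# Hypothetical in-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (train TEXT, stop TEXT, departs TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("101", "A", "5:00"), ("102", "A", "5:30"), ("102", "B", "5:40")],
)
result = answer(conn, "How many trains stop at A?")
```

A real setup would also need to validate the generated SQL before executing it, since model output is untrusted input.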


  • LlamaIndex


    Multi-Document Agentic RAG using LlamaIndex and Mistral

    This is a great article by Plaban Nayak on how to build a multi-document agent that can reason about multi-part questions over multiple documents.

    It does this by modeling each document as a set of tools (summarization and vector search). Since there can be many documents (so the tools would overflow the context window), we can do tool retrieval first in order to pre-fetch the relevant tools that the agent will operate over. The end result is an advanced agent powered by Mistral AI function calling that goes many steps beyond what naive RAG systems can do.

    https://lnkd.in/gcuMfJJP
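The tool-retrieval step described above can be sketched as follows. This is a simplified stand-in, not the article's code or the LlamaIndex API: it assumes two tools per document and ranks them by word overlap with the query, where a real system would use embedding similarity. The `Tool` class, function names, and document titles are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str

def build_tools(doc_titles):
    """Model each document as two tools: one for summaries, one for fact lookup."""
    tools = []
    for title in doc_titles:
        tools.append(Tool(f"{title}_summary", f"summarize the {title} document"))
        tools.append(Tool(f"{title}_vector", f"look up specific facts in the {title} document"))
    return tools

def retrieve_tools(query, tools, top_k=2):
    """Return the top_k tools whose descriptions share the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(tool):
        return len(query_words & set(tool.description.lower().split()))

    return sorted(tools, key=overlap, reverse=True)[:top_k]

# Hypothetical document set; only the retrieved tools would be handed to the agent.
tools = build_tools(["pricing", "security", "onboarding"])
selected = retrieve_tools("summarize the pricing document", tools)
```

With three documents there are six tools in total, but the agent only sees the two most relevant ones, which keeps the prompt small as the document count grows.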


  • LlamaIndex


    Get 32x faster performance on your vector search at only a 4% cost in accuracy!

    Building production apps is all about tradeoffs. In vector search, your data is typically encoded as vectors of 32-bit floats, which can use a lot of storage and compute to search. In this blog post from Jina AI, they show you how to get dramatically faster vector search by reducing each value to a single binary digit, at a small cost in retrieval accuracy.

    They also demonstrate how to get it working in LlamaIndex: it's as simple as adding a single `encoding_queries='binary'` parameter to your embedding call!

    https://lnkd.in/e4P82Dwk
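For intuition, binary quantization can be sketched in a few lines of plain Python: keep only the sign of each component (1 bit instead of a 32-bit float, hence 32x less storage per dimension) and compare vectors by Hamming distance. This is a generic illustration, not Jina AI's or LlamaIndex's implementation; the function names and the toy vectors are hypothetical.

```python
def binarize(vector):
    """Pack the sign of each component into an int: bit i is 1 when vector[i] > 0."""
    bits = 0
    for i, x in enumerate(vector):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Count the bit positions where two packed vectors differ."""
    return bin(a ^ b).count("1")

def search(query, corpus, top_k=1):
    """Rank corpus indices by Hamming distance to the binarized query."""
    q = binarize(query)
    ranked = sorted(range(len(corpus)), key=lambda i: hamming(q, binarize(corpus[i])))
    return ranked[:top_k]

# Toy 4-dimensional embeddings for illustration only.
corpus = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.1, 0.2, -0.9],
]
query = [0.8, -0.3, 0.5, -0.6]
nearest = search(query, corpus, top_k=2)
```

The accuracy cost comes from discarding magnitudes: vectors 0 and 2 above become identical after binarization even though their float values differ.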


  • LlamaIndex


    Structured Image Extraction with GPT-4o 🖼️

    GPT-4o is state-of-the-art in integrating image/text understanding, and we've created a full cookbook showing you how to use GPT-4o to extract structured JSON from images. It does this much better than GPT-4V.

    We feed it detailed papercards (created by Val Andrei Fajardo) of various research papers (see diagram below), and measure quantitative metrics like failure rates and quality of extracted outputs. GPT-4o was able to extract structured output from every image (0% failure rate), and synthesizes much higher quality answers/insights than 4V.

    Check out our full cookbook here - also by Val Andrei Fajardo! https://lnkd.in/gptzdcUq


  • LlamaIndex


    Announcing our first-ever meetup at our brand-new office in San Francisco! Come hang out with us and hear from us and our friends at Activeloop and Tryolabs about the latest developments in generative AI.

    https://lnkd.in/d3HXb3wF


  • LlamaIndex


    Introducing GPT-4o + LlamaParse

    GPT-4o is the state-of-the-art model for multimodal understanding, meaning it also has state-of-the-art document parsing capabilities.

    LlamaParse is the platform for enabling LLM-powered parsing - it uses LLMs to extract documents from any file type in a performant, reliable fashion, offering state-of-the-art response quality for advanced document RAG.

    We're excited to offer GPT-4o as an explicit option in LlamaParse, which will use GPT-4o for extraction per page into markdown, instead of using our default parsers/models.

    Why:
    - GPT-4o is very good at parsing very complex documents into well-formatted markdown. Oftentimes it outperforms our default approaches.
    - This means that it can turn documents with very complex tables / charts into clean, indexable data for your RAG pipeline - higher response quality, lower hallucinations 📈

    Tradeoffs / Caveats ⚠️:
    - It's expensive 💵: Due to the cost of inference, using GPT-4o is currently $0.60 USD per page (while by default LlamaParse is $0.003 per page). This cost can spike quickly - beware!
    - You can specify your OpenAI key, in which case the marginal cost per page goes down to 0.3c per page.
    - This is a beta feature. Given the cost and latency, use this with caution!

    If you want to give this a shot, sign up for an account and check out our UI: https://lnkd.in/gbkxQAQd

    Notebook: https://lnkd.in/grwUVr-G
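To see how quickly the per-page cost adds up, here is a small worked calculation. The three prices are the ones quoted in the post above (reading "0.3c" as $0.003); the mode names, the function, and the 1,000-page example are hypothetical, and current pricing may differ.

```python
from decimal import Decimal

# Per-page prices in USD as quoted in the post; mode names are made up for this sketch.
PRICE_PER_PAGE = {
    "default": Decimal("0.003"),
    "gpt4o": Decimal("0.60"),
    "gpt4o_own_key": Decimal("0.003"),  # 0.3 cents with your own OpenAI key
}

def parse_cost(pages, mode):
    """Total parsing cost in USD for a given number of pages and pricing mode."""
    return pages * PRICE_PER_PAGE[mode]

pages = 1000
default_cost = parse_cost(pages, "default")  # 1000 * 0.003
gpt4o_cost = parse_cost(pages, "gpt4o")      # 1000 * 0.60
ratio = PRICE_PER_PAGE["gpt4o"] / PRICE_PER_PAGE["default"]

print(f"default: ${default_cost}, GPT-4o: ${gpt4o_cost}, ratio: {ratio}x")
```

At the quoted prices, a 1,000-page corpus costs 3 USD with the default parser and 600 USD in GPT-4o mode, a 200x difference. `Decimal` is used so the small per-page prices do not pick up floating-point rounding error.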


  • LlamaIndex


    LlamaParse šŸ¤ Quivr šŸ§ Quivr (YC W24) (Stan Girard) is a popular open-source platform where you can create personalized AI assistants over your data.Weā€™re excited to partner with Quivr to introduce LlamaParse - parse any complex document (.pdf, .pptx, .md) through our advanced parsing capabilities, ensuring that you get clean data before storing in the agentā€™s personalized memory. This ensures accurate retrieval and lower hallucinations during conversation.Check out the docs! https://lnkd.in/d7Ftjd5w


  • LlamaIndex


    GPT-4o support now available in create-llama! Get 90% of the way through building a chatbot over your data just by answering a few questions.


