From Static Files to Dynamic Insights: The Visual LLM Effect on PDFs

Ujjwal
4 min read · Nov 23, 2024

Using LLMs to deal with the one thing that every software engineer struggles with: PDFs (shudders).


Understanding PDFs

PDF stands for Portable Document Format, the go-to format people rely on to transfer files from one place to another. Compared to other document formats like Word (.docx), PDFs add a layer of perceived immutability. Perceived, because PDFs are actually editable.

Most people use PDFs to store and share valuable information, often assuming they are unchangeable.

But why do PDFs suck for software developers?

PDFs function like a “flat” file format. They describe where glyphs sit on a page, not what the text means: there is no semantic structure tying a value to a field, so there’s no straightforward way to programmatically extract deterministic data. The information you can extract from a PDF is essentially limited to what you can visually observe. What you see is what you get. This makes PDFs an awesome format for us humans to read, but a nightmare for developers trying to parse them programmatically.

In an ideal world, I could just write file.data.getFirstName() to extract the “Name” field from a PDF. Sadly, that’s not the world we live in.
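Here’s what that world actually looks like. A minimal sketch of naive extraction with pypdf (my library choice for illustration, not something this article’s demo uses; "form.pdf" is a placeholder path). All you get back is one flat string:

# Naive text extraction from a PDF: the result is one flat string,
# with no notion of fields, labels, or structure.
# Assumes pypdf is installed (pip install pypdf).
from pypdf import PdfReader

reader = PdfReader("form.pdf")
text = "".join(page.extract_text() or "" for page in reader.pages)

print(text)  # An undifferentiated blob of text; no getFirstName() in sight.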

How have humans made PDFs easier?

We’ve improved the usability of PDFs by introducing forms and standardized document formats. This adds a layer of structure to the data and makes it easier to rely on techniques like OCR to parse documents.

For example, you’ve probably seen forms with rows of single-character boxes, one letter or digit per box.

You can parse this kind of form by training a neural network on an MNIST-like dataset, the sort of handwritten-character data that’s been around since the 1990s. Yes, it’s that easy!
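As a rough sketch (my illustration, not this article’s code): crop each character box, resize it to match your training data, and classify it. scikit-learn ships a small 8x8 MNIST-like digits dataset that makes the idea concrete:

# Train a tiny digit classifier on scikit-learn's built-in 8x8 digits set
# (an MNIST-like dataset). A cropped form box, resized to 8x8 grayscale
# and flattened, could be passed to clf.predict the same way.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)
print(f"Held-out digit accuracy: {clf.score(X_test, y_test):.2%}")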

Bottom line, if you’re dealing with PDFs, hope to god that they are standardized forms and documents.

How visual LLMs have changed PDF parsing

Now, instead of relying on OCR and painstakingly parsing its output, you can simply query the document directly for the information you need. What’s even more impressive is how accessible and affordable this technology has become.

For example, I used OpenAI’s GPT-4o and the instructor library to parse a W-2 tax form effortlessly. Even better? It took fewer than 20 lines of code!

If you’d just like to see the code, here’s the Colab notebook.

Step 1 — Install dependencies and initialize variables

For this demo, we’ll use the openai SDK to interact with the LLM, the instructor library to get structured outputs, and pydantic to define the output schema and embed prompts in its fields.

!pip install openai
!pip install instructor
!pip install pydantic

We’ll treat all documents as images: GPT-4o’s vision input accepts images, so an actual PDF would first need to be rendered page by page.
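One way to do that rendering (my assumption, not part of this article’s demo, which starts from a PNG) is pdf2image, which wraps the poppler utilities:

# Render each PDF page to a PIL image so it can be sent to a vision model.
# Assumes pdf2image is installed (pip install pdf2image) plus the poppler
# system dependency; "form.pdf" is a placeholder path.
from pdf2image import convert_from_path

pages = convert_from_path("form.pdf", dpi=200)
pages[0].save("form_page1.png")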

We define the input document (image_path) and our OpenAI API key.

image_path = "https://www.patriotsoftware.com/wp-content/uploads/2024/03/2024-Form-W-2-1.png"
openai_key = "sk-xxx"

Step 2 — Define the output and prompts

As mentioned above, we’re parsing a W-2 form. Here’s the sample W-2 we’ll use (the image at image_path).

Let’s say we are building software that takes a W-2 as input and returns the amount of federal income tax withheld. You can add more fields if you want, but for now we’ll move ahead with just one.

Let’s define our pydantic class and our system and user prompts.

from pydantic import BaseModel, Field

class W2Fields(BaseModel):
    federal_tax_withheld: int = Field(description="Amount of federal tax withheld for this W-2")

system_prompt = "You are an expert tax document reader and analyzer"
user_prompt = "Look at the input W-2 form and extract the relevant fields from it. Double check your answers."

The awesome part about using instructor is that you can use the description parameter to insert a prompt for that particular field. So in case you add more fields to the class, you don’t need to change the user_prompt.
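For instance, here’s a hypothetical extension (the extra field is my own example, not from the demo). Only the class changes; both prompts stay exactly as they are:

class W2FieldsV2(BaseModel):
    federal_tax_withheld: int = Field(description="Amount of federal tax withheld for this W-2")
    # Hypothetical extra field: its description doubles as the per-field
    # extraction prompt, so system_prompt and user_prompt need no edits.
    wages_tips_other_comp: int = Field(description="Wages, tips, and other compensation (Box 1 of the W-2)")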

Step 3 — Make the LLM go brrr

Let’s use everything we have defined so far and call the LLM!

from openai import OpenAI
from instructor import from_openai

client = from_openai(OpenAI(api_key=openai_key))

response: W2Fields = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image_url", "image_url": {"url": image_path}},
            ],
        },
    ],
    response_model=W2Fields,
)

print(response.model_dump_json(indent=2))

Output:

{
  "federal_tax_withheld": 4092
}

Sweet!

In less than 10 minutes, we transformed a PDF form into JSON data using fewer than 20 lines of code. That’s pretty incredible!

(wait, should we build a TurboTax competitor?!)
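Before wrapping up, one practical note: if the model’s answer fails pydantic validation, instructor can feed the validation error back to the model and retry. A minimal sketch reusing the variables above (max_retries=2 is my own choice, not something the demo requires):

# instructor re-prompts the model with the validation error on failure,
# so a transient parsing miss doesn't crash the pipeline.
response: W2Fields = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image_url", "image_url": {"url": image_path}},
            ],
        },
    ],
    response_model=W2Fields,
    max_retries=2,  # assumption: up to two automatic retries on validation failure
)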

Hopefully, this demonstration showcased the power of visual LLMs for parsing data. In the coming days, I’ll dive into more comprehensive examples of document parsing, explore where OCR fits into the equation, and share tips on leveraging LLMs effectively for this task. Stay tuned!

