Using LLMs to deal with the one thing that every software engineer struggles with: PDFs (shudders).
Understanding PDFs
PDF stands for Portable Document Format, the go-to format people rely on to transfer documents from one place to another. Compared to other document formats like Word (.docx), PDFs add a layer of perceived immutability. Perceived, because PDFs are actually editable.
Most people use PDFs to store and share valuable information, often assuming they are unchangeable.
But why do PDFs suck for software developers?
PDFs function like a “flat file format”. They carry little to no metadata about the text they contain, so there’s no straightforward way to programmatically extract data in a deterministic way. The information you can extract from a PDF is essentially limited to what you can visually observe: what you see is what you get. That makes PDFs an awesome format for us humans to read, but a nightmare for developers trying to parse them programmatically.
In an ideal world, I could just write file.data.getFirstName() to extract the “Name” field from a PDF. Sadly, that’s not the world we live in.
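To make that concrete, here’s a minimal sketch of what naive extraction looks like today. It assumes the pypdf library and a hypothetical w2.pdf file; all you get back is one flat, unlabeled blob of text.
# A rough illustration (assumes pypdf is installed and "w2.pdf" exists).
from pypdf import PdfReader

reader = PdfReader("w2.pdf")
text = reader.pages[0].extract_text()  # one big, unstructured string

# There is no getFirstName() here; you're left to regex or eyeball the blob.
print(text[:500])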
How have humans made PDFs easier?
We’ve improved the usability of PDFs by introducing forms and standardized document formats. This adds a layer of structure to the data and makes it easier to rely on techniques like OCR to parse documents.
For example, you’ve probably seen forms where each character goes in its own little box.

You can parse such a form by training a neural network on an MNIST-like handwritten-character dataset from the 1990s. Yes, it’s that easy!
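In practice, most people reach for off-the-shelf OCR rather than training their own model. Here’s a minimal sketch of that classic approach, assuming pytesseract and Pillow are installed; the file name and box coordinates are made up for illustration.
# Classic OCR on a fixed-layout form (coordinates are hypothetical).
from PIL import Image
import pytesseract

page = Image.open("scanned_form.png")

# Because the form is standardized, we know roughly where each field lives.
name_box = page.crop((100, 200, 600, 260))  # (left, top, right, bottom) in pixels
print(pytesseract.image_to_string(name_box).strip())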
Bottom line, if you’re dealing with PDFs, hope to god that they are standardized forms and documents.
How visual LLMs have changed PDF parsing
Now, instead of relying on OCR and painstakingly parsing its output, you can simply query the document directly for the information you need. What’s even more impressive is how accessible and affordable this technology has become.
For example, I used OpenAI’s GPT-4o and the instructor library to parse a W-2 tax form effortlessly. Even better? It took fewer than 20 lines of code.
If you’d just like to see the code, here’s the Colab notebook.
Step 1 — Install dependencies and initialize variables
For this demo, we’ll use the openai SDK to interact with the LLM, the instructor library to get structured outputs, and pydantic to define the structured output and embed prompts in fields.
!pip install openai
!pip install instructor
!pip install pydantic
We’ll treat every PDF as an image: GPT-4o’s vision input works on images, so a PDF page just needs to be rendered to an image first. (The sample document below is already a PNG.)
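If your source really is a PDF rather than an image, one way to render it is the pdf2image library (my choice for this sketch, not something the demo depends on; any PDF-to-image renderer works):
# Hypothetical conversion step: render page 1 of a PDF to a PNG (requires pdf2image and poppler).
from pdf2image import convert_from_path

pages = convert_from_path("w2.pdf", dpi=200)  # one PIL image per page
pages[0].save("w2_page1.png")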
We define the input document (image_path) and our OpenAI API key.
image_path = "https://www.patriotsoftware.com/wp-content/uploads/2024/03/2024-Form-W-2-1.png"
openai_key = "sk-xxx"
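A small aside: hardcoding the key is fine for a throwaway notebook, but for anything real you’d read it from an environment variable instead (assuming you’ve exported OPENAI_API_KEY):
import os

# Read the key from the environment rather than committing it to the notebook.
openai_key = os.environ["OPENAI_API_KEY"]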
Step 2 — Define the output and prompts
As I mentioned above, we are parsing a W-2 form. Here’s the W-2 we are parsing:

Let’s say we are building software that takes a W-2 as input and gives us the amount of federal income tax withheld. You can add more fields if you want, but for now we’ll move ahead with just this one field.
Let’s define our pydantic class and our system and user prompts.
from pydantic import BaseModel, Field

class W2Fields(BaseModel):
    federal_tax_withheld: int = Field(description="Amount of federal tax withheld for this W-2")

system_prompt = "You are an expert tax document reader and analyzer"
user_prompt = "Look at the input W-2 form and extract the relevant fields from it. Double check your answers."
The awesome part about using instructor is that you can use the description parameter to insert a prompt for that particular field. So in case you add more fields to the class, you don’t need to change the user_prompt.
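For example, pulling box 1 wages as well would just be one more annotated field. The extra field name and description below are my own hypothetical additions, not part of the demo:
from pydantic import BaseModel, Field

# Hypothetical extension: each new field carries its own mini-prompt via its description.
class W2FieldsExtended(BaseModel):
    federal_tax_withheld: int = Field(description="Amount of federal tax withheld for this W-2")
    wages_tips_compensation: int = Field(description="Box 1: wages, tips, and other compensation")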
Step 3 — Make the LLM go brrr
Let’s use everything we have defined so far and call the LLM!
from openai import OpenAI
from instructor import from_openai

# Wrap the OpenAI client with instructor so the response is parsed straight into our pydantic model.
client = from_openai(OpenAI(api_key=openai_key))

response: W2Fields = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "image_url",
                "image_url": {"url": image_path},
            },
        ]},
    ],
    response_model=W2Fields,
)
print(response.model_dump_json(indent=2))
Output:
{
  "federal_tax_withheld": 4092
}
Sweet!
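A couple of closing notes on the code. Because response is a real W2Fields instance, response.federal_tax_withheld is already an int you can compute with directly; there’s no JSON string to parse. And while the demo passes a hosted image URL, a W-2 sitting on your disk can be sent as a base64 data URL instead (a sketch under that assumption; the file name is hypothetical):
import base64

# Encode a local image and pass it in place of the hosted URL.
with open("w2_page1.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

image_path = f"data:image/png;base64,{b64}"  # then reuse the same chat.completions.create call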
In less than 10 minutes, we transformed a PDF form into JSON data using fewer than 20 lines of code. That’s pretty incredible!
(wait, should we build a TurboTax competitor?!)
Hopefully, this demonstration showcased the power of visual LLMs for parsing data. In the coming days, I’ll dive into more comprehensive examples of document parsing, explore where OCR fits into the equation, and share tips on leveraging LLMs effectively for this task. Stay tuned!