Tutorial | Extracting structured data from images using a large model

Once you have mastered large-model techniques, extracting structured data from images is no longer difficult.
Core content:
1. Environment configuration: the steps and packages needed to call a large model from Python
2. The Ollama application: running, creating, and sharing large language models locally
3. Experimental code: how to use Ollama to extract information from images
In the rapidly developing field of artificial intelligence, adding vision capabilities to large language models makes it possible to interpret image semantics and extract structured data from images.
1. Environment Configuration
To call a large model in Python, you must first configure the corresponding environment.
1.1 Install the Python packages
pip3 install ollama
pip3 install pydantic
pip3 install instructor
1.2 Install Ollama
Ollama is an open-source application that lets you run, create, and share large language models locally through a command-line interface on macOS, Linux, and Windows.
Ollama can access various LLMs directly from its library; each can be downloaded with a single command, and once downloaded you can start using it by running one more command. This is very convenient for users whose workflow revolves around a terminal window. For a detailed tutorial on installing, configuring, and using Ollama, see Tutorial | How to use Ollama to download & use local large language models.
1.3 Install the large vision models
As of February 22, 2025, there are 7 large vision models published on the Ollama website. Here are two of them.
llama3.2-vision is better at recognizing English text in images, while minicpm-v, which is built on Qwen, is better at recognizing Chinese text in images.
Open the command line (cmd on Windows, Terminal on macOS) and run the installation commands:
ollama pull llama3.2-vision:11b
ollama pull minicpm-v:8b
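If you prefer to stay inside Python, the ollama package also exposes a pull function. Below is a minimal sketch, assuming the Ollama service is already running and the model tags match the commands above:
import ollama

# Download the two vision models through the Python client instead of the CLI
# (assumes the Ollama service is reachable at its default local address)
for model_tag in ['llama3.2-vision:11b', 'minicpm-v:8b']:
    ollama.pull(model_tag)
    print(f'{model_tag} downloaded')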
1.4 Start the Ollama service
Open the command line (cmd on Windows, Terminal on macOS) and run the command that starts the service:
ollama serve
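Once the service is running, a quick sanity check from Python is to list the locally installed models; the two vision models pulled above should appear in the output. A minimal sketch (the exact shape of the response may differ slightly across ollama package versions):
import ollama

# List the models installed locally and print them
for m in ollama.list()['models']:
    print(m)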
2. Experimental Code
2.1 Unstructured Output
The screenshot file name is test_screen.png
import ollama

# Screenshot file of the paper: test_screen.png
# Note that the code file and the screenshot file are in the same folder
response = ollama.chat(
    model='minicpm-v',
    messages=[{
        'role': 'user',
        'content': 'What field is this paper about?',
        'images': ['test_screen.png']
    }]
)
print(response)
Run
ChatResponse(model='minicpm-v', created_at='2025-02-22T13:11:25.766017Z', done=True, done_reason='stop', total_duration=12956488125, load_duration=819433041, prompt_eval_count=461, prompt_eval_duration=9630000000, eval_count=147, eval_duration=2499000000, message=Message(role='assistant', content='This image is of the title page of an article titled "On or Off Track: How (Broken) Cues Influence Consumer Decisions". Written by Jackie Silverman and Alexandra Balaschi, the article explores the consequences of new technological tracking of consumer behavior. Across seven studies, the research found that sustained behavioral tracks trigger reinforcement following high consumption, and that breaking these tracks has the opposite effect, thus influencing consumer decision making. The research methods used included tracking, behavioral analysis, and tools and techniques such as tracking and monitoring to understand the impact of cues across different domains (e.g., sports, learning). Keywords list the focus areas of the article: Circuit Breakers, Behavior Tracking and Recording, Consumer Motivation, Engagement.', images=None, tool_calls=None))
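The full ChatResponse carries timing metadata; if you only need the model's answer, you can read the message content directly (this field access should work with recent versions of the ollama package):
# Print only the model's answer text, without the metadata
print(response['message']['content'])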
2.2 Structured Output
Write a more detailed prompt, define the data structure with typing and pydantic, and have the model return dictionary-like data.
import instructor
from openai import OpenAI
from typing import List
from pydantic import BaseModel

PROMPT = """Please analyze the provided image and extract the following information from it:
- Title
- Subject
- Field
Please return the results in the following format:
{
"title": "Title of the paper",
"subject": "Subject of the paper",
"field": "Research field of the paper",
}"""

# The large model minicpm-v has been installed locally
model_name = 'minicpm-v'
base_url = 'http://127.0.0.1:11434/v1'
api_key = 'NA'

# Screenshot file of the paper: test_screen.png
# Note that the code file and the screenshot file are in the same folder
image = instructor.Image.from_path("test_screen.png")

# Point an OpenAI-compatible client at the local Ollama service and wrap it with instructor
client = instructor.from_openai(
    OpenAI(
        base_url=base_url,
        api_key=api_key,  # required, but unused
    ),
    mode=instructor.Mode.JSON,
)

# Data structure the model output will be validated against
class Paper(BaseModel):
    title: str
    subject: List[str]
    field: List[str]

# Create structured output
result = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "assistant", "content": PROMPT},
        {"role": "user", "content": image},
    ],
    response_model=Paper,
    temperature=0.0,
)
result.model_dump()
Run
{'title': 'On or Off Track: How (Broken) Streaks Affect Consumer Decisions',
 'subject': ['streaks, behavioral tracking and logging, technology, goals and motivation'],
 'field': ['consumer behavior', 'marketing research', 'engagement strategies']}
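Since result is a validated Paper instance, the extracted fields can be used as ordinary typed attributes or serialized to JSON, for example:
# Access individual fields of the validated Paper instance
print(result.title)
print(result.field)

# Serialize the result to a JSON string, e.g. for saving to disk
print(result.model_dump_json(indent=2))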
Discussion
In Da Deng's tests, structured output was prone to errors, while unstructured output was comparatively more stable.
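One way to make structured output more robust is to let instructor re-ask the model when the returned JSON fails pydantic validation. The sketch below simply adds max_retries to the earlier call; this is an instructor feature, and the exact retry behavior may vary by version:
# Same call as above, but instructor will re-prompt the model up to 3 times
# if the response cannot be validated against the Paper model
result = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "assistant", "content": PROMPT},
        {"role": "user", "content": image},
    ],
    response_model=Paper,
    temperature=0.0,
    max_retries=3,
)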