Tutorial | Extracting structured data from images using a large model

Written by
Audrey Miles
Updated on: July 12, 2025
Recommendation

Once you have mastered large-model technology, extracting structured data from images is no longer difficult.

Core content:
1. Environment configuration: the steps and required packages for calling a large model from Python
2. Ollama in practice: running, creating, and sharing large language models locally
3. Experimental code: how to use Ollama to extract information from images


In the rapidly developing field of artificial intelligence, integrating visual capabilities into large language models makes it possible to interpret image semantics and extract structured data from images.




1. Environment Configuration

To call a large model in Python, you must first configure the corresponding environment.


1.1 Install Python packages


pip3 install ollama
pip3 install pydantic
pip3 install instructor
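
As a quick sanity check that the installation succeeded, the packages used later in this tutorial should all import without errors (a minimal sketch; the openai package is installed automatically as a dependency of instructor):


# Sanity check: all packages used in the experimental code import cleanly
import ollama
import pydantic
import instructor
from openai import OpenAI

print("Environment is ready")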


1.2 Install Ollama

Ollama is an open-source application that lets you run, create, and share large language models locally through a command-line interface on macOS, Linux, and Windows.

Ollama can download various LLMs directly from its library with a single command, and after the download you can start using a model by executing another single command. This is very convenient for users whose workflow revolves around terminal windows. For a detailed tutorial on how to install, configure, and use Ollama, please read Tutorial | How to use Ollama to download & use local large language models.




1.3 Install the Large Models

As of February 22, 2025, there are 7 large vision models published on the Ollama website. Here are two of them:

  • llama3.2-vision is better at recognizing English text in images
  • minicpm-v is based on Qwen and is better at recognizing Chinese text in images


Open the command line cmd (Terminal on a Mac) and execute the installation commands:


ollama pull llama3.2-vision:11b
ollama pull minicpm-v:8b


1.4 Start the Ollama service

Open the command line cmd (Terminal on a Mac) and execute the command that starts the service:


ollama serve
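
Once the service is running, you can confirm from Python that the client can reach it. This is a minimal sanity check using the ollama package installed in section 1.1; it simply lists the locally available models:


import ollama

# If the Ollama service is reachable on the default port (11434),
# this prints the locally installed models; a connection error means
# the service has not been started yet.
print(ollama.list())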



2. Experimental Code


2.1 Unstructured Output

The screenshot file is named test_screen.png



import ollama


# Screenshot of the paper: test_screen.png
# Note: the code file and the screenshot must be in the same folder

response = ollama.chat(
    model='minicpm-v',
    messages=[{
        'role': 'user',
        'content': 'What field is this paper about?',
        'images': ['test_screen.png']
    }]
)

print(response)

Run


ChatResponse(model='minicpm-v', created_at='2025-02-22T13:11:25.766017Z', done=True, done_reason='stop', total_duration=12956488125, load_duration=819433041, prompt_eval_count=461, prompt_eval_duration=9630000000, eval_count=147, eval_duration=2499000000, message=Message(role='assistant', content='This image is of the title page of an article titled "On or Off Track: How (Broken) Cues Influence Consumer Decisions". Written by Jackie Silverman and Alexandra Balaschi, the article explores the consequences of new technological tracking of consumer behavior. Across seven studies, the research found that sustained behavioral tracks trigger reinforcement following high consumption, and that breaking these tracks has the opposite effect, thus influencing consumer decision making. The research methods used included tracking, behavioral analysis, and tools and techniques such as tracking and monitoring to understand the impact of cues across different domains (e.g., sports, learning). Keywords list the focus areas of the article: Circuit Breakers, Behavior Tracking and Recording, Consumer Motivation, Engagement.', images=None, tool_calls=None))
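
The full ChatResponse object includes timing metadata that you usually do not need. If you only want the model's answer, print just the message content (a small follow-on sketch, assuming the response variable from the script above):


# Print only the generated text instead of the whole response object
print(response['message']['content'])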


2.2 Structured Output

Write a more detailed prompt, define the data structure with typing and pydantic, and have the result returned as dictionary-like data.


import instructor
from openai import OpenAI
from typing import List
from pydantic import BaseModel


PROMPT =  """Please analyze the provided image and extract the following information from it:
- Title
- Subject
- Field


Please return the results in the following format:
{
    "title": "Title of the paper",
    "subject": "Subject of the paper",
    "field": "Research field of the paper",
}"""



# The large model minicpm-v has already been installed locally
model_name = 'minicpm-v'
base_url = 'http://127.0.0.1:11434/v1'
api_key = 'NA'


# Screenshot of the paper: test_screen.png
# Note: the code file and the screenshot must be in the same folder
image = instructor.Image.from_path("test_screen.png")




client = instructor.from_openai(
    OpenAI(
        base_url=base_url,
        api_key=api_key,  # required, but unused
    ),
    mode=instructor.Mode.JSON,
)


class Paper(BaseModel):
    title: str
    subject: List[str]
    field: List[str]




# Create structured output
result = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "assistant", "content": PROMPT},
        {"role": "user", "content": image},
    ],
    response_model=Paper,
    temperature=0.0
)


result.model_dump()

Run


{'title': 'On or Off Track: How (Broken) Streaks Affect Consumer Decisions',
 'subject': ['streaks, behavioral tracking and logging, technology, goals and motivation'],
 'field': ['consumer behavior', 'marketing research', 'engagement strategies']}
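

If you want to persist the extracted fields, pydantic can also serialize the result straight to JSON. A minimal sketch, assuming the result object from above; the output file name paper_info.json is just an example:


# Write the structured result to a JSON file (paper_info.json is an example name)
with open('paper_info.json', 'w', encoding='utf-8') as f:
    f.write(result.model_dump_json(indent=2))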



Discussion

In Da Deng's tests, the structured output was more error-prone, while the unstructured output was comparatively more stable.
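
One way to make structured output more robust is to let instructor retry when the model's reply fails pydantic validation. The sketch below is not part of the original experiment; it reuses the client, PROMPT, image, and Paper model defined above and adds instructor's max_retries parameter:


# Retry up to 3 times when the response does not validate against Paper
result = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "assistant", "content": PROMPT},
        {"role": "user", "content": image},
    ],
    response_model=Paper,
    temperature=0.0,
    max_retries=3,  # instructor re-asks the model on validation errors
)
print(result.model_dump())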