Amazing! Can you do image recognition just by talking? This magic tool makes AI programming no longer a dream! (Code included)

Written by
Silas Grey
Updated on: July 13, 2025
Recommendation

Use natural language to command AI to complete image recognition, VisionAgent makes programming easier!

Core content:
1. Introduction and functional features of VisionAgent tool library
2. How VisionAgent uses natural language commands to control AI for image processing
3. Step-by-step guide to install and use VisionAgent


Recently, I discovered a super cool tool library called VisionAgent [1], which lets you use natural language to command AI to complete all kinds of image recognition tasks. It is so convenient! The tool is open source. I know many people will ask whether there is an online demo to try. Yes, it is here

In the past, if we wanted to do image recognition, such as counting how many cans of Coke there are in a picture, we had to write a lot of code and adjust the parameters until our heads went bald. Now with VisionAgent, you only need to tell it: "Hey, help me count how many cans of Coke there are in this picture!", and it will do the rest!

What exactly is VisionAgent?

Simply put, VisionAgent is a tool library that lets you command AI to process images in "human language". Behind the scenes are popular large language models (LLMs), such as Anthropic's Claude-3.5 and OpenAI's o1.

These LLMs act as VisionAgent's "brain": they understand your instructions and then generate the corresponding code to complete the task. Just by speaking, you can have AI do image recognition for you. Isn't that amazing?
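To make the "brain" idea concrete, here is a toy sketch of my own. This is NOT VisionAgent's actual implementation; it just illustrates the flow of reading a natural-language instruction and deciding which (hypothetical) vision tool to invoke:

```python
# Toy illustration only: the "brain" reads an instruction
# and decides which vision tool to call.

def pick_tool(instruction: str) -> str:
    """Map a natural-language instruction to a hypothetical tool name."""
    text = instruction.lower()
    if "count" in text or "how many" in text:
        return "object_detection"   # counting = detect objects, then len()
    if "track" in text or "follow" in text:
        return "video_tracking"
    return "object_detection"       # a reasonable default

print(pick_tool("Count the number of people in this image"))  # object_detection
print(pick_tool("Track the puppy in this video"))             # video_tracking
```

The real system goes much further: the LLM writes full Python programs against VisionAgent's tool library rather than picking from a fixed menu.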

What can VisionAgent do?

VisionAgent is really powerful! It can help you:

  •  Counting: Count the heads, Coke cans, cars, etc. in a picture.
  •  Object recognition: Recognize various objects in a picture, such as cats, dogs, tables, and chairs.
  •  Object tracking: Track an object through a video, such as following a puppy.
  •  Code generation: Directly generate executable image-processing code that you can study or modify yourself.

How to use VisionAgent?

To use VisionAgent, you first have to install it:

pip install vision-agent

Next, you need to prepare your Anthropic and OpenAI API keys:

export ANTHROPIC_API_KEY="your-api-key"
export OPENAI_API_KEY="your-api-key"

Once you have these, you can start playing with VisionAgent!
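Before diving in, it can save some head-scratching to confirm the keys are actually visible to Python. This little helper is my own addition, not part of VisionAgent:

```python
import os

def missing_keys(required=("ANTHROPIC_API_KEY", "OPENAI_API_KEY")):
    """Return the names of any required environment variables that are unset."""
    return [name for name in required if not os.environ.get(name)]

missing = missing_keys()
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All API keys are set, ready to go!")
```

If a key shows up as missing, remember that `export` only affects the shell session it was run in.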

Example 1: Counting heads

Want to know how many people are in a picture? Easy!

from vision_agent.agent import VisionAgentCoderV2
from vision_agent.models import AgentMessage

# Create a VisionAgent instance
agent = VisionAgentCoderV2(verbose=True)

# Let VisionAgent generate code to count heads
code_context = agent.generate_code(
    [
        AgentMessage(
            role="user",
            content="Count the number of people in this image",
            media=["people.png"]  # Assume you have a picture named people.png
        )
    ]
)

# Save the generated code (and its test) to a file
with open("generated_code.py", "w") as f:
    f.write(code_context.code + "\n" + code_context.test)

Run this code and VisionAgent will generate a file named generated_code.py containing the people-counting code! You can run this file directly or modify it yourself.

Example 2: Directly use the VisionAgent tool

VisionAgent not only generates code, it also provides a series of useful tools that you can use directly.

For example, you want to find all the people in a picture and frame them:

import vision_agent.tools as T
import matplotlib.pyplot as plt

# Load the image
image = T.load_image("people.png")

# Detect people in the image
dets = T.countgd_object_detection("person", image)

# Draw the detection results (bounding boxes) on the image
viz = T.overlay_bounding_boxes(image, dets)

# Save the result
T.save_image(viz, "people_detected.png")

# Display the result
plt.imshow(viz)
plt.show()

This code will generate a picture called people_detected.png in which every detected person is framed with a bounding box!
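You don't have to draw everything the detector returns. Assuming the detections come back as a list of dicts with "label", "score", and "bbox" fields (the shape vision-agent's detection tools use), a small helper of my own can drop low-confidence hits before drawing:

```python
def filter_dets(dets, min_score=0.5):
    """Keep only detections whose confidence is at least min_score."""
    return [d for d in dets if d["score"] >= min_score]

# Hypothetical detections, in the assumed list-of-dicts shape:
dets = [
    {"label": "person", "score": 0.92, "bbox": [0.1, 0.1, 0.3, 0.8]},
    {"label": "person", "score": 0.31, "bbox": [0.5, 0.2, 0.6, 0.7]},
]
confident = filter_dets(dets)
print(len(confident))  # 1
```

You could then pass `confident` instead of `dets` to the overlay step so only solid detections get framed.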

Example 3: Video Processing

VisionAgent can also process videos! For example, if you want to track an object in a video:

import vision_agent.tools as T

# Extract each frame and the corresponding timestamp from the video
frames_and_ts = T.extract_frames_and_timestamps("people.mp4")  # Assume you have a video called people.mp4
frames = [f["frame"] for f in frames_and_ts]

# Track "person" across the frames
tracks = T.countgd_sam2_video_tracking("person", frames)

# Draw the tracking results (segmentation masks) on each frame
viz = T.overlay_segmentation_masks(frames, tracks)

# Save the processed video
T.save_video(viz, "people_detected.mp4")

This code will generate a video called people_detected.mp4 in which every person has been tracked and tagged!
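One practical note: long videos produce a lot of frames, and tracking every single one can be slow. A tiny helper of my own subsamples the frame list before tracking; it works on any Python list, so a list of integers stands in for frames here:

```python
def subsample(frames, every_n=5):
    """Keep every n-th frame (always including the first)."""
    return frames[::every_n]

frames = list(range(23))          # stand-in for 23 video frames
print(len(subsample(frames, 5)))  # 5  (frames 0, 5, 10, 15, 20)
```

The trade-off is coarser tracking: the fewer frames you keep, the jumpier the resulting masks will look.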

Want to use another LLM?

VisionAgent uses Anthropic Claude-3.5 and OpenAI o1 by default, but you can also switch to other LLMs.

Just modify config.py under the vision_agent/configs directory. For example, if you want to use only Anthropic models, copy anthropic_config.py over config.py:

cp vision_agent/configs/anthropic_config.py vision_agent/configs/config.py

Similar tools?

I won't compare it with other similar tools here, because VisionAgent is itself an integrated tool: it uses an LLM as its brain and calls various vision toolkits to handle visual tasks. Its power therefore lies in that "brain", that is, the choice of LLM. In fact, I haven't seen a similar tool yet.