Problems encountered when building code execution tools based on Function Calling

Exploring how AI models execute complex code and accomplish intelligent tasks.
Core content:
1. Advanced capabilities of AI models: understanding intent and executing complex instructions
2. Implementing the Function Calling mechanism to allow the model to call tools and run code
3. How the backend system parses and executes the Python code requested by the model
In the process of developing MolaGPT, I have always believed that plain text conversation is only one manifestation of a model's capabilities. What is really exciting is a model that can actively understand intent, plan like a human, and execute complex instructions to solve hard problems: calling tools, generating charts, analyzing data.
To achieve this, you need to build a complete "toolbox" for the model; the language model itself must be capable enough to call the tools, and those capabilities require matching backend interfaces and stable, structured outputs.
During this period, I finally had the opportunity to develop an Agent based on Function Calling (which now seems to have been renamed Tool Calling). My goals were clear: first, give the model the ability to search the web; second, and more importantly, let the model run Python code.
Hand-rolling my own OpenAI-style Function Calling
The core idea of my backend is inspired by OpenAI's Function Calling mechanism. The principle can be summarized as follows: when the model recognizes that the user's intention needs to be completed with the help of external tools, it no longer directly generates the final answer, but generates a structured JSON object. This JSON object accurately describes the name of the function to be called (such as execute_python_code) and the parameters required to execute the function (for example, a Python code string).
This structured request is then sent to my backend system. After receiving this request, the backend parses the JSON content, matches the predefined list of available functions ($available_functions) according to the name field, and finds the corresponding processing logic (for example, execute_python_code). After the backend completes the task, it returns the execution result (such as the standard output or error message of the code) to the model for a second request, and then the model combines the context to generate the final reply to the user.
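The dispatch step described above can be sketched in a few lines. Below is a minimal Python sketch of the idea (my production backend is PHP, and the real execute_python_code runs inside a Docker sandbox rather than a bare exec(); the handlers here are simplified stand-ins for illustration):

```python
import contextlib
import io
import json

# Stand-in handlers: the real backend implements these in PHP, and the real
# code handler executes inside an isolated Docker sandbox, not a bare exec().
def execute_python_code(code):
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def search_web(query):
    return f"(search results for: {query})"

available_functions = {
    "execute_python_code": execute_python_code,
    "search_web": search_web,
}

def dispatch(tool_call_json):
    """Parse the model's tool-call JSON, look up the handler by name, run it."""
    call = json.loads(tool_call_json)
    handler = available_functions[call["name"]]
    return handler(**call["arguments"])

print(dispatch('{"name": "execute_python_code", "arguments": {"code": "print(2+2)"}}'))
# prints: 4
```

The returned string is what gets fed back to the model in the second request, so it can compose the final answer from the tool output.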
For example, if the model determines that the Python code print(2+2) needs to be executed, it will send a JSON similar to the following to the backend:
{
  "name": "execute_python_code",
  "arguments": { "code": "print(2+2)" }
}
The backend parses the name as execute_python_code, extracts the code field "print(2+2)" in arguments, and then passes it to a Python interpreter in an isolated Docker sandbox environment for execution. After execution, its stdout is captured and the result "4" is returned to the model. This usually involves two interactions with the model: one is the model outputting the Function Call request, and the other is the backend returning the execution result for the model's reference.
In my backend implementation, I defined two core functions for the model:
search_web: Used to search the Internet for the latest information.
execute_python_code: Used to execute Python code, which is the focus of this article.
This is the function description information defined by the backend, which is used to inform the model of the existence and usage of these two tools:
$available_functions = [
    [
        "type" => "function",
        "function" => [
            "name" => "search_web",
            "description" => "Search the web for the latest information and authoritative content",
            "parameters" => [
                "type" => "object",
                "properties" => [
                    "query" => [
                        "type" => "string",
                        "description" => "The search query"
                    ]
                ],
                "required" => ["query"]
            ]
        ]
    ],
    [
        "type" => "function",
        "function" => [
            "name" => "execute_python_code",
            "description" => "Execute a snippet of Python code and return the standard output",
            "parameters" => [
                "type" => "object",
                "properties" => [
                    "code" => [
                        "type" => "string",
                        "description" => "The Python code to execute, e.g. print(2+2)"
                    ]
                ],
                "required" => ["code"]
            ]
        ]
    ]
];
The model will intelligently determine whether these tools need to be called based on the conversation context, and automatically generate corresponding parameter requests for execution by the backend.
Several problems encountered in the process and their solutions
1. Models don’t always use print()
When I initially tested the execute_python_code function, I found that the code generated by the model often failed to execute. After checking the log, I found that the problem was that the code generated by the model often did not explicitly use the print() function to output the final result. The model may have treated the backend Python environment as an interactive environment such as Jupyter Notebook. For example, when I asked the model to calculate 2+2, it might generate the following code and request execution:
2 + 2

# or

result = 2 + 2
result
In interactive environments such as Jupyter Notebook this is indeed fine, but my backend is a plain Docker environment built from the official Python image. When the Python executor runs this code, there is no print() statement, so the standard output is empty. The model therefore never receives the result "4" and concludes that the code execution failed.
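This failure mode is easy to reproduce: if you capture stdout around exec(), a bare trailing expression writes nothing, while an explicit print() does. A minimal sketch, independent of my actual sandbox:

```python
import contextlib
import io

def run_and_capture(code):
    """Execute a code string and return whatever it wrote to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

print(repr(run_and_capture("2 + 2")))         # '' -- the bare expression writes nothing
print(repr(run_and_capture("print(2 + 2)")))  # '4\n'
```

Unlike the Python REPL, exec() never echoes the value of an expression statement; only explicit writes to stdout are visible to the backend.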
1.1 Failed attempt: brute force concatenation of print()
My initial idea was simple: could the backend just mechanically wrap the last line in print(...)? I soon realized this approach was too crude. Not every snippet needs its last expression printed: the code may only define functions or import libraries, where no print() belongs at all. Blindly concatenating print() could even produce syntax errors or unexpected behavior.
1.2 Solution 1: Use AST to automatically complete print()
After some research (and consulting a large model), I decided to use an AST (Abstract Syntax Tree) to handle this problem intelligently. An AST parses the code into a tree structure where each node represents a syntax element of the program, such as an expression, a statement, or a function definition. The root of this tree is the entire module, and its child nodes may be assignment statements, function calls, conditionals, and other structures.
The main advantage of using an AST is that it represents the code's syntax and semantics precisely, ignoring non-essential content such as comments and blank lines, which lets the backend reliably locate the last expression and wrap it in print().
import ast

class AutoPrintTransformer(ast.NodeTransformer):
    """
    AST transformer that automatically wraps the last expression in print()
    """
    def __init__(self):
        super().__init__()
        self.modified = False

    def visit_Module(self, node):
        """
        Process the module node and add print() to the last expression
        """
        self.generic_visit(node)
        last_expr_index = -1
        for i in range(len(node.body) - 1, -1, -1):
            stmt = node.body[i]
            # Only a bare expression statement can be wrapped in print();
            # assignments, imports, defs, etc. must be left alone
            if not isinstance(stmt, ast.Expr):
                break
            # Is it already a print() call?
            is_print_stmt = (
                isinstance(stmt.value, ast.Call)
                and isinstance(stmt.value.func, ast.Name)
                and stmt.value.func.id == 'print'
            )
            # Is it a module docstring?
            is_docstring = (
                isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)
                and i == 0
            )
            if not is_print_stmt and not is_docstring:
                last_expr_index = i
                break
        # Found an expression whose value should be printed
        if last_expr_index != -1:
            expr_node = node.body[last_expr_index]
            print_call = ast.Call(
                func=ast.Name(id='print', ctx=ast.Load()),
                args=[expr_node.value],
                keywords=[],
            )
            print_stmt = ast.Expr(value=print_call)
            ast.copy_location(print_stmt, expr_node)
            node.body[last_expr_index] = print_stmt
            self.modified = True
        return node
The main execution flow of this code is:
- Parse: use Python's built-in ast library and ast.parse() to turn the code string generated by the model into an AST object.
- Traverse the tree: walk the list of top-level statement nodes of the AST (node.body).
- Locate the last expression statement: find the last statement node of type ast.Expr.
- Check whether print() is missing: determine whether that expression statement is already a print() call, or a docstring constant.
- Wrap: if it is an ordinary expression whose result should be printed, wrap its AST node in a new print() call node and replace the original expression statement node.
- Generate new code and execute: use ast.unparse() to convert the modified AST back into a Python source string, or compile and execute it directly.
In this way, even if the model does not write print() itself, the backend can locate the final expression and fill in print() dynamically, ensuring that the calculation result is captured and returned correctly.
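For a concrete feel of the parse, wrap, and unparse pipeline, here is a condensed, self-contained sketch of the same idea (a simplification for illustration, with the docstring check omitted; it is not the production transformer):

```python
import ast

def add_auto_print(src):
    """Wrap the last bare expression of src in print() (simplified sketch)."""
    tree = ast.parse(src)
    last = tree.body[-1]
    # Only wrap a bare expression that is not already a print() call
    if isinstance(last, ast.Expr) and not (
        isinstance(last.value, ast.Call)
        and isinstance(last.value.func, ast.Name)
        and last.value.func.id == "print"
    ):
        tree.body[-1] = ast.Expr(
            value=ast.Call(
                func=ast.Name(id="print", ctx=ast.Load()),
                args=[last.value],
                keywords=[],
            )
        )
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

print(add_auto_print("result = 2 + 2\nresult"))
# prints:
# result = 2 + 2
# print(result)
```

Note that ast.unparse() requires Python 3.9 or later; on older versions you would compile the tree directly instead of converting it back to source.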
1.3 New problems encountered later
After going live for a while, I found that PHP had serious problems handling quotes and line breaks in Python code. The backend splices the model's code into a Python template string, and since that code often contains special characters (single quotes, double quotes, backslashes) or complex multi-line structures, the string would terminate prematurely or produce syntax errors. This caught me off guard: in my previous PHP work I had never embedded Python source inside PHP strings, so I had not anticipated the problem.
To solve this once and for all, I decided to Base64-encode the user's Python code in PHP first, ensuring that the string handed to the Python environment arrives intact, untouched by special-character escaping.
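Base64 is the same standard on both sides: PHP's base64_encode output decodes byte-for-byte with Python's base64.b64decode. A quick round-trip sketch (both directions shown in Python here for testability):

```python
import base64

# A snippet full of characters that would break naive string splicing
code = 'print("it\'s tricky")\nprint(2 + 2)'

b64 = base64.b64encode(code.encode("utf-8")).decode("ascii")
restored = base64.b64decode(b64).decode("utf-8")

assert restored == code  # quotes and newlines survive the round trip intact
print(b64)
```

Because the Base64 alphabet contains no quotes, backslashes, or newlines, the encoded string can be safely embedded in any template without escaping.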
1.4 Current solution: Use Base64 to maintain code integrity, and then use AST to automatically complete print()
After receiving the Python script submitted by the model, the backend encodes it with PHP's built-in base64_encode function:
$b64 = base64_encode($user_python_code);
The Python-side template no longer wraps the user script in triple quotes. Instead it receives the Base64 string, restores the original code with Python's built-in base64.b64decode(), and then runs it through the AST-based auto-print logic developed earlier:
import ast
import base64
import traceback

class AutoPrintTransformer(ast.NodeTransformer):
    def __init__(self):
        self.has_modified = False

    def visit_Module(self, node):
        self.generic_visit(node)
        last_expr = None
        # Walk backwards to find the last bare expression (skipping string constants)
        for i in range(len(node.body) - 1, -1, -1):
            stmt = node.body[i]
            if isinstance(stmt, ast.Expr) and not (
                isinstance(stmt.value, ast.Constant) and isinstance(stmt.value.value, str)
            ):
                last_expr = (i, stmt)
                break
        if last_expr:
            idx, expr = last_expr
            is_print = (
                isinstance(expr.value, ast.Call)
                and isinstance(expr.value.func, ast.Name)
                and expr.value.func.id == 'print'
            )
            if not is_print:
                print_node = ast.Expr(
                    value=ast.Call(
                        func=ast.Name(id='print', ctx=ast.Load()),
                        args=[expr.value],
                        keywords=[],
                    )
                )
                ast.copy_location(print_node, expr)
                node.body[idx] = print_node
                self.has_modified = True
        return node

def transform_code(src):
    try:
        tree = ast.parse(src)
        transformer = AutoPrintTransformer()
        new_tree = transformer.visit(tree)
        ast.fix_missing_locations(new_tree)
        if transformer.has_modified:
            code_obj = compile(new_tree, '<string>', 'exec')
        else:
            code_obj = compile(src, '<string>', 'exec')
    except Exception as e:
        # Fall back to the raw source if parsing or transforming fails; only
        # the transform is inside the try, so user code is never run twice
        print(f"Error while processing code: {e}")
        traceback.print_exc()
        code_obj = compile(src, '<string>', 'exec')
    exec(code_obj, globals())

# Decode from Base64 to recover the original user code
user_code = base64.b64decode('<Base64_encoded_string>').decode('utf-8')
transform_code(user_code)
After this solution went live, the accidental string truncation between the PHP and Python layers disappeared, and Python code with complex multi-line structures executes correctly.
2. Matplotlib drawing results cannot be displayed
To enhance the code execution capability, I deliberately baked the matplotlib library into the Docker image, expecting the model to generate data visualization charts. In testing, the model did generate standard plotting code (such as drawing a sine wave) and duly called plt.show(), yet after execution no image appeared in the front-end chat interface. On reflection, this is expected: plt.show() tries to open a window in a graphical (GUI) environment. Inside the Docker sandbox there is no display or window system, so in a purely headless environment plt.show() fails silently or raises an error.
Some readers may be wondering (that reader was me): can't we just tell the model in the prompt to save the image? But that raises a new problem: even if the model calls plt.savefig('plot.png'), the file exists only inside the container and cannot be reached from outside. Worse, once the code finishes and the container is destroyed, the data is gone.
2.1 Solution: inject code dynamically, then traverse and extract the figures
Pre-injection: before executing the model's code, inject a patch at the top that switches matplotlib to a non-interactive backend and tracks every Figure object created:
import os
import uuid
import matplotlib
matplotlib.use('Agg')  # Use a non-interactive backend
import matplotlib.pyplot as plt

fig_list = []  # Used to keep track of all created figures
orig_figure = plt.figure  # Save the original plt.figure function

def track_figure(*args, **kwargs):
    """Create a figure via the original plt.figure and remember it."""
    fig = orig_figure(*args, **kwargs)
    fig_list.append(fig)
    return fig

plt.figure = track_figure
Post-injection: after the model's code finishes, iterate over all tracked figure objects, save each as a PNG file, transfer them to a host directory, and convert the paths into publicly accessible image links:
...
saved_paths = []
if 'fig_list' in globals() and fig_list:
    output_dir = "/output"  # Output directory
    os.makedirs(output_dir, exist_ok=True)
    for i, fig in enumerate(fig_list):
        unique_id = str(uuid.uuid4())[:8]  # Generate a short unique ID
        filename = os.path.join(output_dir, f"chart_{i}_{unique_id}.png")  # Construct file name
        try:
            fig.tight_layout()
        except Exception:
            pass
        try:
            fig.savefig(filename, dpi=100, bbox_inches='tight')  # Save image
            saved_paths.append(filename)  # Add path to the list
        except Exception as e:
            print(f"Error saving figure: {e}")
        finally:
            plt.close(fig)  # Close the figure
if saved_paths:
    for img_path in saved_paths:
        print(f"###IMAGE_FILE_PATH###{img_path}")
else:
    print("No chart.")
plt.close('all')
Model-side handling: the sentinel string ###IMAGE_FILE_PATH### tells the model where an image was produced, and the prompt instructs it in natural language to emit the chart as a Markdown image link:
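If you prefer not to rely on the model for this conversion, the backend can also rewrite the marker lines deterministically before handing the output back. A sketch under assumed names (render_stdout, url_base, and the example URL are hypothetical, not my actual implementation):

```python
MARKER = "###IMAGE_FILE_PATH###"

def render_stdout(stdout, url_base="https://example.com/output"):
    """Replace sandbox marker lines with Markdown image links (hypothetical url_base)."""
    lines = []
    for line in stdout.splitlines():
        if line.startswith(MARKER):
            # Extract the container path and map the file name onto the public URL
            path = line[len(MARKER):].strip()
            name = path.rsplit("/", 1)[-1]
            lines.append(f"![chart]({url_base}/{name})")
        else:
            lines.append(line)
    return "\n".join(lines)

print(render_stdout("###IMAGE_FILE_PATH###/output/chart_0_ab12cd34.png"))
```

Doing the rewrite in the backend makes the chart display independent of how faithfully the model follows the prompt.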

After a series of processing, the image drawn by the model can be displayed to the user on the front end.
As shown in the figure above, the user asked the model to draw a regular tetrahedron. The model understood the intent, produced a clean plot, and the backend's post-processing logic successfully published it to the public-facing directory.
3. Summary
No matter how the temperature parameter is set, the model may still omit print(), and matplotlib figures may still fail to display on their own. It falls to the developer to design matching pre-processing, execution, and post-processing mechanisms in the backend: mechanisms that compensate for these limitations in Agent development and quietly "clean up" after the model.