Running Gemma 3 + Ollama on Colab: A Quick Start Guide for Developers

Running a cutting-edge AI model now takes nothing more than a single GPU. This guide will help you get started quickly with Gemma 3 and Ollama.
Core content:
1. Gemma 3's state-of-the-art performance and multimodal capabilities
2. How Ollama simplifies the deployment of large language models
3. Steps to quickly configure and run Gemma 3 in the Colab environment
Ever wanted to run a State of the Art (SoTA) model on a single GPU?
Google's latest release, Gemma 3, is pushing the performance boundaries of open language models, and Ollama makes it easy for developers to get these models running.
This guide will combine the power of Gemma 3 with the ease of use of Ollama in a Google Cloud Colab Enterprise environment through a hands-on demonstration.
You will learn how to efficiently deploy this Google model on a single GPU in a Colab notebook. Beyond basic text generation, this article also explores practical techniques such as enabling multimodal capabilities (image input) and using streaming responses for interactive applications.
Whether you are prototyping the next generation of AI tools or simply curious about cutting-edge AI, you can follow this guide to get started quickly.
Why choose Gemma 3?
The brilliance of Gemma 3 is that it is an open source model yet provides state-of-the-art (SoTA) performance.
It has also entered the multimodal realm and is now able to understand visual-linguistic input (i.e. process images and text simultaneously) and generate text output. Its key features include:
Ultra-long context window: supports 128K tokens, so the model can handle much longer context
Multilingual understanding: covers more than 140 languages
Performance improvements: stronger at mathematical calculation, logical reasoning, and conversation
Structured output and function calling: makes it easier to build complex applications
Official quantized versions: reduce model size and compute requirements while maintaining high accuracy, so more developers can run powerful models without a supercomputer
Gemma 3 is available in several sizes, from 1B to 27B parameters (the exact variants are listed in the Ollama section below).
In Chatbot Arena Elo scores, Gemma 3 delivers top-tier (SoTA) performance for an open model while still running on a single GPU, which greatly lowers the barrier to entry for developers.
Why choose Ollama?
Deploying and running traditional large language models (LLMs) is often a headache: complex dependencies and resource-hungry models.
Ollama removes most of these obstacles.
This means developers can experiment with open source models like Gemma more quickly in a variety of environments, whether it’s a development machine, a server, or a cloud instance like Google Cloud Colab Enterprise.
This simplified approach allows developers to break free from conventional constraints and iterate and explore various open source models more efficiently.
Gemma 3 is available in four versions on the Ollama platform:
• 1 billion parameter model: ollama run gemma3:1b
• 4 billion parameter model: ollama run gemma3:4b
• 12 billion parameter model: ollama run gemma3:12b
• 27 billion parameter model: ollama run gemma3:27b
Developers can pull a model and start interacting with it with a single command, which dramatically lowers the barrier to hands-on exploration of LLMs. Let's see how to get up and running in a few minutes.
Quick Start: Configuring a Colab Environment on the Cloud
First, you need to set up the Colab development environment. You have two options:
1. Vertex AI Google Colab Enterprise (Recommended)
Using Vertex AI Colab Enterprise custom runtime is an ideal starting point. This solution has the following advantages:
Flexible accelerator selection: high-performance GPUs such as the A100 or H100 can be configured for large models or compute-intensive tasks
Enterprise-grade security: built-in professional-level security features
Steps to configure a custom runtime:
Define a runtime template: specify hardware requirements (e.g. select an A100 GPU)
Create a runtime instance: create an instance from the template
Connect your notebook: attach your development environment to the runtime you created
2. Use Google Colab (free version)
The free version of Google Colab is a great choice for quick prototyping and getting started. While its GPU options are more limited than Colab Enterprise (unless you connect to a Google Cloud runtime), it still provides enough compute power for most common tasks, including running a medium-sized Gemma 3 model.
Google Colab configuration steps :
Launch Colab: go to Google Colab and create a new notebook
Configure the GPU runtime:
Click "Runtime" in the top menu bar
Select "Change runtime type"
Select "T4 GPU" from the "Hardware accelerator" drop-down menu
Click "Save". Colab may restart the runtime to apply the changes; this is completely normal!
How to run Gemma 3 + Ollama
Now that the environment is configured, let's start Ollama to run Gemma 3. Ollama makes this process extremely simple, and the following steps are detailed:
1. Install necessary dependencies
Installing Ollama directly in Colab may trigger a warning. This can be avoided by installing the following packages first:
pciutils: PCI device detection tool
lshw: hardware inventory viewer
These hardware detection tools resolve the compatibility warning.
!sudo apt update && sudo apt install pciutils lshw
2. Install Ollama
Execute the following command to obtain and run the Ollama installation script, which will download and configure the Ollama service in the Colab instance:
!curl -fsSL https://ollama.com/install.sh | sh
The script automatically completes a local installation of Ollama.
3. Start the local Ollama service
Now we need to start the Ollama service process in the background. The command below combines a few pieces:
nohup: keeps the process running even after the launching shell goes away
> ollama.log 2>&1: redirects all output to the ollama.log log file
&: runs the command in the background, ensuring the service keeps running even if the initial connection is lost
!nohup ollama serve > ollama.log 2>&1 &
Note : Please wait for a few seconds after executing this module to allow the service to fully initialize before continuing with subsequent operations.
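If you would rather not guess how long to wait, you can poll the service until it answers. The snippet below is a minimal sketch that assumes Ollama's default local address, http://127.0.0.1:11434, and uses only the Python standard library:

import time
import urllib.request

def wait_for_ollama(url="http://127.0.0.1:11434", timeout=30):
    """Poll the local Ollama server until it responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url)  # The root endpoint responds once the server is up
            print("Ollama server is ready.")
            return True
        except Exception:
            time.sleep(1)  # Not up yet; retry shortly
    print("Timed out waiting for Ollama; check ollama.log for details.")
    return False

wait_for_ollama()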
4. Run the Gemma 3 model
Once the Ollama service is running, you can start the main operation. Let's run the Gemma 3 12B model with a simple prompt:
!ollama run gemma3:12b "What is the capital of the Netherlands?"
Important: when running a specific model (such as gemma3:12b) for the first time, Ollama needs to download the model weights. The download time depends on the model size and network speed; subsequent runs will be much faster.
Command breakdown:
ollama run gemma3:12b: instructs Ollama to run the specified model
"What is the capital of the Netherlands?": the prompt sent to the model
Status indicators:
You may see spinner symbols (such as ⠙ ⠹ ⠸ …) while the model loads. This is a normal progress indicator, not an error.
Example output after a successful run:
⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏ ...
The capital of the Netherlands is **Amsterdam**.
However, it's a bit complicated!
While Amsterdam is the capital and the largest city, **The Hague (Den Haag)** is the seat of the government and home to the Supreme Court and other important institutions.
So, it depends on what you mean by "capital."
If your hardware allows (especially in a Colab Enterprise environment), you can try the larger 27B version: simply replace gemma3:12b with gemma3:27b in the command above.
⚠️ Please note:
The 27B model requires significantly more compute resources
It is recommended only for environments with high-performance GPUs (such as the A100/H100)
The model download takes longer on the first run
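If you would rather not sit through the download during an interactive session, you can fetch the weights ahead of time with Ollama's pull command and only start prompting once it finishes:
!ollama pull gemma3:27b
Subsequent ollama run calls will then load the already-downloaded model immediately.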
Exploring the multimodal capabilities of Gemma 3
The standout feature of Gemma 3 is its multimodal support: it can process combined image and text input and generate text output.
The images are normalized to 896 x 896 resolution and encoded with 256 tokens per image. Let’s see how to implement multimodal features with Gemma 3 and Ollama.
The steps to use Ollama in Colab are as follows:
Upload image: Upload an image (e.g. picture.png) through the Colab file panel.
Run and specify image path: Include the path to the uploaded image directly in the prompt.
!ollama run gemma3:12b "Describe what's happening in this image: /content/picture.png"
Tip: When Ollama successfully loads an image, it will output a message similar to Added image '/content/picture.png'. If the image analysis fails, please double-check that the file path is correct!
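You can also do the same thing from Python instead of the command line. The ollama library (installed in the next section) accepts image paths alongside the prompt; the following is a minimal sketch that assumes its images parameter takes a list of local file paths, using the same uploaded picture:

import ollama

try:
    response = ollama.generate(
        model="gemma3:12b",
        prompt="Describe what's happening in this image.",
        images=["/content/picture.png"]  # Image uploaded via the Colab file panel
    )
    print(response["response"])
except Exception as e:
    print(f"An error occurred: {e}")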
Using Gemma 3 in Python
While command line interaction is suitable for testing, in actual development, you may prefer to call Gemma 3 programmatically. Ollama's Python library makes this very simple.
Install
Install the library using pip:
!pip install ollama
Basic text generation
Here is some sample code to generate text in a Python script:
import ollama

try:
    response = ollama.generate(
        model="gemma3:12b",
        prompt="What is Friesland?"
    )
    print(response["response"])
except Exception as e:
    print(f"An error occurred: {e}")
    # Optional: check the ollama.log file for server-side issues
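If you need more control over the output, generate also accepts an options dictionary. The sketch below passes a lower temperature and a token limit; the option names (temperature, num_predict) follow Ollama's standard model parameters, so treat the exact values as a starting point to tune:

import ollama

response = ollama.generate(
    model="gemma3:12b",
    prompt="Give me three facts about Friesland.",
    options={
        "temperature": 0.2,   # Lower values give more deterministic output
        "num_predict": 200    # Cap the number of tokens generated
    }
)
print(response["response"])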
Streaming Responses
For longer generation tasks or interactive applications, returning responses token by token in a streaming manner can significantly improve the user experience:
import ollama

try:
    client = ollama.Client()
    stream = client.generate(
        model="gemma3:12b",
        prompt="Explain the theory of relativity in one concise paragraph.",
        stream=True
    )
    print("Gemma 3 Streaming: ")
    for chunk in stream:
        print(chunk['response'], end='', flush=True)
    print()  # Newline after streaming finishes
except Exception as e:
    print(f"An error occurred during streaming: {e}")
The response is printed incrementally, token by token, as it is generated.
Building a simple chatbot
Using client.chat() with a running message history, you can build multi-turn conversations. The following example builds a basic command-line chat interface:
import ollama

try:
    client = ollama.Client()
    messages = []  # Stores the conversation history
    print("Starting chat with Gemma 3 (type 'exit' to quit)")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            print("Exiting chat.")
            break

        # Append user message to history
        messages.append({
            'role': 'user',
            'content': user_input
        })

        # Get streaming response from the model
        response_stream = client.chat(
            model='gemma3:12b',
            messages=messages,
            stream=True
        )

        print("Gemma: ", end="")
        full_assistant_response = ""

        # Process and print the streamed response
        for chunk in response_stream:
            token = chunk['message']['content']
            print(token, end='', flush=True)
            full_assistant_response += token
        print()  # Newline after assistant finishes

        # Append assistant's full response to history
        messages.append({
            'role': 'assistant',
            'content': full_assistant_response
        })
except Exception as e:
    print(f"\nAn error occurred during chat: {e}")
Summary
So far, you have mastered:
Running Gemma 3 via Ollama in Google Colab
Interacting with the model via the command line and Python
Processing text and image input
Building streaming responses and basic chat applications
Next steps
This guide focused on running Gemma 3 on Colab via Ollama. From here, you might want to explore:
Fine-tuning Gemma models in depth
☁️ Scalable deployment on Google Cloud infrastructure