Running Gemma 3 + Ollama on Colab: A Quick Start Guide for Developers

Running a cutting-edge AI model now takes nothing more than a single GPU. This guide will help you get started quickly with Gemma 3 and Ollama.
Core content:
1. Gemma 3's state-of-the-art performance and multimodal capabilities
2. How Ollama simplifies the deployment of large language models
3. Steps to quickly configure and run Gemma 3 in the Colab environment
Ever wanted to run a State of the Art (SoTA) model on a single GPU?
Google's latest release, Gemma 3, is pushing the performance boundaries of open language models, and Ollama makes it easy for developers to get these models running.
This guide will combine the power of Gemma 3 with the ease of use of Ollama in a Google Cloud Colab Enterprise environment through a hands-on demonstration.
You will learn how to efficiently deploy this Google model on a single GPU in a Colab notebook. Beyond basic text generation, this article also explores practical techniques such as enabling multimodal capabilities (image input) and using streaming responses for interactive applications.
Whether you are prototyping the next generation of AI tools or simply curious about cutting-edge AI, you can follow this guide to get started quickly.
Why choose Gemma 3?
The brilliance of Gemma 3 is that it is an open source model yet provides state-of-the-art (SoTA) performance.
It has also entered the multimodal realm and is now able to understand visual-linguistic input (i.e. process images and text simultaneously) and generate text output. Its key features include:
Ultra-long context window: supports 128K tokens, so the model can handle much longer context
Multilingual understanding: covers more than 140 languages
Performance improvements: stronger at mathematical calculation, logical reasoning, and conversation
Structured output and function calling: makes it easier to build complex applications
Official quantized versions: reduce model size and compute requirements while maintaining high accuracy, so more developers can run powerful models without a supercomputer
Gemma 3 is available in several sizes, from 1B to 27B parameters (the exact variants are listed in the Ollama section below).
In Chatbot Arena Elo scores, Gemma 3 delivers top-tier (SoTA) performance for an open model while still running on a single GPU, which greatly lowers the barrier to entry for developers.
Why choose Ollama?
Deploying and running traditional large language models (LLMs) is often a headache: complex dependencies and resource-hungry models.
Ollama removes most of these obstacles.
This means developers can experiment with open source models like Gemma more quickly in a variety of environments, whether it’s a development machine, a server, or a cloud instance like Google Cloud Colab Enterprise.
This simplified approach allows developers to break free from conventional constraints and iterate and explore various open source models more efficiently.
Gemma 3 is available in four versions on the Ollama platform:
• 1 billion parameter model: ollama run gemma3:1b
• 4 billion parameter model: ollama run gemma3:4b
• 12 billion parameter model: ollama run gemma3:12b
• 27 billion parameter model: ollama run gemma3:27b
Developers can pull a model and start interacting with it with a single command, which dramatically lowers the barrier to hands-on exploration of LLMs. Let's see how to get up and running in a few minutes.
Quick Start: Configuring a Colab Environment on the Cloud
First, you need to set up the Colab development environment. You have two options:
1. Vertex AI Google Colab Enterprise (Recommended)
Using Vertex AI Colab Enterprise custom runtime is an ideal starting point. This solution has the following advantages:
Flexible accelerator selection: high-performance GPUs such as the A100 or H100 can be configured for large models or compute-intensive tasks
Enterprise-grade security: built-in professional-level security features
Steps to configure a custom runtime:
Define a runtime template: specify hardware requirements (e.g. select an A100 GPU)
Create a runtime instance: create an instance from the template
Connect your notebook: attach your development environment to the runtime you created
2. Use Google Colab (free version)
The free version of Google Colab is a great choice for quick prototyping and getting started. While its GPU options are more limited than Colab Enterprise (unless you connect to a Google Cloud runtime), it still provides enough compute power for most common tasks, including running a medium-sized Gemma 3 model.
Google Colab configuration steps :
Launch Colab: go to Google Colab and create a new notebook
Configure the GPU runtime:
Click "Runtime" in the top menu bar
Select "Change runtime type"
Select "T4 GPU" from the "Hardware accelerator" drop-down menu
Click "Save". Colab may restart the runtime to apply the changes; this is completely normal!
How to run Gemma 3 + Ollama
Now that the environment is configured, let's start Ollama to run Gemma 3. Ollama makes this process extremely simple, and the following steps are detailed:
1. Install necessary dependencies
Installing Ollama directly in Colab may trigger a warning. This can be avoided by installing the following packages first:
pciutils: PCI device detection tool
lshw: hardware inventory viewer
These hardware detection tools resolve the compatibility warning.
!sudo apt update && sudo apt install pciutils lshw
2. Install Ollama
Execute the following command to obtain and run the Ollama installation script, which will download and configure the Ollama service in the Colab instance:
!curl -fsSL https://ollama.com/install.sh | sh
The script automatically completes a local installation of Ollama.
3. Start the local Ollama service
Now we need to start the Ollama service process in the background. The command below combines a few pieces:
nohup: keeps the process running even after the launching shell goes away
> ollama.log 2>&1: redirects all output to the ollama.log log file
&: runs the command in the background, ensuring the service keeps running even if the initial connection is lost
!nohup ollama serve > ollama.log 2>&1 &
Note : Please wait for a few seconds after executing this module to allow the service to fully initialize before continuing with subsequent operations.
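If you would rather not guess how long to wait, you can poll the service until it answers. The snippet below is a minimal sketch that assumes Ollama's default local address, http://127.0.0.1:11434, and uses only the Python standard library:

import time
import urllib.request

def wait_for_ollama(url="http://127.0.0.1:11434", timeout=30):
    """Poll the local Ollama server until it responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url)  # The root endpoint responds once the server is up
            print("Ollama server is ready.")
            return True
        except Exception:
            time.sleep(1)  # Not up yet; retry shortly
    print("Timed out waiting for Ollama; check ollama.log for details.")
    return False

wait_for_ollama()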
4. Run the Gemma 3 model
Once the Ollama service is running, you can start the main operation. Let's run the Gemma 3 12B model with a simple prompt:
!ollama run gemma3:12b "What is the capital of the Netherlands?"
Important: when running a specific model (such as gemma3:12b) for the first time, Ollama needs to download the model weights. The download time depends on the model size and network speed; subsequent runs will be much faster.
Command breakdown:
ollama run gemma3:12b: instructs Ollama to run the specified model
"What is the capital of the Netherlands?": the prompt sent to the model
Status indicators:
You may see spinner symbols (such as ⠙ ⠹ ⠸ …) while the model loads. This is a normal progress indicator, not an error.
Example output after a successful run:
⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏ ...
The capital of the Netherlands is **Amsterdam**.
However, it's a bit complicated!
While Amsterdam is the capital and the largest city, **The Hague (Den Haag)** is the seat of the government and home to the Supreme Court and other important institutions.
So, it depends on what you mean by "capital."
If your hardware allows (especially in a Colab Enterprise environment), you can try the larger 27B version: simply replace gemma3:12b with gemma3:27b in the command above.
⚠️ Please note:
The 27B model requires significantly more compute resources
It is recommended only for environments with high-performance GPUs (such as the A100/H100)
The model download takes longer on the first run
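If you would rather not sit through the download during an interactive session, you can fetch the weights ahead of time with Ollama's pull command and only start prompting once it finishes:
!ollama pull gemma3:27b
Subsequent ollama run calls will then load the already-downloaded model immediately.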
Exploring the multimodal capabilities of Gemma 3
The standout feature of Gemma 3 is its multimodal support: it can process combined image and text input and generate text output.
The images are normalized to 896 x 896 resolution and encoded with 256 tokens per image. Let’s see how to implement multimodal features with Gemma 3 and Ollama.
The steps to use Ollama in Colab are as follows:
Upload image: Upload an image (e.g. picture.png) through the Colab file panel.
Run and specify image path: Include the path to the uploaded image directly in the prompt.
!ollama run gemma3:12b "Describe what's happening in this image: /content/picture.png"
Tip: When Ollama successfully loads an image, it will output a message similar to Added image '/content/picture.png'. If the image analysis fails, please double-check that the file path is correct!
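You can also do the same thing from Python instead of the command line. The ollama library (installed in the next section) accepts image paths alongside the prompt; the following is a minimal sketch that assumes its images parameter takes a list of local file paths, using the same uploaded picture:

import ollama

try:
    response = ollama.generate(
        model="gemma3:12b",
        prompt="Describe what's happening in this image.",
        images=["/content/picture.png"]  # Image uploaded via the Colab file panel
    )
    print(response["response"])
except Exception as e:
    print(f"An error occurred: {e}")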
Using Gemma 3 in Python
While command line interaction is suitable for testing, in actual development, you may prefer to call Gemma 3 programmatically. Ollama's Python library makes this very simple.
Install
Install the library using pip:
!pip install ollama
Basic text generation
Here is some sample code to generate text in a Python script:
import ollama

try:
    response = ollama.generate(
        model="gemma3:12b",
        prompt="What is Friesland?"
    )
    print(response["response"])
except Exception as e:
    print(f"An error occurred: {e}")
    # Optional: check the ollama.log file for server-side issues
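If you need more control over the output, generate also accepts an options dictionary. The sketch below passes a lower temperature and a token limit; the option names (temperature, num_predict) follow Ollama's standard model parameters, so treat the exact values as a starting point to tune:

import ollama

response = ollama.generate(
    model="gemma3:12b",
    prompt="Give me three facts about Friesland.",
    options={
        "temperature": 0.2,   # Lower values give more deterministic output
        "num_predict": 200    # Cap the number of tokens generated
    }
)
print(response["response"])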
Streaming Responses
For longer generation tasks or interactive applications, returning responses token by token in a streaming manner can significantly improve the user experience:
import ollama

try:
    client = ollama.Client()
    stream = client.generate(
        model="gemma3:12b",
        prompt="Explain the theory of relativity in one concise paragraph.",
        stream=True
    )
    print("Gemma 3 Streaming: ")
    for chunk in stream:
        print(chunk['response'], end='', flush=True)
    print()  # Newline after streaming finishes
except Exception as e:
    print(f"An error occurred during streaming: {e}")
The response is printed incrementally, token by token, as it is generated.
Building a simple chatbot
Using client.chat() with a running message history, you can build multi-turn conversations. The following example builds a basic command-line chat interface:
import ollama

try:
    client = ollama.Client()
    messages = []  # Stores the conversation history
    print("Starting chat with Gemma 3 (type 'exit' to quit)")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            print("Exiting chat.")
            break

        # Append user message to history
        messages.append({
            'role': 'user',
            'content': user_input
        })

        # Get streaming response from the model
        response_stream = client.chat(
            model='gemma3:12b',
            messages=messages,
            stream=True
        )

        print("Gemma: ", end="")
        full_assistant_response = ""

        # Process and print the streamed response
        for chunk in response_stream:
            token = chunk['message']['content']
            print(token, end='', flush=True)
            full_assistant_response += token
        print()  # Newline after assistant finishes

        # Append assistant's full response to history
        messages.append({
            'role': 'assistant',
            'content': full_assistant_response
        })
except Exception as e:
    print(f"\nAn error occurred during chat: {e}")
Summary
So far, you have mastered:
Running Gemma 3 via Ollama in Google Colab
Interacting with the model via the command line and Python
Processing text and image input
Building streaming responses and basic chat applications
Next steps
This guide focused on running Gemma 3 on Colab via Ollama. From here, you might want to explore:
Fine-tuning Gemma models in depth
☁️ Scalable deployment on Google Cloud infrastructure