Ollama

The Ollama provider connects Iris to locally running or cloud-hosted Ollama instances. Run models like Llama 3.1, Mistral, and Code Llama on your own hardware for privacy, cost savings, and offline capabilities.

package main

import (
    "context"
    "fmt"

    "github.com/petal-labs/iris/core"
    "github.com/petal-labs/iris/providers/ollama"
)

func main() {
    provider := ollama.NewLocal()
    client := core.NewClient(provider)

    resp, err := client.Chat("llama3.1").
        System("You are a helpful assistant.").
        User("What is the capital of France?").
        GetResponse(context.Background())
    if err != nil {
        panic(err)
    }

    fmt.Println(resp.Output)
}

Before using the Ollama provider, install Ollama on your system:

# Using Homebrew
brew install ollama
# Or download from ollama.com
curl -fsSL https://ollama.com/install.sh | sh

Then start the Ollama service:

# Start Ollama server
ollama serve
# Pull a model
ollama pull llama3.1

# No configuration needed for local instances
# Default: http://localhost:11434
# Optional: override the host
export OLLAMA_HOST=http://192.168.1.100:11434
import "github.com/petal-labs/iris/providers/ollama"
// Local instance (default: http://localhost:11434)
provider := ollama.NewLocal()
// Local instance with custom host via OLLAMA_HOST env var
os.Setenv("OLLAMA_HOST", "http://192.168.1.100:11434")
provider := ollama.NewLocal()
// Cloud instance from OLLAMA_API_KEY
provider, err := ollama.NewCloudFromEnv()
// Manual configuration
provider := ollama.New(
ollama.WithBaseURL("http://my-server:11434"),
ollama.WithCloud(),
ollama.WithAPIKey("..."),
)

| Option | Description | Default |
| --- | --- | --- |
| WithBaseURL(url) | Override the API base URL | http://localhost:11434 |
| WithCloud() | Enable cloud mode (adds auth headers) | Disabled |
| WithAPIKey(key) | Set the API key for cloud mode | None |
| WithHTTPClient(client) | Use a custom *http.Client | Default client |
| WithHeader(key, value) | Add a custom HTTP header | None |
| WithTimeout(duration) | Set the request timeout | 120 seconds |

provider := ollama.New(
    ollama.WithBaseURL("http://gpu-server:11434"),
    ollama.WithTimeout(5 * time.Minute),
)
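
The remaining options cover transport customization. A minimal sketch combining them (the client settings and header name here are illustrative, not required by the provider):

// Bring your own *http.Client and attach an extra header to every request.
httpClient := &http.Client{Timeout: 2 * time.Minute}

provider := ollama.New(
    ollama.WithHTTPClient(httpClient),
    ollama.WithHeader("X-Request-Source", "my-app"), // hypothetical header name
)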

| Feature | Supported | Notes |
| --- | --- | --- |
| Chat | Yes | All Ollama models |
| Streaming | Yes | Real-time token streaming |
| Tool calling | Yes | Model-dependent |
| Vision | Yes | LLaVA, BakLLaVA, etc. |
| Reasoning | Yes | DeepSeek-R1, etc. |
| Image generation | No | Not supported |
| Embeddings | Yes | nomic-embed-text, all-minilm |

| Model | Parameters | Context | Best For |
| --- | --- | --- | --- |
| llama3.1 | 8B | 128K | General purpose |
| llama3.1:70b | 70B | 128K | Complex tasks |
| llama3.2 | 1B/3B | 128K | Fast, lightweight |
| mistral | 7B | 32K | Balanced performance |
| mixtral | 8x7B | 32K | High-quality MoE |
| codellama | 7B/13B/34B | 16K | Code generation |
| deepseek-coder-v2 | 16B/236B | 128K | Advanced coding |
| phi3 | 3.8B | 128K | Microsoft’s compact model |
| gemma2 | 2B/9B/27B | 8K | Google’s open model |
| qwen2.5 | 0.5B-72B | 128K | Alibaba’s multilingual model |

Vision models:

| Model | Parameters | Best For |
| --- | --- | --- |
| llava | 7B/13B | Image analysis |
| llava-llama3 | 8B | Vision + Llama 3 |
| bakllava | 7B | Image understanding |
| llava-phi3 | 3.8B | Lightweight vision |

Reasoning models:

| Model | Parameters | Best For |
| --- | --- | --- |
| deepseek-r1 | 1.5B-671B | Step-by-step reasoning |
| qwq | 32B | Mathematical reasoning |

Embedding models:

| Model | Dimensions | Best For |
| --- | --- | --- |
| nomic-embed-text | 768 | General embeddings |
| all-minilm | 384 | Lightweight embeddings |
| mxbai-embed-large | 1024 | High quality |

Send a chat request with a system prompt and generation parameters:

resp, err := client.Chat("llama3.1").
    System("You are a helpful coding assistant.").
    User("Write a function to reverse a string in Go.").
    Temperature(0.7).
    MaxTokens(500).
    GetResponse(ctx)
if err != nil {
    log.Fatal(err)
}
fmt.Println(resp.Output)

Stream responses for real-time output:

stream, err := client.Chat("llama3.1").
    System("You are a helpful assistant.").
    User("Explain Go's concurrency model.").
    GetStream(ctx)
if err != nil {
    log.Fatal(err)
}

for chunk := range stream.Ch {
    fmt.Print(chunk.Content)
}
fmt.Println()

// Check for streaming errors
if err := <-stream.Err; err != nil {
    log.Fatal(err)
}

Use vision models like LLaVA for image analysis:

// First, pull a vision model
// ollama pull llava
imageData, err := os.ReadFile("photo.png")
if err != nil {
    log.Fatal(err)
}
base64Data := base64.StdEncoding.EncodeToString(imageData)

resp, err := client.Chat("llava").
    UserMultimodal().
    Text("What's in this image?").
    ImageBase64(base64Data, "image/png").
    Done().
    GetResponse(ctx)
if err != nil {
    log.Fatal(err)
}
fmt.Println(resp.Output)

// Multiple images in one message (image1Data and image2Data are
// base64-encoded strings produced the same way as above)
resp, err = client.Chat("llava-llama3").
    UserMultimodal().
    Text("Compare these two images.").
    ImageBase64(image1Data, "image/png").
    ImageBase64(image2Data, "image/png").
    Done().
    GetResponse(ctx)

Use reasoning models for step-by-step problem solving:

// Pull a reasoning model
// ollama pull deepseek-r1:8b
resp, err := client.Chat("deepseek-r1:8b").
    User("Solve this step by step: If x + 5 = 12, what is x?").
    GetResponse(ctx)
if err != nil {
    log.Fatal(err)
}

// DeepSeek R1 shows its reasoning in the output
fmt.Println(resp.Output)

Ollama supports tool calling with compatible models:

weatherTool := core.Tool{
    Name:        "get_weather",
    Description: "Get current weather for a location",
    Parameters: map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "location": map[string]interface{}{
                "type":        "string",
                "description": "City name",
            },
        },
        "required": []string{"location"},
    },
}

// Use a tool-capable model
resp, err := client.Chat("llama3.1").
    User("What's the weather in Tokyo?").
    Tools(weatherTool).
    GetResponse(ctx)
if err != nil {
    log.Fatal(err)
}

if len(resp.ToolCalls) > 0 {
    // Handle the tool call
    call := resp.ToolCalls[0]
    result := getWeather(call.Arguments)

    // Continue the conversation with the tool result
    finalResp, err := client.Chat("llama3.1").
        User("What's the weather in Tokyo?").
        Tools(weatherTool).
        Assistant(resp.Output).
        ToolCall(call.ID, call.Name, call.Arguments).
        ToolResult(call.ID, result).
        GetResponse(ctx)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(finalResp.Output)
}
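
The getWeather helper above is not part of Iris. A minimal sketch of what it could look like, assuming call.Arguments arrives as a JSON-encoded string (check the core package for the actual type) and returning a canned result instead of calling a real weather API:

// Hypothetical helper for the example above. Assumes the tool call
// arguments are a JSON string; requires the encoding/json import.
func getWeather(arguments string) string {
    var args struct {
        Location string `json:"location"`
    }
    if err := json.Unmarshal([]byte(arguments), &args); err != nil {
        return `{"error":"invalid arguments"}`
    }
    // A real implementation would call a weather service here.
    return fmt.Sprintf(`{"location":%q,"temperature_c":21,"conditions":"clear"}`, args.Location)
}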

Generate embeddings for RAG and semantic search:

// Pull an embedding model
// ollama pull nomic-embed-text
resp, err := provider.Embeddings(ctx, &core.EmbeddingRequest{
    Model: "nomic-embed-text",
    Input: []core.EmbeddingInput{
        {Text: "Go is a statically typed language."},
        {Text: "Python is dynamically typed."},
    },
})
if err != nil {
    log.Fatal(err)
}

for i, emb := range resp.Embeddings {
    fmt.Printf("Embedding %d: %d dimensions\n", i, len(emb.Values))
}
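
To use these vectors for semantic search, rank texts by cosine similarity against a query embedding. A minimal sketch, assuming emb.Values is a slice of floats (written generically since the element type may be float32 or float64) and requiring the math import:

// Cosine similarity between two embedding vectors; higher means more similar.
func cosineSimilarity[F ~float32 | ~float64](a, b []F) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        normA += float64(a[i]) * float64(a[i])
        normB += float64(b[i]) * float64(b[i])
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// Compare the two embeddings generated above.
score := cosineSimilarity(resp.Embeddings[0].Values, resp.Embeddings[1].Values)
fmt.Printf("similarity: %.3f\n", score)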

// List locally installed models
models, err := provider.ListModels(ctx)
if err != nil {
    log.Fatal(err)
}
for _, model := range models {
    fmt.Printf("%s (%s)\n", model.Name, model.Size)
}

// Pull a model, reporting download progress
err := provider.PullModel(ctx, "llama3.1:70b", func(progress ollama.PullProgress) {
    fmt.Printf("Downloading: %.1f%%\n", progress.Percent)
})
if err != nil {
    log.Fatal(err)
}

// Delete a model you no longer need
err := provider.DeleteModel(ctx, "old-model")
if err != nil {
    log.Fatal(err)
}

Fine-tune model behavior:

resp, err := client.Chat("llama3.1").
    User("Write a creative story.").
    Temperature(0.9).       // Higher for creativity
    TopP(0.95).             // Nucleus sampling
    TopK(40).               // Top-k sampling
    RepetitionPenalty(1.1). // Reduce repetition
    NumPredict(500).        // Max tokens
    GetResponse(ctx)

// Use more context for long conversations
resp, err := client.Chat("llama3.1").
    User(longPrompt).
    NumCtx(8192). // Increase context window
    GetResponse(ctx)

// First turn
resp1, _ := client.Chat("llama3.1").
    System("You are a helpful Go tutor.").
    User("What is a goroutine?").
    GetResponse(ctx)

// Second turn with history
resp2, _ := client.Chat("llama3.1").
    System("You are a helpful Go tutor.").
    User("What is a goroutine?").
    Assistant(resp1.Output).
    User("How is it different from a thread?").
    GetResponse(ctx)
fmt.Println(resp2.Output)

// Connect to Ollama on another machine
provider := ollama.New(
    ollama.WithBaseURL("http://gpu-server.local:11434"),
    ollama.WithTimeout(5 * time.Minute),
)

docker-compose.yml

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

volumes:
  ollama:

// Connect to Docker-hosted Ollama
provider := ollama.New(
    ollama.WithBaseURL("http://localhost:11434"),
)

Ollama automatically uses GPU when available:

# Check GPU status
ollama run llama3.1 --verbose
# Look for: "using CUDA" or "using Metal"

// For large models, increase keep-alive
resp, err := client.Chat("llama3.1:70b").
    User(prompt).
    KeepAlive("30m"). // Keep model in memory
    GetResponse(ctx)

// Handle common errors
resp, err := client.Chat("llama3.1").User(prompt).GetResponse(ctx)
if err != nil {
    // Check if Ollama is running
    if strings.Contains(err.Error(), "connection refused") {
        log.Fatal("Ollama is not running. Start with: ollama serve")
    }

    // Check if the model is available
    if strings.Contains(err.Error(), "model not found") {
        log.Fatal("Model not installed. Run: ollama pull llama3.1")
    }

    var apiErr *core.APIError
    if errors.As(err, &apiErr) {
        log.Printf("API error %d: %s", apiErr.StatusCode, apiErr.Message)
    }

    if errors.Is(err, context.DeadlineExceeded) {
        log.Println("Request timed out - try a smaller model or increase the timeout")
    }
}

| Hardware | Recommended |
| --- | --- |
| 8GB RAM | 3B-7B models |
| 16GB RAM | 7B-13B models |
| 32GB+ RAM | 13B-70B models |
| GPU with 8GB VRAM | 7B-13B models |
| GPU with 24GB+ VRAM | 70B+ models |

# Use quantized models for better performance
ollama pull llama3.1:8b-instruct-q4_0   # 4-bit quantization
ollama pull llama3.1:8b-instruct-q8_0   # 8-bit quantization

// Keep model in GPU memory between requests
resp, err := client.Chat("llama3.1").
    User(prompt).
    KeepAlive("1h").
    GetResponse(ctx)

// Process multiple prompts efficiently
var wg sync.WaitGroup
for _, prompt := range prompts {
    wg.Add(1)
    go func(p string) {
        defer wg.Done()
        resp, err := client.Chat("llama3.1").User(p).GetResponse(ctx)
        if err != nil {
            log.Println(err)
            return
        }
        fmt.Println(resp.Output) // Process the response
    }(prompt)
}
wg.Wait()

// Use fast models for development
provider := ollama.NewLocal()
client := core.NewClient(provider)

// Quick iteration with a small model
resp, err := client.Chat("llama3.2:3b"). // Fast for testing
    User(prompt).
    GetResponse(ctx)

// More robust configuration for production
provider := ollama.New(
    ollama.WithBaseURL(os.Getenv("OLLAMA_HOST")),
    ollama.WithTimeout(5 * time.Minute),
)
client := core.NewClient(provider,
    core.WithRetryPolicy(&core.RetryPolicy{
        MaxRetries:      3,
        InitialInterval: 1 * time.Second,
        MaxInterval:     30 * time.Second,
    }),
)

// Use Ollama locally, fall back to a cloud provider
localProvider := ollama.NewLocal()
cloudProvider, _ := openai.NewFromEnv()

// Try local first
resp, err := core.NewClient(localProvider).
    Chat("llama3.1").
    User(prompt).
    GetResponse(ctx)
if err != nil {
    // Fall back to OpenAI
    resp, err = core.NewClient(cloudProvider).
        Chat("gpt-4o-mini").
        User(prompt).
        GetResponse(ctx)
}
  • NewLocal() checks the OLLAMA_HOST env var before defaulting to localhost:11434
  • No API key is required for local instances
  • Cloud mode adds an Authorization: Bearer header to requests
  • Models must be pulled before use with ollama pull <model>
  • The provider is safe for concurrent use after construction
  • GPU acceleration is automatic when available

  • Tools Guide: Learn tool calling with local models.
  • Images Guide: Work with vision models.
  • Providers Overview: Compare all available providers.