
Hugging Face

The Hugging Face provider connects Iris to thousands of models hosted on the Hugging Face Inference API. Access open-source models like Llama, Mistral, Falcon, and many more through a unified interface.

```go
package main

import (
    "context"
    "fmt"
    "os"

    "github.com/petal-labs/iris/core"
    "github.com/petal-labs/iris/providers/huggingface"
)

func main() {
    provider := huggingface.New(os.Getenv("HF_TOKEN"))
    client := core.NewClient(provider)

    resp, err := client.Chat("meta-llama/Llama-3.1-70B-Instruct").
        User("Explain the attention mechanism in transformers.").
        GetResponse(context.Background())
    if err != nil {
        panic(err)
    }

    fmt.Println(resp.Output)
}
```
Store your Hugging Face token with the Iris CLI:

```sh
# Store in the encrypted keystore (recommended)
iris keys set huggingface
# Prompts for: Enter API key for huggingface: hf_...
```
import "github.com/petal-labs/iris/providers/huggingface"
// From a token string
provider := huggingface.New("hf_...")
// From HF_TOKEN or HUGGINGFACE_TOKEN environment variable
provider, err := huggingface.NewFromEnv()
if err != nil {
log.Fatal("HF_TOKEN not set:", err)
}
// From the Iris keystore
provider, err := huggingface.NewFromKeystore()
| Option | Description | Default |
| --- | --- | --- |
| `WithBaseURL(url)` | Override the Inference API base URL | Auto-resolved per model |
| `WithHubAPIBaseURL(url)` | Override the Hub API base URL | `https://huggingface.co` |
| `WithHTTPClient(client)` | Use a custom `*http.Client` | Default client |
| `WithHeader(key, value)` | Add a custom HTTP header | None |
| `WithTimeout(duration)` | Set the request timeout | 120 seconds |
```go
provider := huggingface.New("hf_...",
    huggingface.WithTimeout(180 * time.Second),
)
```
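The options compose. A quick sketch; only the option functions come from the table above, while the endpoint URL and header are illustrative:

```go
// Illustrative values; the option functions are from the table above.
provider := huggingface.New("hf_...",
    huggingface.WithBaseURL("https://your-endpoint.endpoints.huggingface.cloud"),
    huggingface.WithHeader("X-Request-Source", "my-app"), // hypothetical header
    huggingface.WithTimeout(180 * time.Second),
)
```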
| Feature | Supported | Notes |
| --- | --- | --- |
| Chat | ✅ | Instruction-tuned models |
| Streaming | ✅ | Real-time token streaming |
| Tool calling | ✅ | Model-dependent |
| Vision | ✅ | Multimodal models |
| Image generation | ❌ | Not supported |
| Embeddings | ❌ | Not supported |

Hugging Face hosts thousands of models. Here are some popular options:

**Llama**

| Model ID | Parameters | Best For |
| --- | --- | --- |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | Complex reasoning |
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | General purpose |
| `meta-llama/Llama-3.2-3B-Instruct` | 3B | Fast, lightweight |
| `meta-llama/Llama-3.2-1B-Instruct` | 1B | Ultra-fast |

**Mistral**

| Model ID | Parameters | Best For |
| --- | --- | --- |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | Balanced performance |
| `mistralai/Mixtral-8x7B-Instruct-v0.1` | 8x7B | High-quality MoE |
| `mistralai/Mistral-Nemo-Instruct-2407` | 12B | Latest Mistral |

**Gemma**

| Model ID | Parameters | Best For |
| --- | --- | --- |
| `google/gemma-2-27b-it` | 27B | Complex tasks |
| `google/gemma-2-9b-it` | 9B | General purpose |
| `google/gemma-2-2b-it` | 2B | Fast inference |

**Phi**

| Model ID | Parameters | Best For |
| --- | --- | --- |
| `microsoft/Phi-3.5-mini-instruct` | 3.8B | Compact, capable |
| `microsoft/Phi-3-medium-4k-instruct` | 14B | Medium tasks |

**Other**

| Model ID | Parameters | Best For |
| --- | --- | --- |
| `Qwen/Qwen2.5-72B-Instruct` | 72B | Multilingual |
| `deepseek-ai/DeepSeek-Coder-V2-Instruct` | Various | Code generation |
| `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` | 70B | NVIDIA-optimized |
Set a system prompt and sampling parameters per request:

```go
resp, err := client.Chat("meta-llama/Llama-3.1-8B-Instruct").
    System("You are a helpful coding assistant.").
    User("Write a function to calculate Fibonacci numbers in Go.").
    Temperature(0.7).
    MaxTokens(500).
    GetResponse(ctx)
if err != nil {
    log.Fatal(err)
}
fmt.Println(resp.Output)
```
Stream tokens as they arrive:

```go
stream, err := client.Chat("meta-llama/Llama-3.1-70B-Instruct").
    System("You are a helpful assistant.").
    User("Explain how transformers work in machine learning.").
    GetStream(ctx)
if err != nil {
    log.Fatal(err)
}
for chunk := range stream.Ch {
    fmt.Print(chunk.Content)
}
fmt.Println()
if err := <-stream.Err; err != nil {
    log.Fatal(err)
}
```
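Long generations can be bounded with a context deadline. A minimal sketch, assuming an in-flight stream stops when its context is cancelled (cancellation behavior is an assumption, not confirmed here):

```go
// Bound a stream with a deadline; cancellation behavior is an assumption.
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()

stream, err := client.Chat("meta-llama/Llama-3.1-8B-Instruct").
    User("Summarize the transformer architecture.").
    GetStream(ctx)
if err != nil {
    log.Fatal(err)
}
for chunk := range stream.Ch {
    fmt.Print(chunk.Content)
}
if err := <-stream.Err; err != nil && !errors.Is(err, context.DeadlineExceeded) {
    log.Fatal(err)
}
```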

Use vision-capable models for image analysis:

```go
imageData, err := os.ReadFile("photo.png")
if err != nil {
    log.Fatal(err)
}
base64Data := base64.StdEncoding.EncodeToString(imageData)

resp, err := client.Chat("meta-llama/Llama-3.2-11B-Vision-Instruct").
    UserMultimodal().
    Text("What's in this image?").
    ImageBase64(base64Data, "image/png").
    Done().
    GetResponse(ctx)
if err != nil {
    log.Fatal(err)
}
fmt.Println(resp.Output)
```

Tool calling support depends on the model:

```go
weatherTool := core.Tool{
    Name:        "get_weather",
    Description: "Get current weather for a location",
    Parameters: map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "location": map[string]interface{}{
                "type":        "string",
                "description": "City name",
            },
        },
        "required": []string{"location"},
    },
}

// Use a model that supports tool calling
resp, err := client.Chat("meta-llama/Llama-3.1-70B-Instruct").
    User("What's the weather in Tokyo?").
    Tools(weatherTool).
    GetResponse(ctx)
if err != nil {
    log.Fatal(err)
}

if len(resp.ToolCalls) > 0 {
    call := resp.ToolCalls[0]
    result := getWeather(call.Arguments)

    finalResp, err := client.Chat("meta-llama/Llama-3.1-70B-Instruct").
        User("What's the weather in Tokyo?").
        Tools(weatherTool).
        Assistant(resp.Output).
        ToolCall(call.ID, call.Name, call.Arguments).
        ToolResult(call.ID, result).
        GetResponse(ctx)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(finalResp.Output)
}
```
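The `getWeather` helper above is left undefined. A hypothetical stub, assuming `call.Arguments` arrives as a JSON string, might look like:

```go
// Hypothetical stub for getWeather; a real implementation would call a
// weather API. Assumes the tool arguments arrive as a JSON string.
func getWeather(arguments string) string {
    var args struct {
        Location string `json:"location"`
    }
    if err := json.Unmarshal([]byte(arguments), &args); err != nil {
        return `{"error":"invalid arguments"}`
    }
    // Canned response for illustration.
    return fmt.Sprintf(`{"location":%q,"temperature_c":21,"conditions":"clear"}`, args.Location)
}
```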
```go
// Select model based on task complexity
func selectModel(task string, complexity string) string {
    switch complexity {
    case "simple":
        return "meta-llama/Llama-3.2-3B-Instruct"
    case "medium":
        return "meta-llama/Llama-3.1-8B-Instruct"
    case "complex":
        return "meta-llama/Llama-3.1-70B-Instruct"
    default:
        return "meta-llama/Llama-3.1-8B-Instruct"
    }
}

model := selectModel("coding", "complex")
resp, err := client.Chat(model).User(prompt).GetResponse(ctx)
```
```go
// List available models (requires Hub API access)
models, err := provider.ListModels(ctx, huggingface.ListModelsOptions{
    Pipeline:  "text-generation",
    Inference: "warm", // Only models ready for inference
})
if err != nil {
    log.Fatal(err)
}
for _, m := range models {
    fmt.Printf("%s (%v downloads)\n", m.ID, m.Downloads)
}
```

For production workloads, use dedicated Inference Endpoints:

```go
// Connect to a dedicated Inference Endpoint
provider := huggingface.New("hf_...",
    huggingface.WithBaseURL("https://your-endpoint.endpoints.huggingface.cloud"),
)

// Use as normal
resp, err := client.Chat(""). // Model is determined by the endpoint
    User(prompt).
    GetResponse(ctx)
```
Carry context across turns by replaying the earlier exchange:

```go
// First turn
resp1, _ := client.Chat("meta-llama/Llama-3.1-8B-Instruct").
    System("You are a helpful programming tutor.").
    User("What is recursion?").
    GetResponse(ctx)

// Second turn with history
resp2, _ := client.Chat("meta-llama/Llama-3.1-8B-Instruct").
    System("You are a helpful programming tutor.").
    User("What is recursion?").
    Assistant(resp1.Output).
    User("Give me a Go example.").
    GetResponse(ctx)
```
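For longer sessions, the history replay can be factored into a helper. A sketch under the builder API shown above; the `turn` type and `ask` function are illustrative, and `*core.Client` is assumed to be what `core.NewClient` returns:

```go
// Illustrative helper that replays accumulated history each turn.
type turn struct{ user, assistant string }

func ask(ctx context.Context, client *core.Client, history []turn, question string) (string, error) {
    b := client.Chat("meta-llama/Llama-3.1-8B-Instruct").
        System("You are a helpful programming tutor.")
    for _, t := range history {
        b = b.User(t.user).Assistant(t.assistant)
    }
    resp, err := b.User(question).GetResponse(ctx)
    if err != nil {
        return "", err
    }
    return resp.Output, nil
}
```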
Inspect the error to distinguish model loading, auth problems, and rate limits:

```go
resp, err := client.Chat(model).User(prompt).GetResponse(ctx)
if err != nil {
    // Check for model loading
    if strings.Contains(err.Error(), "loading") {
        log.Println("Model is loading, please retry in a moment")
        // Implement retry logic
    }

    // Check for rate limits and auth failures
    var apiErr *core.APIError
    if errors.As(err, &apiErr) {
        switch apiErr.StatusCode {
        case 401:
            log.Fatal("Invalid HF token")
        case 403:
            log.Fatal("Token doesn't have Inference API permission")
        case 429:
            log.Printf("Rate limited. Retry after: %s", apiErr.RetryAfter)
        case 503:
            log.Println("Model is loading, retry later")
        }
    }

    if errors.Is(err, context.DeadlineExceeded) {
        log.Println("Request timed out - model may be loading")
    }
}
```
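The `// Implement retry logic` placeholder above could be a simple backoff loop. A sketch, assuming a 503 status means the model is still loading (the `*core.Response` type name here is an assumption, not confirmed API):

```go
// Sketch: retry with growing backoff while the model loads (503).
var resp *core.Response // assumed response type name, for illustration
var err error
for attempt := 0; attempt < 5; attempt++ {
    resp, err = client.Chat(model).User(prompt).GetResponse(ctx)
    if err == nil {
        break // success
    }
    var apiErr *core.APIError
    if errors.As(err, &apiErr) && apiErr.StatusCode == 503 {
        time.Sleep(time.Duration(attempt+1) * 10 * time.Second)
        continue
    }
    break // not a loading error; give up
}
```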

Some models need to be “warmed up” before use:

```go
// Check if model is ready
ready, err := provider.IsModelReady(ctx, "meta-llama/Llama-3.1-70B-Instruct")
if err != nil {
    log.Fatal(err)
}
if !ready {
    // Trigger model loading
    _, err := provider.WarmModel(ctx, "meta-llama/Llama-3.1-70B-Instruct")
    if err != nil {
        log.Printf("Model loading: %v", err)
    }
    // Wait and retry
    time.Sleep(30 * time.Second)
}

// Now use the model
resp, err := client.Chat("meta-llama/Llama-3.1-70B-Instruct").
    User(prompt).
    GetResponse(ctx)
```
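A single fixed sleep may be too short for large models. A polling sketch built on the `IsModelReady` call shown above; the deadline and interval are illustrative:

```go
// Poll readiness with a deadline instead of one fixed sleep.
const modelID = "meta-llama/Llama-3.1-70B-Instruct"
deadline := time.Now().Add(5 * time.Minute)
for time.Now().Before(deadline) {
    ready, err := provider.IsModelReady(ctx, modelID)
    if err != nil {
        log.Fatal(err)
    }
    if ready {
        break
    }
    time.Sleep(15 * time.Second)
}
```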
Match the model to the task:

| Task | Recommended Model |
| --- | --- |
| General chat | `meta-llama/Llama-3.1-8B-Instruct` |
| Complex reasoning | `meta-llama/Llama-3.1-70B-Instruct` |
| Code generation | `deepseek-ai/DeepSeek-Coder-V2-Instruct` |
| Fast responses | `meta-llama/Llama-3.2-3B-Instruct` |
| Multilingual | `Qwen/Qwen2.5-72B-Instruct` |
Configure client retries to tolerate model loading (503) and rate limits (429):

```go
client := core.NewClient(provider,
    core.WithRetryPolicy(&core.RetryPolicy{
        MaxRetries:        5,
        InitialInterval:   10 * time.Second, // Longer for model loading
        MaxInterval:       60 * time.Second,
        BackoffMultiplier: 2.0,
        RetryOn:           []int{503, 429},
    }),
)
```
```go
// Larger models need longer timeouts
provider := huggingface.New("hf_...",
    huggingface.WithTimeout(180 * time.Second), // 3 minutes
)
```
  • NewFromEnv() checks HF_TOKEN first, then falls back to HUGGINGFACE_TOKEN
  • Hugging Face hosts thousands of models; you specify the full model ID
  • Uses Authorization: Bearer for authentication
  • Some models require warming up before first use (503 responses)
  • The provider is safe for concurrent use after construction (see the sketch below)
  • Token must have “Make calls to Inference Providers” permission
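
A sketch of the concurrency guarantee in use; the prompts and model are illustrative:

```go
// One shared client across goroutines, relying on the note above.
var wg sync.WaitGroup
prompts := []string{"Summarize Go channels.", "Summarize Go mutexes."}
for _, p := range prompts {
    wg.Add(1)
    go func(prompt string) {
        defer wg.Done()
        resp, err := client.Chat("meta-llama/Llama-3.1-8B-Instruct").
            User(prompt).
            GetResponse(context.Background())
        if err != nil {
            log.Println(err)
            return
        }
        fmt.Println(resp.Output)
    }(p)
}
wg.Wait()
```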

  • Tools Guide: Learn tool calling patterns.
  • Providers Overview: Compare all available providers.