Local & Open Source Providers

Run AI agents completely locally or with open source models. Perfect for privacy-sensitive applications, offline deployments, cost optimization, and development.

Why Use Local Providers?

  • 🔒 Privacy & Security - Data never leaves your infrastructure
  • 💰 Cost Savings - No per-token API costs
  • ⚡ Low Latency - No network round trips for inference
  • 🔌 Offline Capable - Works without internet connection
  • 🎛️ Full Control - Customize models, parameters, and deployment

Supported Local Providers

🦙 Ollama

One of the easiest ways to run local LLMs. Supports Llama 3, Mistral, Gemma, and 50+ other models with automatic model management and GPU acceleration.

Installation

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

# Pull a model
ollama pull llama3.2
ollama pull mistral
ollama pull gemma2

Basic Usage

import { Agent, OllamaProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';

// Initialize Ollama provider
const provider = new OllamaProvider({
  baseUrl: 'http://localhost:11434',
  model: 'llama3.2' // or 'mistral', 'gemma2', etc.
});

// Create agent with local model
const agent = new Agent(
  {
    name: 'local-assistant',
    model: 'llama3.2',
    provider: 'ollama',
    systemPrompt: 'You are a helpful AI assistant running locally.',
    tools: [],
    temperature: 0.7
  },
  provider,
  new ToolRegistry()
);

// Execute locally
const response = await agent.execute(
  'What are the benefits of running AI models locally?'
);

console.log(response.content);

Recommended Models

  • llama3.2:3b - Fast, efficient, great for chat (3GB RAM)
  • llama3.1:8b - Balanced performance and quality (8GB RAM)
  • mistral:7b - Excellent instruction following (7GB RAM)
  • gemma2:9b - Google's efficient model (9GB RAM)
  • qwen2.5:7b - Strong multilingual support (7GB RAM)
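
If you switch between these models per task, it can help to keep the choice in one place. The sketch below is illustrative only: the OllamaProvider constructor matches the example above, while the profile names and helper function are hypothetical.

import { OllamaProvider } from '@lov3kaizen/agentsea-core';

// Hypothetical helper: map a task profile to one of the recommended model tags.
type TaskProfile = 'fast-chat' | 'balanced' | 'multilingual';

const MODEL_BY_PROFILE: Record<TaskProfile, string> = {
  'fast-chat': 'llama3.2:3b',   // smallest footprint, quick replies
  'balanced': 'llama3.1:8b',    // better quality, ~8GB RAM
  'multilingual': 'qwen2.5:7b'  // strong multilingual support
};

function providerFor(profile: TaskProfile): OllamaProvider {
  return new OllamaProvider({
    baseUrl: 'http://localhost:11434',
    model: MODEL_BY_PROFILE[profile]
  });
}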

🚀 llama.cpp

High-performance C++ inference engine with Metal (Mac), CUDA (NVIDIA), and CPU support. Among the fastest options for local inference, especially with quantized GGUF models.

Installation

# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# macOS (with Metal acceleration)
make LLAMA_METAL=1

# Linux with CUDA
make LLAMA_CUDA=1

# Start server
./llama-server \
  -m models/llama-3.2-3b-q4_k_m.gguf \
  --port 8080 \
  --n-gpu-layers 35

Basic Usage

import { Agent, LlamaCppProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';

// Initialize llama.cpp provider
const provider = new LlamaCppProvider({
  baseUrl: 'http://localhost:8080',
  model: 'llama-3.2-3b-q4_k_m'
});

const agent = new Agent(
  {
    name: 'fast-agent',
    model: 'llama-3.2-3b-q4_k_m',
    provider: 'llama-cpp',
    systemPrompt: 'You are a fast, efficient AI assistant.',
    tools: [],
    temperature: 0.7,
    maxTokens: 2048
  },
  provider,
  new ToolRegistry()
);

// Stream responses for real-time output
const stream = await agent.stream('Explain quantum computing');

for await (const chunk of stream) {
  if (chunk.type === 'content') {
    process.stdout.write(chunk.content);
  }
}

Download Quantized Models

Pre-quantized GGUF models for most popular architectures are available on HuggingFace; search for the model name plus "GGUF".

🌐 GPT4All

Privacy-focused local LLM platform with easy-to-use desktop app and Python/TypeScript bindings. Great for beginners and non-technical users.

Installation

# Install GPT4All package
npm install gpt4all

# Or download desktop app
# https://gpt4all.io/

Basic Usage

import { Agent, GPT4AllProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';

// Initialize GPT4All provider
const provider = new GPT4AllProvider({
  model: 'orca-mini-3b-gguf2-q4_0',
  modelPath: './models/' // Optional: custom model directory
});

const agent = new Agent(
  {
    name: 'gpt4all-agent',
    model: 'orca-mini-3b-gguf2-q4_0',
    provider: 'gpt4all',
    systemPrompt: 'You are a helpful assistant.',
    tools: []
  },
  provider,
  new ToolRegistry()
);

const response = await agent.execute('What is machine learning?');
console.log(response.content);

🤗 HuggingFace Transformers

Access thousands of open source models via HuggingFace Inference API or self-hosted endpoints. Supports both cloud and local deployment.

Using Inference API (Free)

import { Agent, HuggingFaceProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';

// Free tier with rate limits
const provider = new HuggingFaceProvider({
  apiKey: process.env.HUGGINGFACE_API_KEY, // Get from hf.co
  model: 'meta-llama/Meta-Llama-3-8B-Instruct'
});

const agent = new Agent(
  {
    name: 'hf-agent',
    model: 'meta-llama/Meta-Llama-3-8B-Instruct',
    provider: 'huggingface',
    systemPrompt: 'You are a helpful AI assistant.',
    tools: []
  },
  provider,
  new ToolRegistry()
);

const response = await agent.execute('Explain neural networks');
console.log(response.content);

Self-Hosted with Text Generation Inference

# Run TGI server locally
docker run -p 8080:80 \
  -v ./models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct

// Connect to the local endpoint
const provider = new HuggingFaceProvider({
  baseUrl: 'http://localhost:8080',
  model: 'meta-llama/Meta-Llama-3-8B-Instruct'
});

Popular Open Source Models

  • meta-llama/Meta-Llama-3.1-8B-Instruct - Meta's latest
  • mistralai/Mistral-7B-Instruct-v0.3 - Efficient instruct model
  • google/gemma-2-9b-it - Google's open model
  • Qwen/Qwen2.5-7B-Instruct - Multilingual support
  • microsoft/Phi-3-mini-4k-instruct - Small but capable (3.8B)

💻 LM Studio

User-friendly desktop app for running local LLMs with a beautiful UI. Includes model discovery, automatic quantization selection, and OpenAI-compatible API server.

Setup

  1. Download LM Studio from lmstudio.ai
  2. Browse and download models from the UI
  3. Start the local server (OpenAI-compatible endpoint)
  4. Connect AgentSea to the server

Basic Usage

import { Agent, OpenAIProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';

// LM Studio uses OpenAI-compatible API
const provider = new OpenAIProvider({
  baseUrl: 'http://localhost:1234/v1',
  apiKey: 'lm-studio', // Any value works for local
  model: 'local-model'
});

const agent = new Agent(
  {
    name: 'lm-studio-agent',
    model: 'local-model',
    provider: 'openai', // Uses OpenAI interface
    systemPrompt: 'You are a helpful assistant.',
    tools: []
  },
  provider,
  new ToolRegistry()
);

const response = await agent.execute('Hello!');
console.log(response.content);

🌪️ Mistral AI (Self-Hosted)

Mistral offers open-weight models that can be self-hosted. Use their official Docker images or deploy via vLLM for production workloads.

Using vLLM (Recommended for Production)

# Deploy Mistral 7B with vLLM
docker run -p 8000:8000 \
  --gpus all \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype float16

// Connect an AgentSea agent
const provider = new OpenAIProvider({
  baseUrl: 'http://localhost:8000/v1',
  apiKey: 'none',
  model: 'mistralai/Mistral-7B-Instruct-v0.3'
});

Provider Comparison

| Provider | Ease of Use | Performance | GPU Required | Best For |
|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Optional | Quick start, development |
| llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Optional | Maximum performance |
| GPT4All | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | No | Beginners, desktop apps |
| LM Studio | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Optional | GUI users, testing models |
| HuggingFace TGI | ⭐⭐⭐ | ⭐⭐⭐⭐ | Recommended | Model variety |
| vLLM | ⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | Production, high throughput |

Tool Calling with Local Models

Many local models don't natively support function calling the way Claude or GPT-4 do. AgentSea provides an automatic tool-calling fallback that uses prompt engineering:

import { Agent, OllamaProvider, ToolRegistry, Calculator, HttpRequest } from '@lov3kaizen/agentsea-core';

const toolRegistry = new ToolRegistry();
toolRegistry.register(new Calculator());
toolRegistry.register(new HttpRequest());

const provider = new OllamaProvider({
  baseUrl: 'http://localhost:11434',
  model: 'llama3.2',
  // Enable tool calling adapter for models without native support
  useToolAdapter: true
});

const agent = new Agent(
  {
    name: 'tool-agent',
    model: 'llama3.2',
    provider: 'ollama',
    systemPrompt: 'You are a helpful assistant with access to tools.',
    tools: [
      { name: 'calculator', description: 'Perform mathematical calculations' },
      { name: 'http_request', description: 'Make HTTP requests to APIs' }
    ]
  },
  provider,
  toolRegistry
);

// Agent will automatically format tool calls in prompts
const response = await agent.execute(
  'What is 47 * 89 + 123?'
);

console.log(response.content);
// Uses calculator tool automatically

Hardware Requirements

Minimum Requirements

  • CPU Only: 8GB RAM, modern CPU (3B models)
  • GPU Recommended: 16GB RAM, NVIDIA GPU with 6GB+ VRAM (7B models)
  • Optimal: 32GB RAM, NVIDIA GPU with 12GB+ VRAM (13B+ models)

| Model Size | RAM (CPU) | VRAM (GPU) | Performance |
|---|---|---|---|
| 3B (Q4) | 4GB | 3GB | Fast, basic tasks |
| 7B (Q4) | 8GB | 6GB | Good balance |
| 13B (Q4) | 16GB | 10GB | High quality |
| 70B (Q4) | 48GB | 40GB | Near GPT-4 level |

Troubleshooting

Connection Refused / Cannot Connect

  • Verify the server is running: curl http://localhost:11434
  • Check firewall settings
  • Ensure correct port in baseUrl
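
If the curl check above succeeds but the agent still fails, it can help to test connectivity from Node itself before constructing a provider. A minimal sketch using the built-in fetch (Node 18+), assuming the default Ollama port:

// Quick health check against the local inference server.
// Adjust the URL if your server runs on a different host or port.
async function checkLocalServer(baseUrl = 'http://localhost:11434'): Promise<boolean> {
  try {
    const res = await fetch(baseUrl);
    console.log(`Server responded: ${res.status} ${await res.text()}`);
    return res.ok;
  } catch (err) {
    console.error(`Cannot reach ${baseUrl}:`, err);
    return false;
  }
}

await checkLocalServer();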

Out of Memory (OOM)

  • Use a smaller model (3B instead of 7B)
  • Try higher quantization (Q4 instead of Q8)
  • Reduce context length (maxTokens)
  • Enable GPU offloading if available
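
For example, the smaller-model and shorter-context suggestions translate directly into the agent configuration shown earlier; the values below are illustrative, not tuned recommendations.

// Low-memory variant of the Ollama agent config from the Basic Usage example.
const lowMemoryConfig = {
  name: 'low-memory-agent',
  model: 'llama3.2:3b',   // 3B model instead of a 7B+ variant
  provider: 'ollama',
  systemPrompt: 'You are a concise assistant.',
  tools: [],
  maxTokens: 512          // shorter generations reduce memory pressure
};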

Slow Response Times

  • Enable GPU acceleration (Metal/CUDA)
  • Use quantized models (Q4_K_M recommended)
  • Reduce batch size or context length
  • Consider switching to llama.cpp for faster inference

Tool Calling Not Working

  • Enable useToolAdapter: true in provider config
  • Use models fine-tuned for instruction following
  • Check system prompt includes tool descriptions
  • Consider using cloud providers for complex tool use

Performance Optimization Tips

🚀 Speed Optimization

  • ✓ Use Q4_K_M quantization (best speed/quality)
  • ✓ Enable GPU layers (--n-gpu-layers 35)
  • ✓ Increase batch size for throughput
  • ✓ Use smaller models (3B-7B) for simple tasks
  • ✓ Enable flash attention if available

🎯 Quality Optimization

  • ✓ Use Q5_K_M or Q8 quantization
  • ✓ Choose larger models (13B-70B)
  • ✓ Adjust temperature (0.7 for creative, 0.1 for factual; see the sketch after this list)
  • ✓ Use instruct-tuned model variants
  • ✓ Provide clear system prompts
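
The temperature advice maps directly onto the agent configuration. A short sketch with two profiles, using the values from the list above; everything else mirrors the earlier Ollama example:

import { Agent, OllamaProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';

const provider = new OllamaProvider({ baseUrl: 'http://localhost:11434', model: 'llama3.1:8b' });

// Low temperature for factual, deterministic answers
const factualAgent = new Agent(
  {
    name: 'factual',
    model: 'llama3.1:8b',
    provider: 'ollama',
    systemPrompt: 'Answer precisely and say when you are unsure.',
    tools: [],
    temperature: 0.1
  },
  provider,
  new ToolRegistry()
);

// Higher temperature for creative output
const creativeAgent = new Agent(
  {
    name: 'creative',
    model: 'llama3.1:8b',
    provider: 'ollama',
    systemPrompt: 'Brainstorm freely and offer several options.',
    tools: [],
    temperature: 0.7
  },
  provider,
  new ToolRegistry()
);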

Production Deployment
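
The Docker Compose example below runs a vLLM inference server (OpenAI-compatible) alongside an AgentSea application container: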

version: '3.8'

services:
  # High-performance inference with vLLM
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --dtype float16
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Your AgentSea application
  agentsea-app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - LOCAL_LLM_URL=http://vllm:8000
    depends_on:
      - vllm
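
Inside the agentsea-app container, the agent reaches the vLLM service through the LOCAL_LLM_URL variable defined above. A minimal sketch, assuming the application appends the OpenAI-compatible /v1 path as in the earlier vLLM example:

import { Agent, OpenAIProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';

// LOCAL_LLM_URL is set in the compose file (http://vllm:8000);
// appending /v1 targets vLLM's OpenAI-compatible endpoint.
const baseUrl = `${process.env.LOCAL_LLM_URL ?? 'http://localhost:8000'}/v1`;

const provider = new OpenAIProvider({
  baseUrl,
  apiKey: 'none', // vLLM does not require a key by default
  model: 'mistralai/Mistral-7B-Instruct-v0.3'
});

const agent = new Agent(
  {
    name: 'production-agent',
    model: 'mistralai/Mistral-7B-Instruct-v0.3',
    provider: 'openai',
    systemPrompt: 'You are a helpful assistant.',
    tools: []
  },
  provider,
  new ToolRegistry()
);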

Ready to Get Started?

Start with Ollama for the easiest setup, then explore other providers based on your needs. All local providers work seamlessly with AgentSea's agent framework, tools, and workflows.

Next Steps