Local & Open Source Providers
Run AI agents completely locally or with open source models. Perfect for privacy-sensitive applications, offline deployments, cost optimization, and development.
Why Use Local Providers?
- 🔒 Privacy & Security - Data never leaves your infrastructure
- 💰 Cost Savings - No per-token API costs
- ⚡ Low Latency - No network round trips for inference
- 🔌 Offline Capable - Works without internet connection
- 🎛️ Full Control - Customize models, parameters, and deployment
Supported Local Providers
🦙 Ollama
The easiest way to run local LLMs. Supports Llama 3, Mistral, Gemma, and 50+ other models with automatic model management and GPU acceleration.
Installation
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
# Pull a model
ollama pull llama3.2
ollama pull mistral
ollama pull gemma2
Basic Usage
import { Agent, OllamaProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Initialize Ollama provider
const provider = new OllamaProvider({
baseUrl: 'http://localhost:11434',
model: 'llama3.2' // or 'mistral', 'gemma2', etc.
});
// Create agent with local model
const agent = new Agent(
{
name: 'local-assistant',
model: 'llama3.2',
provider: 'ollama',
systemPrompt: 'You are a helpful AI assistant running locally.',
tools: [],
temperature: 0.7
},
provider,
new ToolRegistry()
);
// Execute locally
const response = await agent.execute(
'What are the benefits of running AI models locally?'
);
console.log(response.content);
Recommended Models
- llama3.2:3b - Fast, efficient, great for chat (3GB RAM)
- llama3.1:8b - Balanced performance and quality (8GB RAM)
- mistral:7b - Excellent instruction following (7GB RAM)
- gemma2:9b - Google's efficient model (9GB RAM)
- qwen2.5:7b - Strong multilingual support (7GB RAM)
🚀 llama.cpp
High-performance C++ inference engine with Metal (Mac), CUDA (NVIDIA), and CPU support. Provides some of the fastest local inference available when paired with quantized models.
Installation
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# macOS (with Metal acceleration)
make LLAMA_METAL=1
# Linux with CUDA
make LLAMA_CUDA=1
# Start server
./llama-server \
-m models/llama-3.2-3b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 35
Basic Usage
import { Agent, LlamaCppProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Initialize llama.cpp provider
const provider = new LlamaCppProvider({
baseUrl: 'http://localhost:8080',
model: 'llama-3.2-3b-q4_k_m'
});
const agent = new Agent(
{
name: 'fast-agent',
model: 'llama-3.2-3b-q4_k_m',
provider: 'llama-cpp',
systemPrompt: 'You are a fast, efficient AI assistant.',
tools: [],
temperature: 0.7,
maxTokens: 2048
},
provider,
new ToolRegistry()
);
// Stream responses for real-time output
const stream = await agent.stream('Explain quantum computing');
for await (const chunk of stream) {
if (chunk.type === 'content') {
process.stdout.write(chunk.content);
}
}
Download Quantized Models
Get pre-quantized GGUF models from HuggingFace:
- bartowski's collection - Wide variety of quantized models
- TheBloke's collection - Popular GGUF conversions
- QuantFactory - High-quality quantizations
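If you prefer to script the download, GGUF files can be fetched directly from a repository's resolve URL. The sketch below is illustrative only: the repository and file names are examples (check the collections above for the exact names you want), and it assumes Node 18+ for the global fetch and an existing models/ directory.
// download-gguf.ts: minimal sketch for fetching a quantized GGUF file (Node 18+).
// The repository and file names below are examples only; substitute the model you want.
// Assumes a local models/ directory already exists.
import { createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
const repo = 'bartowski/Llama-3.2-3B-Instruct-GGUF'; // example repository
const file = 'Llama-3.2-3B-Instruct-Q4_K_M.gguf';    // example quantization
const url = `https://huggingface.co/${repo}/resolve/main/${file}`;
const res = await fetch(url);
if (!res.ok || !res.body) {
  throw new Error(`Download failed: ${res.status} ${res.statusText}`);
}
// Stream the multi-gigabyte file to disk instead of buffering it in memory
await pipeline(Readable.fromWeb(res.body as any), createWriteStream(`models/${file}`));
console.log(`Saved models/${file}`);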
🌐 GPT4All
Privacy-focused local LLM platform with easy-to-use desktop app and Python/TypeScript bindings. Great for beginners and non-technical users.
Installation
# Install GPT4All package
npm install gpt4all
# Or download desktop app
# https://gpt4all.io/
Basic Usage
import { Agent, GPT4AllProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Initialize GPT4All provider
const provider = new GPT4AllProvider({
model: 'orca-mini-3b-gguf2-q4_0',
modelPath: './models/' // Optional: custom model directory
});
const agent = new Agent(
{
name: 'gpt4all-agent',
model: 'orca-mini-3b-gguf2-q4_0',
provider: 'gpt4all',
systemPrompt: 'You are a helpful assistant.',
tools: []
},
provider,
new ToolRegistry()
);
const response = await agent.execute('What is machine learning?');
console.log(response.content);
🤗 HuggingFace Transformers
Access thousands of open source models via HuggingFace Inference API or self-hosted endpoints. Supports both cloud and local deployment.
Using Inference API (Free)
import { Agent, HuggingFaceProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Free tier with rate limits
const provider = new HuggingFaceProvider({
apiKey: process.env.HUGGINGFACE_API_KEY, // Get from hf.co
model: 'meta-llama/Meta-Llama-3-8B-Instruct'
});
const agent = new Agent(
{
name: 'hf-agent',
model: 'meta-llama/Meta-Llama-3-8B-Instruct',
provider: 'huggingface',
systemPrompt: 'You are a helpful AI assistant.',
tools: []
},
provider,
new ToolRegistry()
);
const response = await agent.execute('Explain neural networks');
console.log(response.content);
Self-Hosted with Text Generation Inference
# Run TGI server locally
docker run -p 8080:80 \
-v ./models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Meta-Llama-3-8B-Instruct
// Connect to the local endpoint
const provider = new HuggingFaceProvider({
baseUrl: 'http://localhost:8080',
model: 'meta-llama/Meta-Llama-3-8B-Instruct'
});
Popular Open Source Models
- meta-llama/Meta-Llama-3.1-8B-Instruct - Meta's latest
- mistralai/Mistral-7B-Instruct-v0.3 - Efficient instruct model
- google/gemma-2-9b-it - Google's open model
- Qwen/Qwen2.5-7B-Instruct - Multilingual support
- microsoft/Phi-3-mini-4k-instruct - Small but capable (3.8B)
💻 LM Studio
User-friendly desktop app for running local LLMs with a beautiful UI. Includes model discovery, automatic quantization selection, and OpenAI-compatible API server.
Setup
- Download LM Studio from lmstudio.ai
- Browse and download models from the UI
- Start the local server (OpenAI-compatible endpoint)
- Connect AgentSea to the server
Basic Usage
import { Agent, OpenAIProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// LM Studio uses OpenAI-compatible API
const provider = new OpenAIProvider({
baseUrl: 'http://localhost:1234/v1',
apiKey: 'lm-studio', // Any value works for local
model: 'local-model'
});
const agent = new Agent(
{
name: 'lm-studio-agent',
model: 'local-model',
provider: 'openai', // Uses OpenAI interface
systemPrompt: 'You are a helpful assistant.',
tools: []
},
provider,
new ToolRegistry()
);
const response = await agent.execute('Hello!');
console.log(response.content);
🌪️ Mistral AI (Self-Hosted)
Mistral offers open-weight models that can be self-hosted. Use their official Docker images or deploy via vLLM for production workloads.
Using vLLM (Recommended for Production)
# Deploy Mistral 7B with vLLM
docker run -p 8000:8000 \
--gpus all \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype float16
// Connect an AgentSea agent
const provider = new OpenAIProvider({
baseUrl: 'http://localhost:8000/v1',
apiKey: 'none',
model: 'mistralai/Mistral-7B-Instruct-v0.3'
});
Provider Comparison
| Provider | Ease of Use | Performance | GPU Required | Best For |
|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Optional | Quick start, development |
| llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Optional | Maximum performance |
| GPT4All | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | No | Beginners, desktop apps |
| LM Studio | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Optional | GUI users, testing models |
| HuggingFace TGI | ⭐⭐⭐ | ⭐⭐⭐⭐ | Recommended | Model variety |
| vLLM | ⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | Production, high throughput |
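Because every provider above plugs into the same Agent constructor, switching between them is mostly a configuration change. The sketch below shows one way to pick a provider at runtime; it reuses only the provider classes and constructor options shown in the examples on this page, and the LOCAL_PROVIDER, LOCAL_LLM_URL, LOCAL_LLM_API_KEY, and LOCAL_LLM_MODEL environment variables are illustrative, not part of the framework.
// provider-factory.ts: illustrative sketch for selecting a local provider at runtime.
// Uses only the provider classes and options shown in the examples above.
import { OllamaProvider, LlamaCppProvider, OpenAIProvider } from '@lov3kaizen/agentsea-core';
export function createLocalProvider() {
  const kind = process.env.LOCAL_PROVIDER ?? 'ollama';
  switch (kind) {
    case 'ollama':
      return new OllamaProvider({
        baseUrl: process.env.LOCAL_LLM_URL ?? 'http://localhost:11434',
        model: process.env.LOCAL_LLM_MODEL ?? 'llama3.2'
      });
    case 'llama-cpp':
      return new LlamaCppProvider({
        baseUrl: process.env.LOCAL_LLM_URL ?? 'http://localhost:8080',
        model: process.env.LOCAL_LLM_MODEL ?? 'llama-3.2-3b-q4_k_m'
      });
    // LM Studio and vLLM both expose OpenAI-compatible endpoints
    case 'openai-compatible':
      return new OpenAIProvider({
        baseUrl: process.env.LOCAL_LLM_URL ?? 'http://localhost:1234/v1',
        apiKey: process.env.LOCAL_LLM_API_KEY ?? 'not-needed',
        model: process.env.LOCAL_LLM_MODEL ?? 'local-model'
      });
    default:
      throw new Error(`Unknown LOCAL_PROVIDER: ${kind}`);
  }
}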
Tool Calling with Local Models
Many local models don't natively support function calling the way Claude or GPT-4 do. AgentSea provides an automatic tool-calling fallback that uses prompt engineering:
import { Agent, OllamaProvider, ToolRegistry, Calculator, HttpRequest } from '@lov3kaizen/agentsea-core';
const toolRegistry = new ToolRegistry();
toolRegistry.register(new Calculator());
toolRegistry.register(new HttpRequest());
const provider = new OllamaProvider({
baseUrl: 'http://localhost:11434',
model: 'llama3.2',
// Enable tool calling adapter for models without native support
useToolAdapter: true
});
const agent = new Agent(
{
name: 'tool-agent',
model: 'llama3.2',
provider: 'ollama',
systemPrompt: 'You are a helpful assistant with access to tools.',
tools: [
{ name: 'calculator', description: 'Perform mathematical calculations' },
{ name: 'http_request', description: 'Make HTTP requests to APIs' }
]
},
provider,
toolRegistry
);
// Agent will automatically format tool calls in prompts
const response = await agent.execute(
'What is 47 * 89 + 123?'
);
console.log(response.content);
// Uses calculator tool automatically
Hardware Requirements
Minimum Requirements
- CPU Only: 8GB RAM, modern CPU (3B models)
- GPU Recommended: 16GB RAM, NVIDIA GPU with 6GB+ VRAM (7B models)
- Optimal: 32GB RAM, NVIDIA GPU with 12GB+ VRAM (13B+ models)
| Model Size | RAM (CPU) | VRAM (GPU) | Performance |
|---|---|---|---|
| 3B (Q4) | 4GB | 3GB | Fast, basic tasks |
| 7B (Q4) | 8GB | 6GB | Good balance |
| 13B (Q4) | 16GB | 10GB | High quality |
| 70B (Q4) | 48GB | 40GB | Near GPT-4 level |
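As a rough rule of thumb, memory use is the parameter count times the bytes per weight at the chosen quantization, plus overhead for the KV cache and runtime buffers. The helper below is a back-of-the-envelope sketch, not an exact formula; the 0.6 bytes-per-parameter figure for Q4_K_M and the 25% overhead are assumptions that vary with context length and backend.
// memory-estimate.ts: back-of-the-envelope sizing helper (rough approximation only).
// Q4_K_M stores roughly 4.5-5 bits (~0.6 bytes) per parameter; the 25% overhead
// for KV cache and runtime buffers is an assumption and grows with context length.
function estimateMemoryGB(paramsBillion: number, bytesPerParam = 0.6, overhead = 1.25): number {
  return paramsBillion * bytesPerParam * overhead;
}
console.log(estimateMemoryGB(7).toFixed(1));  // ≈ 5.3 GB, in line with the 7B row above
console.log(estimateMemoryGB(13).toFixed(1)); // ≈ 9.8 GB, in line with the 13B row above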
Troubleshooting
Connection Refused / Cannot Connect
- Verify the server is running: curl http://localhost:11434 (see the reachability sketch below)
- Check firewall settings
- Ensure correct port in baseUrl
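If you want to fail fast at startup, a small reachability probe before constructing the agent gives a clearer error than a mid-request timeout. This is a minimal sketch assuming Node 18+ (global fetch); point it at whichever baseUrl your provider uses.
// health-check.ts: minimal reachability probe before creating an agent.
// Assumes only that the provider's baseUrl answers a plain HTTP GET;
// Ollama's root endpoint, for instance, replies with "Ollama is running".
async function isServerUp(baseUrl: string, timeoutMs = 3000): Promise<boolean> {
  try {
    const res = await fetch(baseUrl, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false;
  }
}
if (!(await isServerUp('http://localhost:11434'))) {
  console.error('Local LLM server is not reachable; check that it is running on the expected port.');
  process.exit(1);
}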
Out of Memory (OOM)
- Use a smaller model (3B instead of 7B)
- Try higher quantization (Q4 instead of Q8)
- Reduce context length (maxTokens)
- Enable GPU offloading if available
Slow Response Times
- Enable GPU acceleration (Metal/CUDA)
- Use quantized models (Q4_K_M recommended)
- Reduce batch size or context length
- Consider switching to llama.cpp for faster inference
Tool Calling Not Working
- Enable useToolAdapter: true in provider config
- Use models fine-tuned for instruction following
- Check system prompt includes tool descriptions
- Consider using cloud providers for complex tool use
Performance Optimization Tips
🚀 Speed Optimization
- ✓ Use Q4_K_M quantization (best speed/quality)
- ✓ Enable GPU layers (--n-gpu-layers 35)
- ✓ Increase batch size for throughput
- ✓ Use smaller models (3B-7B) for simple tasks
- ✓ Enable flash attention if available
🎯 Quality Optimization
- ✓ Use Q5_K_M or Q8 quantization
- ✓ Choose larger models (13B-70B)
- ✓ Adjust temperature (0.7 for creative, 0.1 for factual)
- ✓ Use instruct-tuned model variants
- ✓ Provide clear system prompts
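To make the temperature guidance above concrete, here is a minimal sketch of an agent tuned for factual answers, assuming the Ollama setup shown earlier on this page; the model name and token limit are illustrative.
import { Agent, OllamaProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Illustrative config only: low temperature for factual tasks, modest maxTokens
// to keep latency down. Use 0.7 or higher for creative or open-ended work.
const provider = new OllamaProvider({
  baseUrl: 'http://localhost:11434',
  model: 'llama3.1:8b'
});
const factualAgent = new Agent(
  {
    name: 'factual-agent',
    model: 'llama3.1:8b',
    provider: 'ollama',
    systemPrompt: 'Answer concisely and stick to well-established facts.',
    tools: [],
    temperature: 0.1, // low temperature for factual accuracy
    maxTokens: 512    // shorter outputs reduce latency
  },
  provider,
  new ToolRegistry()
);
const answer = await factualAgent.execute('What is the capital of Australia?');
console.log(answer.content);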
Production Deployment
version: '3.8'
services:
  # High-performance inference with vLLM
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --dtype float16
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  # Your AgentSea application
  agentsea-app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - LOCAL_LLM_URL=http://vllm:8000
    depends_on:
      - vllm
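Inside the agentsea-app container, the vLLM service is reachable over the compose network via the LOCAL_LLM_URL variable defined above. The snippet below is a minimal sketch of that wiring, reusing the OpenAI-compatible provider from the vLLM example; the exact structure of your application code will differ.
// Inside the agentsea-app container: connect to the vLLM service over the
// compose network. LOCAL_LLM_URL comes from the compose file above; the /v1
// suffix and placeholder API key follow the vLLM example earlier on this page.
import { Agent, OpenAIProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
const provider = new OpenAIProvider({
  baseUrl: `${process.env.LOCAL_LLM_URL ?? 'http://localhost:8000'}/v1`,
  apiKey: 'none', // vLLM does not require a real key
  model: 'mistralai/Mistral-7B-Instruct-v0.3'
});
const agent = new Agent(
  {
    name: 'production-agent',
    model: 'mistralai/Mistral-7B-Instruct-v0.3',
    provider: 'openai',
    systemPrompt: 'You are a helpful assistant.',
    tools: []
  },
  provider,
  new ToolRegistry()
);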
Ready to Get Started?
Start with Ollama for the easiest setup, then explore other providers based on your needs. All local providers work seamlessly with AgentSea's agent framework, tools, and workflows.