Local & Open Source Providers
Run AI agents completely locally or with open source models. Perfect for privacy-sensitive applications, offline deployments, cost optimization, and development.
Why Use Local Providers?
- 🔒 Privacy & Security - Data never leaves your infrastructure
- 💰 Cost Savings - No per-token API costs
- ⚡ Low Latency - No network round trips for inference
- 🔌 Offline Capable - Works without internet connection
- 🎛️ Full Control - Customize models, parameters, and deployment
Supported Local Providers
🦙 Ollama
The easiest way to run local LLMs. Supports Llama 3, Mistral, Gemma, and 50+ other models with automatic model management and GPU acceleration.
Installation
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
# Pull a model
ollama pull llama3.2
ollama pull mistral
ollama pull gemma2
Basic Usage
import { Agent, OllamaProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Initialize Ollama provider
const provider = new OllamaProvider({
baseUrl: 'http://localhost:11434',
model: 'llama3.2' // or 'mistral', 'gemma2', etc.
});
// Create agent with local model
const agent = new Agent(
{
name: 'local-assistant',
model: 'llama3.2',
provider: 'ollama',
systemPrompt: 'You are a helpful AI assistant running locally.',
tools: [],
temperature: 0.7
},
provider,
new ToolRegistry()
);
// Execute locally
const response = await agent.execute(
'What are the benefits of running AI models locally?'
);
console.log(response.content);
Recommended Models
- llama3.2:3b - Fast, efficient, great for chat (3GB RAM)
- llama3.1:8b - Balanced performance and quality (8GB RAM)
- mistral:7b - Excellent instruction following (7GB RAM)
- gemma2:9b - Google's efficient model (9GB RAM)
- qwen2.5:7b - Strong multilingual support (7GB RAM)
🚀 llama.cpp
High-performance C++ inference engine with Metal (Mac), CUDA (NVIDIA), and CPU support. Provides some of the fastest local inference available when paired with quantized models.
Installation
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# macOS (with Metal acceleration)
make LLAMA_METAL=1
# Linux with CUDA
make LLAMA_CUDA=1
# Start server
./llama-server \
-m models/llama-3.2-3b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 35
Basic Usage
import { Agent, LlamaCppProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Initialize llama.cpp provider
const provider = new LlamaCppProvider({
baseUrl: 'http://localhost:8080',
model: 'llama-3.2-3b-q4_k_m'
});
const agent = new Agent(
{
name: 'fast-agent',
model: 'llama-3.2-3b-q4_k_m',
provider: 'llama-cpp',
systemPrompt: 'You are a fast, efficient AI assistant.',
tools: [],
temperature: 0.7,
maxTokens: 2048
},
provider,
new ToolRegistry()
);
// Stream responses for real-time output
const stream = await agent.stream('Explain quantum computing');
for await (const chunk of stream) {
if (chunk.type === 'content') {
process.stdout.write(chunk.content);
}
}
Download Quantized Models
Get pre-quantized GGUF models from HuggingFace:
- bartowski's collection - Wide variety of quantized models
- TheBloke's collection - Popular GGUF conversions
- QuantFactory - High-quality quantizations
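If you prefer to script the download, GGUF files can be fetched directly from a repository's resolve URL. The sketch below is illustrative only: the repository and file names are examples (check the collections above for the exact names you want), and it assumes Node 18+ for the global fetch and an existing models/ directory.
// download-gguf.ts: minimal sketch for fetching a quantized GGUF file (Node 18+).
// The repository and file names below are examples only; substitute the model you want.
// Assumes a local models/ directory already exists.
import { createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
const repo = 'bartowski/Llama-3.2-3B-Instruct-GGUF'; // example repository
const file = 'Llama-3.2-3B-Instruct-Q4_K_M.gguf';    // example quantization
const url = `https://huggingface.co/${repo}/resolve/main/${file}`;
const res = await fetch(url);
if (!res.ok || !res.body) {
  throw new Error(`Download failed: ${res.status} ${res.statusText}`);
}
// Stream the multi-gigabyte file to disk instead of buffering it in memory
await pipeline(Readable.fromWeb(res.body as any), createWriteStream(`models/${file}`));
console.log(`Saved models/${file}`);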
🌐 GPT4All
Privacy-focused local LLM platform with easy-to-use desktop app and Python/TypeScript bindings. Great for beginners and non-technical users.
Installation
# Install GPT4All package
npm install gpt4all
# Or download desktop app
# https://gpt4all.io/
Basic Usage
import { Agent, GPT4AllProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Initialize GPT4All provider
const provider = new GPT4AllProvider({
model: 'orca-mini-3b-gguf2-q4_0',
modelPath: './models/' // Optional: custom model directory
});
const agent = new Agent(
{
name: 'gpt4all-agent',
model: 'orca-mini-3b-gguf2-q4_0',
provider: 'gpt4all',
systemPrompt: 'You are a helpful assistant.',
tools: []
},
provider,
new ToolRegistry()
);
const response = await agent.execute('What is machine learning?');
console.log(response.content);
🤗 HuggingFace Transformers
Access thousands of open source models via HuggingFace Inference API or self-hosted endpoints. Supports both cloud and local deployment.
Using Inference API (Free)
import { Agent, HuggingFaceProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Free tier with rate limits
const provider = new HuggingFaceProvider({
apiKey: process.env.HUGGINGFACE_API_KEY, // Get from hf.co
model: 'meta-llama/Meta-Llama-3-8B-Instruct'
});
const agent = new Agent(
{
name: 'hf-agent',
model: 'meta-llama/Meta-Llama-3-8B-Instruct',
provider: 'huggingface',
systemPrompt: 'You are a helpful AI assistant.',
tools: []
},
provider,
new ToolRegistry()
);
const response = await agent.execute('Explain neural networks');
console.log(response.content);
Self-Hosted with Text Generation Inference
# Run TGI server locally
docker run -p 8080:80 \
-v ./models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Meta-Llama-3-8B-Instruct
// Connect to the local endpoint
const provider = new HuggingFaceProvider({
baseUrl: 'http://localhost:8080',
model: 'meta-llama/Meta-Llama-3-8B-Instruct'
});
Popular Open Source Models
- meta-llama/Meta-Llama-3.1-8B-Instruct - Meta's latest
- mistralai/Mistral-7B-Instruct-v0.3 - Efficient instruct model
- google/gemma-2-9b-it - Google's open model
- Qwen/Qwen2.5-7B-Instruct - Multilingual support
- microsoft/Phi-3-mini-4k-instruct - Small but capable (3.8B)
💻 LM Studio
User-friendly desktop app for running local LLMs with a beautiful UI. Includes model discovery, automatic quantization selection, and OpenAI-compatible API server.
Setup
- Download LM Studio from lmstudio.ai
- Browse and download models from the UI
- Start the local server (OpenAI-compatible endpoint)
- Connect AgentSea to the server
Basic Usage
import { Agent, OpenAIProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// LM Studio uses OpenAI-compatible API
const provider = new OpenAIProvider({
baseUrl: 'http://localhost:1234/v1',
apiKey: 'lm-studio', // Any value works for local
model: 'local-model'
});
const agent = new Agent(
{
name: 'lm-studio-agent',
model: 'local-model',
provider: 'openai', // Uses OpenAI interface
systemPrompt: 'You are a helpful assistant.',
tools: []
},
provider,
new ToolRegistry()
);
const response = await agent.execute('Hello!');
console.log(response.content);
🌪️ Mistral AI (Self-Hosted)
Mistral offers open-weight models that can be self-hosted. Use their official Docker images or deploy via vLLM for production workloads.
Using vLLM (Recommended for Production)
# Deploy Mistral 7B with vLLM
docker run -p 8000:8000 \
--gpus all \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype float16
// Connect an AgentSea agent
const provider = new OpenAIProvider({
baseUrl: 'http://localhost:8000/v1',
apiKey: 'none',
model: 'mistralai/Mistral-7B-Instruct-v0.3'
});
Provider Comparison
| Provider | Ease of Use | Performance | GPU Required | Best For |
|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Optional | Quick start, development |
| llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Optional | Maximum performance |
| GPT4All | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | No | Beginners, desktop apps |
| LM Studio | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Optional | GUI users, testing models |
| HuggingFace TGI | ⭐⭐⭐ | ⭐⭐⭐⭐ | Recommended | Model variety |
| vLLM | ⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | Production, high throughput |
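Because every provider above plugs into the same Agent constructor, switching between them is mostly a configuration change. The sketch below shows one way to pick a provider at runtime; it reuses only the provider classes and constructor options shown in the examples on this page, and the LOCAL_PROVIDER, LOCAL_LLM_URL, LOCAL_LLM_API_KEY, and LOCAL_LLM_MODEL environment variables are illustrative, not part of the framework.
// provider-factory.ts: illustrative sketch for selecting a local provider at runtime.
// Uses only the provider classes and options shown in the examples above.
import { OllamaProvider, LlamaCppProvider, OpenAIProvider } from '@lov3kaizen/agentsea-core';
export function createLocalProvider() {
  const kind = process.env.LOCAL_PROVIDER ?? 'ollama';
  switch (kind) {
    case 'ollama':
      return new OllamaProvider({
        baseUrl: process.env.LOCAL_LLM_URL ?? 'http://localhost:11434',
        model: process.env.LOCAL_LLM_MODEL ?? 'llama3.2'
      });
    case 'llama-cpp':
      return new LlamaCppProvider({
        baseUrl: process.env.LOCAL_LLM_URL ?? 'http://localhost:8080',
        model: process.env.LOCAL_LLM_MODEL ?? 'llama-3.2-3b-q4_k_m'
      });
    // LM Studio and vLLM both expose OpenAI-compatible endpoints
    case 'openai-compatible':
      return new OpenAIProvider({
        baseUrl: process.env.LOCAL_LLM_URL ?? 'http://localhost:1234/v1',
        apiKey: process.env.LOCAL_LLM_API_KEY ?? 'not-needed',
        model: process.env.LOCAL_LLM_MODEL ?? 'local-model'
      });
    default:
      throw new Error(`Unknown LOCAL_PROVIDER: ${kind}`);
  }
}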
Tool Calling with Local Models
Many local models don't natively support function calling the way Claude or GPT-4 do. AgentSea provides an automatic tool-calling fallback that uses prompt engineering:
import { Agent, OllamaProvider, ToolRegistry, Calculator, HttpRequest } from '@lov3kaizen/agentsea-core';
const toolRegistry = new ToolRegistry();
toolRegistry.register(new Calculator());
toolRegistry.register(new HttpRequest());
const provider = new OllamaProvider({
baseUrl: 'http://localhost:11434',
model: 'llama3.2',
// Enable tool calling adapter for models without native support
useToolAdapter: true
});
const agent = new Agent(
{
name: 'tool-agent',
model: 'llama3.2',
provider: 'ollama',
systemPrompt: 'You are a helpful assistant with access to tools.',
tools: [
{ name: 'calculator', description: 'Perform mathematical calculations' },
{ name: 'http_request', description: 'Make HTTP requests to APIs' }
]
},
provider,
toolRegistry
);
// Agent will automatically format tool calls in prompts
const response = await agent.execute(
'What is 47 * 89 + 123?'
);
console.log(response.content);
// Uses calculator tool automatically
Hardware Requirements
Minimum Requirements
- CPU Only: 8GB RAM, modern CPU (3B models)
- GPU Recommended: 16GB RAM, NVIDIA GPU with 6GB+ VRAM (7B models)
- Optimal: 32GB RAM, NVIDIA GPU with 12GB+ VRAM (13B+ models)
| Model Size | RAM (CPU) | VRAM (GPU) | Performance |
|---|---|---|---|
| 3B (Q4) | 4GB | 3GB | Fast, basic tasks |
| 7B (Q4) | 8GB | 6GB | Good balance |
| 13B (Q4) | 16GB | 10GB | High quality |
| 70B (Q4) | 48GB | 40GB | Near GPT-4 level |
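As a rough rule of thumb, memory use is the parameter count times the bytes per weight at the chosen quantization, plus overhead for the KV cache and runtime buffers. The helper below is a back-of-the-envelope sketch, not an exact formula; the 0.6 bytes-per-parameter figure for Q4_K_M and the 25% overhead are assumptions that vary with context length and backend.
// memory-estimate.ts: back-of-the-envelope sizing helper (rough approximation only).
// Q4_K_M stores roughly 4.5-5 bits (~0.6 bytes) per parameter; the 25% overhead
// for KV cache and runtime buffers is an assumption and grows with context length.
function estimateMemoryGB(paramsBillion: number, bytesPerParam = 0.6, overhead = 1.25): number {
  return paramsBillion * bytesPerParam * overhead;
}
console.log(estimateMemoryGB(7).toFixed(1));  // ≈ 5.3 GB, in line with the 7B row above
console.log(estimateMemoryGB(13).toFixed(1)); // ≈ 9.8 GB, in line with the 13B row above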
Troubleshooting
Connection Refused / Cannot Connect
- Verify the server is running: curl http://localhost:11434 (see the reachability sketch below)
- Check firewall settings
- Ensure correct port in baseUrl
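If you want to fail fast at startup, a small reachability probe before constructing the agent gives a clearer error than a mid-request timeout. This is a minimal sketch assuming Node 18+ (global fetch); point it at whichever baseUrl your provider uses.
// health-check.ts: minimal reachability probe before creating an agent.
// Assumes only that the provider's baseUrl answers a plain HTTP GET;
// Ollama's root endpoint, for instance, replies with "Ollama is running".
async function isServerUp(baseUrl: string, timeoutMs = 3000): Promise<boolean> {
  try {
    const res = await fetch(baseUrl, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false;
  }
}
if (!(await isServerUp('http://localhost:11434'))) {
  console.error('Local LLM server is not reachable; check that it is running on the expected port.');
  process.exit(1);
}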
Out of Memory (OOM)
- Use a smaller model (3B instead of 7B)
- Try higher quantization (Q4 instead of Q8)
- Reduce context length (maxTokens)
- Enable GPU offloading if available
Slow Response Times
- Enable GPU acceleration (Metal/CUDA)
- Use quantized models (Q4_K_M recommended)
- Reduce batch size or context length
- Consider switching to llama.cpp for faster inference
Tool Calling Not Working
- Enable useToolAdapter: true in provider config
- Use models fine-tuned for instruction following
- Check system prompt includes tool descriptions
- Consider using cloud providers for complex tool use
Performance Optimization Tips
🚀 Speed Optimization
- ✓ Use Q4_K_M quantization (best speed/quality)
- ✓ Enable GPU layers (--n-gpu-layers 35)
- ✓ Increase batch size for throughput
- ✓ Use smaller models (3B-7B) for simple tasks
- ✓ Enable flash attention if available
🎯 Quality Optimization
- ✓ Use Q5_K_M or Q8 quantization
- ✓ Choose larger models (13B-70B)
- ✓ Adjust temperature (0.7 for creative, 0.1 for factual)
- ✓ Use instruct-tuned model variants
- ✓ Provide clear system prompts
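To make the temperature guidance above concrete, here is a minimal sketch of an agent tuned for factual answers, assuming the Ollama setup shown earlier on this page; the model name and token limit are illustrative.
import { Agent, OllamaProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
// Illustrative config only: low temperature for factual tasks, modest maxTokens
// to keep latency down. Use 0.7 or higher for creative or open-ended work.
const provider = new OllamaProvider({
  baseUrl: 'http://localhost:11434',
  model: 'llama3.1:8b'
});
const factualAgent = new Agent(
  {
    name: 'factual-agent',
    model: 'llama3.1:8b',
    provider: 'ollama',
    systemPrompt: 'Answer concisely and stick to well-established facts.',
    tools: [],
    temperature: 0.1, // low temperature for factual accuracy
    maxTokens: 512    // shorter outputs reduce latency
  },
  provider,
  new ToolRegistry()
);
const answer = await factualAgent.execute('What is the capital of Australia?');
console.log(answer.content);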
Production Deployment
version: '3.8'
services:
  # High-performance inference with vLLM
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --dtype float16
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  # Your AgentSea application
  agentsea-app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - LOCAL_LLM_URL=http://vllm:8000
    depends_on:
      - vllm
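Inside the agentsea-app container, the vLLM service is reachable over the compose network via the LOCAL_LLM_URL variable defined above. The snippet below is a minimal sketch of that wiring, reusing the OpenAI-compatible provider from the vLLM example; the exact structure of your application code will differ.
// Inside the agentsea-app container: connect to the vLLM service over the
// compose network. LOCAL_LLM_URL comes from the compose file above; the /v1
// suffix and placeholder API key follow the vLLM example earlier on this page.
import { Agent, OpenAIProvider, ToolRegistry } from '@lov3kaizen/agentsea-core';
const provider = new OpenAIProvider({
  baseUrl: `${process.env.LOCAL_LLM_URL ?? 'http://localhost:8000'}/v1`,
  apiKey: 'none', // vLLM does not require a real key
  model: 'mistralai/Mistral-7B-Instruct-v0.3'
});
const agent = new Agent(
  {
    name: 'production-agent',
    model: 'mistralai/Mistral-7B-Instruct-v0.3',
    provider: 'openai',
    systemPrompt: 'You are a helpful assistant.',
    tools: []
  },
  provider,
  new ToolRegistry()
);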
Ready to Get Started?
Start with Ollama for the easiest setup, then explore other providers based on your needs. All local providers work seamlessly with AgentSea's agent framework, tools, and workflows.