v0.5.2 release - Contributors, Sponsors and Enquiries are most welcome 😌

LLM Evaluation

Comprehensive feedback collection and evaluation platform. Build production-ready pipelines with automated metrics, LLM-as-Judge, and human feedback.

The Evaluate package provides metrics, LLM-as-Judge, human feedback collection, and preference dataset generation for RLHF/DPO training.

Installation

bash
pnpm add @lov3kaizen/agentsea-evaluate

Key Features

📊 Built-in Metrics: Accuracy, relevance, coherence, toxicity, and more
⚖️ LLM-as-Judge: Rubric-based and comparative scoring
👥 Human Feedback: Ratings, rankings, and corrections
📦 Dataset Management: Create and import datasets with HuggingFace integration
🔄 Continuous Evaluation: Monitor production quality with alerts
🎯 Preference Learning: Generate datasets for RLHF/DPO training

Quick Start

typescript
import {
  EvaluationPipeline,
  AccuracyMetric,
  RelevanceMetric,
  EvalDataset,
} from '@lov3kaizen/agentsea-evaluate';

// Create metrics
const accuracy = new AccuracyMetric({ type: 'fuzzy' });
const relevance = new RelevanceMetric();

// Create evaluation pipeline
const pipeline = new EvaluationPipeline({
  metrics: [accuracy, relevance],
  parallelism: 5,
});

// Create dataset
const dataset = new EvalDataset({
  items: [
    {
      id: '1',
      input: 'What is the capital of France?',
      expectedOutput: 'Paris',
    },
    {
      id: '2',
      input: 'What is 2 + 2?',
      expectedOutput: '4',
    },
  ],
});

// Run evaluation
const results = await pipeline.evaluate({
  dataset,
  generateFn: async (input) => {
    // Your LLM generation function
    return await myAgent.run(input);
  },
});

console.log(results.summary);
// { passRate: 0.95, avgScore: 0.87, ... }

Built-in Metrics

AccuracyMetric: Exact, fuzzy, or semantic match against expected output
RelevanceMetric: How relevant the response is to the input
CoherenceMetric: Logical flow and consistency of the response
ToxicityMetric: Detection of harmful or inappropriate content
FaithfulnessMetric: Factual accuracy relative to provided context (RAG)
ContextRelevanceMetric: Relevance of retrieved context (RAG)
FluencyMetric: Grammar, spelling, and readability
ConcisenessMetric: Brevity without losing important information
HelpfulnessMetric: How helpful the response is to the user
SafetyMetric: Detection of unsafe or harmful outputs
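
For RAG evaluation, the context-aware metrics read retrieved passages from each item's context field. A minimal sketch, assuming FaithfulnessMetric and ContextRelevanceMetric need no constructor options and myRagAgent stands in for your own retrieval-augmented agent:

typescript
import {
  EvaluationPipeline,
  EvalDataset,
  FaithfulnessMetric,
  ContextRelevanceMetric,
} from '@lov3kaizen/agentsea-evaluate';

const ragPipeline = new EvaluationPipeline({
  metrics: [new FaithfulnessMetric(), new ContextRelevanceMetric()],
});

const ragDataset = new EvalDataset({
  items: [
    {
      id: '1',
      input: 'When was the Eiffel Tower completed?',
      expectedOutput: '1889',
      context: ['The Eiffel Tower was completed in 1889 for the World\'s Fair.'],
    },
  ],
});

const ragResults = await ragPipeline.evaluate({
  dataset: ragDataset,
  // myRagAgent is a placeholder for your own retrieval-augmented agent
  generateFn: async (input) => myRagAgent.run(input),
});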

Custom Metrics

typescript
import { BaseMetric, MetricResult, EvaluationInput } from '@lov3kaizen/agentsea-evaluate';

class CustomMetric extends BaseMetric {
  readonly type = 'custom';
  readonly name = 'my-metric';

  async evaluate(input: EvaluationInput): Promise<MetricResult> {
    // Your evaluation logic, e.g. a simple normalized exact-match check
    const score =
      input.output?.trim() === input.expectedOutput?.trim() ? 1 : 0;

    return {
      metric: this.name,
      score,
      explanation: `Score: ${score}`,
    };
  }
}
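
A custom metric is used like any built-in one; pass an instance to the pipeline (a sketch reusing the Quick Start setup):

typescript
const pipeline = new EvaluationPipeline({
  metrics: [new AccuracyMetric({ type: 'fuzzy' }), new CustomMetric()],
});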

LLM-as-Judge

Rubric-Based Evaluation

Use LLMs to evaluate responses with custom rubrics:

typescript
import { RubricJudge } from '@lov3kaizen/agentsea-evaluate';

const judge = new RubricJudge({
  provider: anthropicProvider,
  rubric: {
    criteria: 'Response Quality',
    levels: [
      { score: 1, description: 'Poor - Incorrect or irrelevant' },
      { score: 2, description: 'Fair - Partially correct' },
      { score: 3, description: 'Good - Correct but incomplete' },
      { score: 4, description: 'Very Good - Correct and complete' },
      { score: 5, description: 'Excellent - Correct, complete, and well-explained' },
    ],
  },
});

const result = await judge.evaluate({
  input: 'Explain quantum entanglement',
  output: response,
});

Comparative Evaluation

Compare two responses head-to-head:

typescript
import { ComparativeJudge } from '@lov3kaizen/agentsea-evaluate';

const judge = new ComparativeJudge({
  provider: openaiProvider,
  criteria: ['accuracy', 'helpfulness', 'clarity'],
});

const result = await judge.compare({
  input: 'Summarize this article',
  responseA: modelAOutput,
  responseB: modelBOutput,
});
// { winner: 'A', reasoning: '...', criteriaScores: {...} }

Human Feedback

Rating Collector

Collect ratings from human annotators:

typescript
import { RatingCollector } from '@lov3kaizen/agentsea-evaluate/feedback';

const collector = new RatingCollector({
  scale: 5,
  criteria: ['accuracy', 'helpfulness', 'clarity'],
});

// Collect feedback
await collector.collect({
  itemId: 'response-123',
  input: 'What is ML?',
  output: 'Machine Learning is...',
  annotatorId: 'user-1',
  ratings: {
    accuracy: 4,
    helpfulness: 5,
    clarity: 4,
  },
  comment: 'Good explanation',
});

// Get aggregated scores
const stats = collector.getStatistics('response-123');

Preference Collection

Collect A/B preferences for RLHF/DPO training:

typescript
import { PreferenceCollector } from '@lov3kaizen/agentsea-evaluate/feedback';

const collector = new PreferenceCollector();

// Collect A/B preferences
await collector.collect({
  input: 'Explain recursion',
  responseA: '...',
  responseB: '...',
  preference: 'A',
  annotatorId: 'user-1',
  reason: 'More concise explanation',
});

// Export for RLHF/DPO training
const dataset = collector.exportForDPO();
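
The exported preference data can then be written to disk for a training run. A minimal sketch that only assumes the export is JSON-serializable; the exact record shape is defined by the package:

typescript
import { writeFileSync } from 'node:fs';

// Persist the preference pairs for downstream DPO training
writeFileSync('dpo-dataset.json', JSON.stringify(dataset, null, 2));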

Datasets

Create Dataset

typescript
import { EvalDataset } from '@lov3kaizen/agentsea-evaluate/datasets';

const dataset = new EvalDataset({
  name: 'qa-benchmark',
  items: [
    {
      id: '1',
      input: 'Question 1',
      expectedOutput: 'Answer 1',
      context: ['Relevant context...'],
      tags: ['factual', 'science'],
    },
  ],
});

// Filter and sample
const subset = dataset
  .filter(item => item.tags?.includes('science'))
  .sample(100);

// Split for train/test
const [train, test] = dataset.split(0.8);

HuggingFace Integration

typescript
import { loadHuggingFaceDataset } from '@lov3kaizen/agentsea-evaluate/datasets';

const dataset = await loadHuggingFaceDataset('squad', {
  split: 'validation',
  inputField: 'question',
  outputField: 'answers.text[0]',
  contextField: 'context',
  limit: 1000,
});
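
A loaded dataset plugs into the same pipeline as a hand-built one (sketch; pipeline and myAgent are the objects from the Quick Start):

typescript
const results = await pipeline.evaluate({
  dataset,
  generateFn: async (input) => myAgent.run(input),
});

console.log(results.summary);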

Continuous Evaluation

Monitor production quality with automated evaluation pipelines:

typescript
import { ContinuousEvaluator } from '@lov3kaizen/agentsea-evaluate/continuous';

const evaluator = new ContinuousEvaluator({
  metrics: [accuracy, relevance, toxicity],
  sampleRate: 0.1, // Evaluate 10% of requests
  alertThresholds: {
    accuracy: 0.8,
    toxicity: 0.1,
  },
});

// Set up alerts
evaluator.on('alert', (alert) => {
  console.error(`Quality alert: ${alert.metric} below threshold`);
  notifyOncall(alert);
});

// Log production interactions
await evaluator.log({
  input: userQuery,
  output: agentResponse,
  expectedOutput: groundTruth, // Optional
});

API Reference

EvaluationPipeline

typescript
interface EvaluationPipelineConfig {
  metrics: MetricInterface[];
  llmJudge?: JudgeInterface;
  parallelism?: number;
  timeout?: number;
  retries?: number;
}

// Methods
pipeline.evaluate(options: PipelineEvaluationOptions): Promise<PipelineEvaluationResult>
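
For example, a fully configured pipeline might look like the sketch below; it assumes RubricJudge satisfies JudgeInterface and reuses the provider and rubric from the LLM-as-Judge section:

typescript
const pipeline = new EvaluationPipeline({
  metrics: [new AccuracyMetric({ type: 'fuzzy' }), new RelevanceMetric()],
  llmJudge: new RubricJudge({ provider: anthropicProvider, rubric }),
  parallelism: 10,
  timeout: 30_000, // per-item timeout; milliseconds assumed
  retries: 2,
});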

EvalDataset

typescript
interface EvalDatasetItem {
  id: string;
  input: string;
  expectedOutput?: string;
  context?: string[];
  reference?: string;
  metadata?: Record<string, unknown>;
  tags?: string[];
}

// Methods
dataset.getItems(): EvalDatasetItem[]
dataset.filter(predicate): EvalDataset
dataset.sample(count): EvalDataset
dataset.split(ratio): [EvalDataset, EvalDataset]

PipelineEvaluationResult

typescript
interface PipelineEvaluationResult {
  results: SingleEvaluationResult[];
  metrics: MetricsSummary;
  failures: FailureAnalysis[];
  summary: EvaluationSummary;
  exportJSON(): string;
  exportCSV(): string;
}
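
Results can be exported for reporting. A short sketch, assuming dataset and a generateFn as in the Quick Start:

typescript
import { writeFileSync } from 'node:fs';

const report = await pipeline.evaluate({ dataset, generateFn });

writeFileSync('eval-results.json', report.exportJSON());
writeFileSync('eval-results.csv', report.exportCSV());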

Next Steps