AI · llm · ai-testing · quality-engineering · generative-ai · test-automation

LLM Testing: How to Test AI Applications and Language Model Outputs

A practical guide to testing applications that use Large Language Models — covering evaluation strategies, prompt regression testing, hallucination detection, latency and cost testing, and the tools QA engineers need to build reliable AI product quality.

InnovateBits · 9 min read

Testing an application that uses a Large Language Model (LLM) is fundamentally different from testing a traditional deterministic system. When a user submits a form and you expect a 200 response with specific JSON — that's easy to assert. When a user asks a question and your application calls GPT-4 or Claude for an answer — the output is probabilistic, context-sensitive, and can change with every model update.

This guide covers the strategies, tools, and techniques QA engineers need to build quality practices for AI-powered applications.


Why LLM Testing Is Different

Traditional software testing rests on a principle: given the same inputs, the system always produces the same outputs. Tests are deterministic. Pass/fail is binary.

LLM-based applications break this in several ways:

Non-determinism — even with temperature: 0, model outputs can vary between calls, API versions, and model updates. A test that asserts output === "The capital of France is Paris" is fragile; a test that asserts output.includes("Paris") is better, but still brittle.

Soft failures — an LLM response can be technically correct (no exception, valid JSON, non-empty string) but wrong in meaning (hallucinated facts, misunderstood intent, inappropriate tone). Traditional assertions don't catch this.
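To make this concrete, here is a minimal sketch (the SupportReply shape and policy detail are hypothetical) of hard checks passing on an answer that is structurally valid but factually wrong:

```typescript
// Hard checks: these catch structural failures only.
interface SupportReply { answer: string; sources: string[] }

function passesHardChecks(raw: string): boolean {
  try {
    const parsed = JSON.parse(raw) as SupportReply;
    return typeof parsed.answer === 'string' && parsed.answer.length > 0;
  } catch {
    return false; // invalid JSON is the only failure mode visible here
  }
}

// A confidently wrong answer sails through:
const hallucinated = JSON.stringify({
  answer: 'Refunds are available for 90 days.', // hypothetical: actual policy is 30 days
  sources: []
});
console.log(passesHardChecks(hallucinated)); // true — structurally valid, semantically wrong
```

This is exactly the gap the evaluation layers below exist to close.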

Moving targets — model providers update their models. The GPT-4-turbo you call in June 2025 is not the GPT-4-turbo you'll call in January 2026. Behaviour can change without any code change on your side.

Context sensitivity — the same prompt can produce different quality outputs depending on the conversation history, system prompt, retrieved documents, and user context.


The LLM Testing Stack

LLM quality requires multiple testing layers working together:

Layer 1: Unit tests (prompt evaluation)
         ↓
Layer 2: Integration tests (full pipeline)
         ↓  
Layer 3: Evaluation suite (LLM-as-judge)
         ↓
Layer 4: Production monitoring (real user interactions)

Layer 1: Prompt Unit Tests

Test individual prompts and prompt templates in isolation. The goal is to catch prompt regressions — changes in prompt wording that degrade output quality.

Tooling: Promptfoo is the most widely used open-source prompt testing framework.

# promptfoo config: promptfooconfig.yaml
prompts:
  - "Summarise the following customer review in one sentence: {{review}}"
  - "You are a helpful assistant. Summarise this review briefly: {{review}}"
 
providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-20250514
 
tests:
  - vars:
      review: "Absolutely loved the product! Fast shipping, great quality, will buy again."
    assert:
      - type: contains
        value: "positive"
      - type: javascript
        value: output.length < 200
      - type: llm-rubric
        value: "The summary should be accurate and capture the positive sentiment"
 
  - vars:
      review: "Terrible experience. Item arrived broken, support was unhelpful."
    assert:
      - type: contains-any
        value: ["negative", "broken", "disappointed"]
      - type: llm-rubric
        value: "The summary should reflect the negative experience accurately"

Run with:

npx promptfoo eval
npx promptfoo view  # Opens a visual comparison dashboard

Layer 2: Integration Tests

Test the full pipeline from user input to final output — including retrieval (RAG), tool calls, and post-processing.

import { test, expect } from '@playwright/test';
import Anthropic from '@anthropic-ai/sdk';
 
const client = new Anthropic();
 
// Test: RAG pipeline returns relevant answers
test('support chatbot answers product questions accurately', async () => {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 500,
    system: `You are a product support assistant for InnovateBits. 
             Use only the provided documentation to answer questions.
             If you don't know, say "I don't have that information."`,
    messages: [{
      role: 'user',
      content: 'What is your refund policy?'
    }]
  });
 
  const output = response.content[0].type === 'text' ? response.content[0].text : '';
  
  // Soft assertions: check for expected themes, not exact words
  expect(output).not.toContain("I don't have that information");
  expect(output.toLowerCase()).toMatch(/refund|return|policy/);
  expect(output.length).toBeGreaterThan(50);
});
 
// Test: Tool calling works correctly
test('assistant correctly calls the search tool', async () => {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 500,
    tools: [{
      name: 'search_products',
      description: 'Search the product catalogue',
      input_schema: {
        type: 'object',
        properties: {
          query: { type: 'string' }
        },
        required: ['query']
      }
    }],
    messages: [{
      role: 'user',
      content: 'Find me wireless headphones under $100'
    }]
  });
 
  // Verify the model chose to use the tool
  const toolUse = response.content.find(b => b.type === 'tool_use');
  expect(toolUse).toBeDefined();
  expect(toolUse?.name).toBe('search_products');
  
  const input = toolUse?.input as { query: string };
  expect(input.query.toLowerCase()).toMatch(/headphone|wireless/);
});

Layer 3: LLM-as-Judge Evaluation

For output quality that can't be asserted programmatically, use another LLM as an evaluator. This is the most scalable way to assess nuanced qualities like tone, helpfulness, and factual accuracy.

async function evaluateResponse(
  question: string,
  response: string,
  criteria: string
): Promise<{ score: number; reasoning: string }> {
  const evaluation = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 300,
    messages: [{
      role: 'user',
      content: `Evaluate this AI response on a scale of 1-5.
 
Question: ${question}
Response: ${response}
Criteria: ${criteria}
 
Return JSON only: {"score": <1-5>, "reasoning": "<brief explanation>"}`
    }]
  });
 
  const text = evaluation.content[0].type === 'text' ? evaluation.content[0].text : '';
  // Tolerate judges that wrap the JSON in prose or code fences
  const match = text.match(/\{[\s\S]*\}/);
  if (!match) throw new Error(`Judge returned no JSON: ${text}`);
  return JSON.parse(match[0]);
}
 
// Use in your test suite
test('customer service responses are empathetic and helpful', async () => {
  const testCases = [
    { question: "My order is late, I'm frustrated", minScore: 4 },
    { question: "How do I return a product?", minScore: 4 },
    { question: "This is the worst company ever", minScore: 3 },
  ];
 
  for (const { question, minScore } of testCases) {
    const response = await getCustomerServiceResponse(question);
    const evaluation = await evaluateResponse(
      question,
      response,
      "Response should be empathetic, professional, and provide actionable help"
    );
    
    expect(evaluation.score).toBeGreaterThanOrEqual(minScore);
  }
});

Layer 4: Production Monitoring

Log all LLM interactions in production and sample them for quality review.

What to monitor:

  • Latency — p50, p95, p99 response times by model and prompt type
  • Error rate — API failures, timeout rates, content policy rejections
  • Output quality — automated scoring on a sample of interactions
  • Cost — token usage by feature, total daily spend
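The latency percentiles above can be computed from logged samples with a simple nearest-rank sketch (the sample values are illustrative):

```typescript
// Nearest-rank percentile over a batch of logged latencies (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// One slow outlier (2900 ms) barely moves p50 but dominates p95/p99 —
// which is why averages alone hide tail-latency problems.
const latencies = [420, 380, 510, 2900, 460, 395, 610, 450, 500, 430];
console.log({
  p50: percentile(latencies, 50), // 450
  p95: percentile(latencies, 95), // 2900
  p99: percentile(latencies, 99), // 2900
});
```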

Tooling options:

  • LangSmith (by LangChain) — full observability for LLM chains
  • Braintrust — evaluation, logging, and regression testing in one platform
  • Arize AI — enterprise ML observability including LLM monitoring
  • Weights & Biases — experiment tracking that extends to production monitoring

Testing for Hallucinations

Hallucination detection is one of the harder problems in LLM testing. A hallucination is when an LLM states something confidently that is factually wrong. For applications where accuracy matters (customer support, medical information, legal content), this is a critical quality dimension.

Fact-grounded applications (RAG)

If your application uses Retrieval-Augmented Generation (RAG) — generating answers from a retrieved document set — you can test faithfulness:

async function testRAGFaithfulness(
  question: string,
  retrievedContext: string,
  generatedAnswer: string
): Promise<boolean> {
  const evaluation = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 100,
    messages: [{
      role: 'user',
      content: `Does this answer contain only information from the provided context?
      
Context: ${retrievedContext}
Answer: ${generatedAnswer}
 
Reply with only "YES" or "NO".`
    }]
  });
 
  const result = evaluation.content[0].type === 'text' ? evaluation.content[0].text.trim() : '';
  return result.toUpperCase().startsWith('YES'); // tolerate "YES." or lowercase replies
}
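As a cheap first-pass filter before spending judge calls, you can flag answers with low lexical overlap against the retrieved context — a rough heuristic, not a replacement for the faithfulness check above:

```typescript
// Fraction of answer words (4+ letters) that also appear in the context.
// Low scores suggest the answer draws on information outside the context.
function lexicalGrounding(context: string, answer: string): number {
  const tokens = (s: string) => s.toLowerCase().match(/[a-z]{4,}/g) ?? [];
  const contextSet = new Set(tokens(context));
  const answerTokens = tokens(answer);
  if (answerTokens.length === 0) return 1;
  const grounded = answerTokens.filter(t => contextSet.has(t)).length;
  return grounded / answerTokens.length;
}

const context = 'Refunds are accepted within 30 days of purchase with a valid receipt.';
console.log(lexicalGrounding(context, 'Refunds accepted within 30 days.'));     // 1
console.log(lexicalGrounding(context, 'We offer lifetime warranty coverage.')); // 0
```

Route only the low-scoring answers to the LLM judge to keep evaluation costs down.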

Automated hallucination benchmarks

For open-domain QA applications, use established benchmarks like TruthfulQA to evaluate your system's overall hallucination rate on known problematic question types.


Regression Testing for Model Updates

Model providers update their models, and your application's behaviour can change without any code change. This requires a regression strategy specifically for model updates.

Build a golden dataset:

A golden dataset is a set of (input, expected output characteristic) pairs that define the expected behaviour of your application:

const goldenDataset = [
  {
    input: "Summarise our Q4 performance in one sentence",
    expectations: {
      maxLength: 150,
      mustContain: ["Q4"],
      tone: "professional",
      mustNotContain: ["I don't know", "I cannot"]
    }
  },
  // ... 50-200 more examples covering your key use cases
];

Run this dataset against the model before and after any model version change. If more than 5% of cases degrade, investigate before deploying the update.
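A sketch of gating on that threshold, using a checker shaped to match the dataset above (a soft field like tone would still need an LLM judge rather than a string check):

```typescript
interface Expectations {
  maxLength?: number;
  mustContain?: string[];
  mustNotContain?: string[];
}

// True if the output satisfies every stated expectation.
function meetsExpectations(output: string, exp: Expectations): boolean {
  if (exp.maxLength !== undefined && output.length > exp.maxLength) return false;
  if (exp.mustContain?.some(s => !output.includes(s))) return false;
  if (exp.mustNotContain?.some(s => output.includes(s))) return false;
  return true;
}

// Fraction of golden cases the candidate model's outputs fail.
function degradationRate(outputs: string[], expectations: Expectations[]): number {
  const failures = outputs.filter((o, i) => !meetsExpectations(o, expectations[i])).length;
  return failures / outputs.length;
}

const rate = degradationRate(
  ['Q4 revenue grew 12% on strong enterprise demand.', "I don't know."],
  [{ maxLength: 150, mustContain: ['Q4'] }, { mustNotContain: ["I don't know"] }]
);
console.log(rate); // 0.5 — one of two cases failed; above the 5% gate, so investigate
```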

Canary deployment for model changes:

Treat model version updates like code deployments — test in staging, deploy to 10% of traffic, monitor quality metrics, then roll out fully.
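One minimal way to implement the 10% split is deterministic user bucketing, so each user consistently hits the same model version and per-user quality metrics stay comparable (the hash scheme here is illustrative):

```typescript
// Deterministic bucketing: the same userId always lands in the same bucket,
// so a user never flips between model versions mid-session.
function modelForUser(userId: string, canaryPercent: number): 'stable' | 'canary' {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100 < canaryPercent ? 'canary' : 'stable';
}

// Ramp by raising canaryPercent: 0 → 10 → 50 → 100 as quality metrics hold.
console.log(modelForUser('user-1042', 10));
```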


Latency and Cost Testing

Performance testing for LLM applications requires tracking metrics that don't exist in traditional performance testing.

Key metrics:

  • Time to first token (TTFT) — how long before streaming output begins. Critical for conversational UX.
  • Tokens per second — streaming throughput
  • Total response latency — end-to-end including prompt processing
  • Cost per request — input tokens × price + output tokens × price

Measuring these with the streaming API:

async function measureLLMPerformance(prompt: string) {
  const startTime = Date.now();
  let firstTokenTime: number | null = null;
  let totalTokens = 0;
 
  const stream = await client.messages.stream({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 500,
    messages: [{ role: 'user', content: prompt }]
  });
 
  for await (const event of stream) {
    if (event.type === 'content_block_delta' && firstTokenTime === null) {
      firstTokenTime = Date.now();
    }
  }
 
  const finalMessage = await stream.finalMessage();
  const endTime = Date.now();
 
  return {
    ttft: firstTokenTime ? firstTokenTime - startTime : null,
    totalLatency: endTime - startTime,
    inputTokens: finalMessage.usage.input_tokens,
    outputTokens: finalMessage.usage.output_tokens,
    estimatedCost: (finalMessage.usage.input_tokens * 0.000003) +   // $3 per million input tokens
                   (finalMessage.usage.output_tokens * 0.000015)    // $15 per million output tokens
  };
}

Building an LLM Test Suite: Where to Start

If you're starting from zero, this is the order to tackle it:

Week 1: Prompt regression baseline. Set up Promptfoo with 20–30 test cases covering your most critical prompts. Run it in CI on every code change.

Week 2: Integration test coverage. Write integration tests for your most business-critical LLM pipeline paths. Focus on tool calling, RAG faithfulness, and error handling.

Week 3: Production logging. Instrument your LLM calls to log inputs, outputs, latency, and token usage. You can't improve what you can't see.

Week 4: LLM-as-judge evaluation. Build an evaluation harness for your top 5 failure modes (hallucination, unhelpful responses, wrong tone, missed tool calls, etc.). Run weekly against a sample of production interactions.


The QA Engineer's Role in AI Product Quality

LLM testing is a new skill set, but it builds on skills QA engineers already have: designing test cases, thinking about edge cases, building evaluation infrastructure, and making risk judgments about quality.

What's new: the probabilistic nature of outputs, the need for LLM-as-judge evaluation, and the importance of production monitoring as a first-class testing activity.

QA engineers who develop LLM testing expertise are positioning themselves for roles that barely existed two years ago and are now in high demand at companies building AI products.

For more on the AI quality engineering landscape, see our AI Testing Trends and Smart Test Data Generation guides.