Tags: ai, test-data, generative-ai, quality-engineering, test-automation

Smart Test Data Generation Using AI: A Practical Guide

How to use AI and LLMs to generate comprehensive, realistic test data — covering synthetic data generation, edge case discovery, PII-safe test datasets, and practical code examples with Claude and open-source tools.

InnovateBits · 7 min read

Test data is one of the most underappreciated challenges in software testing. Your test suite is only as good as the data you run it against. Inadequate test data leads to coverage gaps, false confidence, and defects that only surface in production when real user data triggers edge cases you never considered.

AI makes comprehensive test data generation significantly easier. This guide covers how to use LLMs and AI tools to generate realistic, diverse, and edge-case-rich test data at scale.


The Test Data Problem

There are three common approaches to test data, each with significant problems:

Using production data is the most realistic option but comes with privacy and compliance risks. GDPR, HIPAA, and similar regulations restrict using real customer data for testing. Data breaches that expose test environments put real user data at risk.

Manually created test data is safe but limited. Engineers create a few representative examples and miss the long tail of edge cases that real users generate — unusual name formats, international characters, extreme field lengths, unusual date patterns.

Faker libraries (Faker.js, Python Faker) generate random realistic data but don't reason about domain-specific constraints or generate purposeful edge cases. They're great for volume data but poor for deliberate coverage.

AI-assisted test data generation combines the realism of production data with the safety of synthetic data, while adding the intelligence to generate edge cases you haven't thought of.


Using LLMs for Test Data Generation

Generating structured test datasets

LLMs excel at generating diverse, realistic structured data when given clear instructions:

import Anthropic from '@anthropic-ai/sdk';
 
const anthropic = new Anthropic();
 
async function generateUserTestData(count: number): Promise<User[]> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2000,
    messages: [{
      role: 'user',
      content: `Generate ${count} realistic but fictional user records for testing an e-commerce platform.
      
      Include diversity in:
      - Names (international, unusual characters, hyphenated, single names, very long names)
      - Email formats (subaddressing like user+tag@domain.com, different TLDs, edge cases)
      - Phone numbers (different countries, formats, with/without country codes)
      - Addresses (international, PO boxes, apartment formats, missing fields)
      
      Also include these specific edge cases:
      - A user with a name containing SQL injection attempt
      - A user with emoji in their display name
      - A user with maximum-length fields
      - A user with minimal required fields only
      
      Return ONLY a valid JSON array, no markdown, no explanation.
      Each object: { id, firstName, lastName, email, phone, address, createdAt }`
    }]
  });
 
  const text = response.content[0].type === 'text' ? response.content[0].text : '';
  return JSON.parse(text);
}

This generates test users that cover:

  • Normal cases (the happy path)
  • International variations (which manual data creation typically neglects)
  • Security edge cases (injection attempts)
  • Boundary cases (maximum lengths, minimum fields)
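One practical caveat: models sometimes wrap JSON in markdown fences despite the "no markdown" instruction, so it's worth parsing defensively rather than calling `JSON.parse` on the raw text. A minimal sketch (the helper name is ours, not part of any SDK):

```typescript
// Strip optional markdown code fences before parsing, and fail loudly
// if the result is not an array.
function parseJsonArray<T>(raw: string): T[] {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, '') // leading ```json fence, if any
    .replace(/```\s*$/, '');          // trailing fence, if any

  const parsed = JSON.parse(cleaned);
  if (!Array.isArray(parsed)) {
    throw new Error('Expected a JSON array from the model');
  }
  return parsed as T[];
}
```

Calling `parseJsonArray<User>(text)` instead of a bare `JSON.parse(text)` turns a malformed model response into a clear error instead of a confusing downstream failure.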

Generating edge case inputs for specific fields

For testing individual form fields or API parameters, prompt for targeted edge case generation:

async function generateEmailEdgeCases(): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1000,
    messages: [{
      role: 'user',
      content: `Generate a comprehensive list of email address edge cases for testing an email validation system.
      
      Include:
      - Valid emails that naive validators reject (valid per RFC 5321)
      - Invalid emails that naive validators accept  
      - Internationalized domain names
      - Subaddressing (user+tag@domain.com)
      - IP address domains
      - Very long local parts
      - Unicode in local part
      - Common typos (gmail.com → gmial.com)
      - Disposable email domains
      - Business email patterns
      
      Return ONLY a JSON array of strings. No explanation.`
    }]
  });
 
  return JSON.parse(response.content[0].type === 'text' ? response.content[0].text : '[]');
}
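Once generated (and ideally cached to a fixture file), the edge cases slot straight into a data-driven test loop. Here's a sketch against a deliberately naive validator; `validateEmail` is a stand-in for whatever validation logic you're testing, not part of any library:

```typescript
// A deliberately naive regex validator, the kind these edge cases should break
function validateEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// Data-driven loop over edge cases (in practice, loaded from a cached fixture)
const edgeCases: string[] = [
  'user+tag@example.com',        // subaddressing: valid, should be accepted
  '"quoted user"@example.com',   // valid per RFC 5321, but the naive regex rejects it
  'missing-at.example.com',      // clearly invalid, correctly rejected
];

for (const email of edgeCases) {
  console.log(`${email} -> ${validateEmail(email)}`);
}
```

The value of the LLM-generated list is exactly the second kind of case: inputs that are valid per the spec but rejected by naive implementations, which hand-written test data rarely includes.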

Domain-specific test data

For domain-specific testing (healthcare, finance, e-commerce), LLMs understand domain context:

async function generateProductCatalogData() {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 3000,
    messages: [{
      role: 'user',
      content: `Generate test product data for an e-commerce platform. Include edge cases that test:
      
      1. Pricing edge cases: $0.00 products, very high prices ($99,999.99), prices with many decimal places
      2. Inventory edge cases: 0 stock, negative stock (pre-order), very large inventory (10000+)  
      3. Name edge cases: very short (1 char), very long (200+ chars), special characters, unicode
      4. Category edge cases: uncategorised products, multi-category products
      5. Image edge cases: no image, multiple images, broken image URL
      6. Description edge cases: empty, very long (5000+ chars), HTML in description
      
      Return ONLY valid JSON array. Schema: { id, name, price, stock, category, description, imageUrl }`
    }]
  });
 
  return JSON.parse(response.content[0].type === 'text' ? response.content[0].text : '[]');
}
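Because models can drift from the requested schema, it's prudent to validate each record before tests consume it. A minimal hand-rolled type guard (you could equally use a schema library such as Zod; the shape below mirrors the schema requested in the prompt):

```typescript
interface Product {
  id: string;
  name: string;
  price: number;
  stock: number;
  category: string;
  description: string;
  imageUrl: string | null;
}

// Reject any record that doesn't match the requested shape, so a malformed
// LLM response fails fast instead of silently corrupting test runs.
function isValidProduct(record: unknown): record is Product {
  if (typeof record !== 'object' || record === null) return false;
  const r = record as Record<string, unknown>;
  return (
    typeof r.id === 'string' &&
    typeof r.name === 'string' &&
    typeof r.price === 'number' && Number.isFinite(r.price) &&
    typeof r.stock === 'number' &&
    typeof r.category === 'string' &&
    typeof r.description === 'string' &&
    (typeof r.imageUrl === 'string' || r.imageUrl === null)
  );
}
```

Filtering the parsed array through `isValidProduct` (and logging rejects for review) keeps one bad record from failing an entire suite.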

Open Source Tools for AI-Assisted Test Data

Faker.js with AI augmentation

Faker.js generates basic realistic data. Combine it with an LLM for edge case layers:

import { faker } from '@faker-js/faker';
 
// Faker for volume data
function generateBulkUsers(count: number) {
  return Array.from({ length: count }, () => ({
    id: faker.string.uuid(),
    name: faker.person.fullName(),
    email: faker.internet.email(),
    phone: faker.phone.number(),
    address: faker.location.streetAddress(),
  }));
}
 
// LLM for targeted edge cases — combine both
async function generateCompleteTestDataset(bulkCount: number) {
  const bulkData = generateBulkUsers(bulkCount);
  const edgeCases = await generateUserTestData(20); // LLM-generated edge cases
  
  return [...bulkData, ...edgeCases];
}

Mockaroo

Mockaroo is a web-based tool for generating realistic CSV/JSON test data with custom schemas. While not AI-powered in the LLM sense, it's excellent for generating large volumes of relational test data with consistent referential integrity.

Gretel.ai

Gretel generates synthetic data that statistically mimics real production data without containing actual personal information. This is particularly valuable when you need realistic data distributions (which production data has) without the privacy risk.


PII-Safe Synthetic Data from Production

A common need: you want test data that reflects the real distributions and patterns in your production database, but without containing actual PII.

The workflow:

  1. Extract a sample of production data
  2. Use an anonymisation tool or LLM to synthesise statistically similar but entirely fictional data
  3. Use the synthetic dataset for testing

# Example: anonymise a production user sample
import json
import anthropic
 
def anonymise_user_sample(production_users: list) -> list:
    """
    Takes real user records and generates synthetic equivalents
    with similar patterns but no real PII.
    """
    client = anthropic.Anthropic()
    
    # Show the LLM a small sample so it can infer the patterns.
    # Caution: this sends real values to the API; mask PII locally first,
    # or confirm your data-processing agreement allows it.
    sample = production_users[:3]
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=3000,
        messages=[{
            "role": "user",
            "content": f"""Given this sample of user records (patterns only, ignore the actual values):
            {sample}
            
            Generate {len(production_users)} synthetic user records that:
            - Follow the same patterns and formats as the sample
            - Contain ONLY fictional names, emails, addresses (no real people)
            - Maintain similar data distributions (name length, email domain distribution, etc.)
            - Include the same fields as the original
            
            Return ONLY a JSON array. No explanation."""
        }]
    )
    
    return json.loads(response.content[0].text)
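If you'd rather not send real records to the API at all, you can mask values locally first, so the model sees only the field layout and formats. A TypeScript sketch in the same style as the earlier examples (the field names are illustrative, not a fixed schema):

```typescript
// Replace sensitive values with format-preserving placeholders so only
// the shape of the data (fields, punctuation, layout) reaches the API.
function maskUserRecord(
  user: { name: string; email: string; phone: string },
  index: number
) {
  return {
    name: `Person-${index}`,               // fictional name
    email: `user${index}@example.com`,     // keeps the email format only
    phone: user.phone.replace(/\d/g, '0'), // keeps separators and layout
  };
}
```

Running the production sample through `maskUserRecord` before interpolating it into the prompt preserves the pattern information the LLM needs while keeping real PII out of the request entirely.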

Integrating AI-Generated Data into Your Test Suite

The most practical integration pattern is a data factory that uses AI for edge case generation and Faker for volume:

// test/factories/dataFactory.ts
export class DataFactory {
  private aiClient: Anthropic;
  
  constructor() {
    this.aiClient = new Anthropic();
  }
  
  // Fast: Faker for standard data
  user(overrides: Partial<User> = {}): User {
    return {
      id: faker.string.uuid(),
      email: faker.internet.email(),
      name: faker.person.fullName(),
      ...overrides,
    };
  }
  
  // AI-powered: for edge case coverage
  async userEdgeCases(): Promise<User[]> {
    const cacheKey = 'user-edge-cases';
    // Cache AI-generated data between test runs to avoid repeated API calls
    const cached = this.cache.get(cacheKey);
    if (cached) return cached;
    
    // generateWithAI wraps an LLM call such as generateUserTestData above
    const data = await this.generateWithAI('users', 20);
    this.cache.set(cacheKey, data);
    return data;
  }
}

Cache AI-generated test data between runs — you don't need to regenerate edge cases on every test run, and caching reduces API costs and latency.


Key Principles

Generate for coverage, not just volume. A thousand variations of normal inputs are less valuable than 20 carefully chosen edge cases. Use AI to find the edges.

Review AI-generated data. LLMs can generate plausible-looking but logically invalid data. Review a sample before wiring it into your test suites.

Version control your generated datasets. Save AI-generated test datasets in your repository so tests are reproducible without making API calls on every run.
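One way to make that concrete: load a committed fixture if it exists, and only call the generator (and the API behind it) when it doesn't. A Node sketch, with the fixture path as an illustrative convention:

```typescript
import { existsSync, readFileSync, writeFileSync, mkdirSync } from 'node:fs';
import { dirname } from 'node:path';

// Load a committed fixture if present; otherwise generate once and save it,
// so subsequent test runs are reproducible and make no API calls.
async function loadOrGenerate<T>(
  fixturePath: string,
  generate: () => Promise<T[]>
): Promise<T[]> {
  if (existsSync(fixturePath)) {
    return JSON.parse(readFileSync(fixturePath, 'utf8'));
  }
  const data = await generate();
  mkdirSync(dirname(fixturePath), { recursive: true });
  writeFileSync(fixturePath, JSON.stringify(data, null, 2));
  return data;
}
```

Commit the generated file (e.g. under `test/fixtures/`) and regenerate it deliberately when you want fresh edge cases, rather than implicitly on every run.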

Keep PII out of test environments. Even if the data looks realistic, ensure it's provably synthetic. Don't use real names, real addresses, or real phone numbers — even if obfuscated.

For more on the broader AI-in-testing landscape, see our Implementing AI in Software Testing guide. For test data management strategies within a Playwright test suite, see our API Testing Guide.