Skip to content

Structured Data Extractor

Extract structured JSON from unstructured text using LLM tool calling.

Features

  • Schema Definition - Define extraction schemas using JSON Schema format
  • Tool-Based Extraction - LLM extracts data via tool calls (not string parsing)
  • Validation with Retry - Automatic retry on validation failure with error feedback
  • Multiple Data Types - Works with contacts, invoices, events, or any schema
  • Type-Safe Results - TypeScript generics for type-safe extraction

Quick Start

bash
export OPENAI_API_KEY=your_key
npm run recipe:structured-extractor

How It Works

This recipe extracts structured JSON data from unstructured text using LLM tool calling, with automatic validation and retry on failure.

Supported extractions:

  • Contacts (name, email, phone, company)
  • Invoices (items, totals, dates)
  • Events (title, date, location, attendees)
  • Any custom JSON Schema

Flow:

  1. Define a JSON Schema for the data you want to extract
  2. Register a tool with that schema
  3. Send unstructured text to the LLM
  4. LLM calls the tool with extracted data
  5. Validate the result against the schema
  6. If invalid, retry with error feedback
  7. Return typed, validated data

Example Output

╭─────────────────────────────────────────────╮
│  Structured Data Extractor                  │
│  JSON from Unstructured Text                │
╰─────────────────────────────────────────────╯

Using: openai/gpt-4o-mini

━━━ Contact Extraction ━━━

Input text:
Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
She's the Director of Engineering at TechCorp Inc. You can reach her
at sarah.chen@techcorp.io or call her office at (555) 123-4567.

ℹ Attempt 1/3: Extracting contact...
✓ Extraction successful on attempt 1

Extracted Contact:
{
  "name": "Sarah Chen",
  "email": "sarah.chen@techcorp.io",
  "phone": "(555) 123-4567",
  "company": "TechCorp Inc.",
  "role": "Director of Engineering"
}
Attempts: 1

━━━ Invoice Extraction ━━━

Input text:
INVOICE #INV-2024-0042
Date: January 3, 2026
From: Acme Supplies Ltd.

Items:
- Widget Pro (x5) @ $29.99 each = $149.95
- Super Gadget (x2) @ $89.50 each = $179.00
- Premium Cable (x10) @ $12.00 each = $120.00

Subtotal: $448.95
Tax (8%): $35.92
TOTAL DUE: $484.87

ℹ Attempt 1/3: Extracting invoice...
✓ Extraction successful on attempt 1

Extracted Invoice:
{
  "invoiceNumber": "INV-2024-0042",
  "date": "2026-01-03",
  "vendor": "Acme Supplies Ltd.",
  "items": [
    { "description": "Widget Pro", "quantity": 5, "unitPrice": 29.99, "total": 149.95 },
    { "description": "Super Gadget", "quantity": 2, "unitPrice": 89.5, "total": 179 },
    { "description": "Premium Cable", "quantity": 10, "unitPrice": 12, "total": 120 }
  ],
  "subtotal": 448.95,
  "tax": 35.92,
  "total": 484.87
}
Attempts: 1

Code Walkthrough

Define Your Schema

Use JSON Schema format to define what data you want to extract:

typescript
/**
 * Structured Data Extractor Library
 *
 * Exported functions for the Structured Data Extractor recipe.
 * Snippet markers allow VitePress to extract code for documentation.
 */

import { ChatClient, ToolRegistry } from '../../../src';

// [start:colors]
export const colors = {
  reset: '\x1b[0m',
  dim: '\x1b[2m',
  bold: '\x1b[1m',
  red: '\x1b[31m',
  yellow: '\x1b[33m',
  green: '\x1b[32m',
  cyan: '\x1b[36m',
  magenta: '\x1b[35m'
};
// [end:colors]

// [start:types]
export interface JSONSchema {
  type: string;
  properties?: Record<string, JSONSchema & { description?: string }>;
  items?: JSONSchema;
  required?: string[];
  description?: string;
  enum?: string[];
  format?: string;
  minimum?: number;
  maximum?: number;
}

export interface ExtractionResult<T> {
  data: T;
  attempts: number;
  raw?: unknown;
}

export interface ValidationError {
  field: string;
  message: string;
}

export interface ExtractorConfig {
  provider: 'openai' | 'anthropic' | 'ollama';
  model: string;
  apiKey?: string;
  baseURL?: string;
}
// [end:types]

// [start:logging]
export function printBanner(): void {
  console.log(`
${colors.cyan}╭─────────────────────────────────────────────╮
│  ${colors.reset}${colors.bold}Structured Data Extractor${colors.reset}${colors.cyan}                  │
│  ${colors.dim}JSON from Unstructured Text${colors.reset}${colors.cyan}                 │
╰─────────────────────────────────────────────╯${colors.reset}
`);
}

export function log(level: 'info' | 'warn' | 'error' | 'success', message: string): void {
  const prefix = {
    info: `${colors.cyan}ℹ${colors.reset}`,
    warn: `${colors.yellow}⚠${colors.reset}`,
    error: `${colors.red}✗${colors.reset}`,
    success: `${colors.green}✓${colors.reset}`
  };
  console.log(`${prefix[level]} ${message}`);
}
// [end:logging]

// [start:validation]
/**
 * Simple JSON Schema validator
 * Returns array of validation errors (empty if valid)
 */
export function validateAgainstSchema(
  data: unknown,
  schema: JSONSchema,
  path = ''
): ValidationError[] {
  const errors: ValidationError[] = [];

  if (data === null || data === undefined) {
    if (schema.required?.length) {
      errors.push({ field: path || 'root', message: 'Value is required' });
    }
    return errors;
  }

  // Type checking
  if (schema.type === 'object' && typeof data === 'object' && !Array.isArray(data)) {
    const obj = data as Record<string, unknown>;

    // Check required fields
    for (const field of schema.required || []) {
      if (!(field in obj) || obj[field] === null || obj[field] === undefined) {
        errors.push({
          field: path ? `${path}.${field}` : field,
          message: 'Required field missing'
        });
      }
    }

    // Validate properties
    if (schema.properties) {
      for (const [key, propSchema] of Object.entries(schema.properties)) {
        if (key in obj) {
          errors.push(
            ...validateAgainstSchema(obj[key], propSchema, path ? `${path}.${key}` : key)
          );
        }
      }
    }
  } else if (schema.type === 'array' && Array.isArray(data)) {
    if (schema.items) {
      data.forEach((item, index) => {
        errors.push(...validateAgainstSchema(item, schema.items!, `${path}[${index}]`));
      });
    }
  } else if (schema.type === 'string') {
    if (typeof data !== 'string') {
      errors.push({ field: path, message: `Expected string, got ${typeof data}` });
    } else if (schema.format === 'email' && !data.includes('@')) {
      errors.push({ field: path, message: 'Invalid email format' });
    } else if (schema.format === 'date' && isNaN(Date.parse(data))) {
      errors.push({ field: path, message: 'Invalid date format' });
    }
  } else if (schema.type === 'number') {
    if (typeof data !== 'number') {
      errors.push({ field: path, message: `Expected number, got ${typeof data}` });
    } else {
      if (schema.minimum !== undefined && data < schema.minimum) {
        errors.push({ field: path, message: `Value must be >= ${schema.minimum}` });
      }
      if (schema.maximum !== undefined && data > schema.maximum) {
        errors.push({ field: path, message: `Value must be <= ${schema.maximum}` });
      }
    }
  } else if (schema.type === 'boolean' && typeof data !== 'boolean') {
    errors.push({ field: path, message: `Expected boolean, got ${typeof data}` });
  }

  // Enum validation
  if (schema.enum && !schema.enum.includes(data as string)) {
    errors.push({ field: path, message: `Value must be one of: ${schema.enum.join(', ')}` });
  }

  return errors;
}
// [end:validation]

// [start:extract-data]
/**
 * Extract structured data from text using LLM tool calling
 */
export async function extractData<T>(
  config: ExtractorConfig,
  text: string,
  schema: JSONSchema,
  schemaName: string,
  maxRetries = 3
): Promise<ExtractionResult<T>> {
  let attempts = 0;
  let lastErrors: ValidationError[] = [];
  let extractedData: unknown = null;

  while (attempts < maxRetries) {
    attempts++;

    // Create tool registry with extraction tool
    const registry = new ToolRegistry();

    registry.registerTool(
      'extract_data',
      async (args: Record<string, unknown>) => {
        extractedData = args;
        return { success: true, data: args };
      },
      {
        description: `Extract ${schemaName} from the provided text. Call this tool with the extracted data.`,
        parameters: schema
      }
    );

    // Build prompt
    let prompt = `Extract the following information from the text below and call the extract_data tool with the result.

Text:
"""
${text}
"""

Extract all ${schemaName} information you can find. For optional fields, only include them if the information is clearly present.`;

    // Add error feedback for retries
    if (lastErrors.length > 0) {
      prompt += `\n\nPrevious extraction had these validation errors, please fix them:
${lastErrors.map((e) => `- ${e.field}: ${e.message}`).join('\n')}`;
    }

    log('info', `Attempt ${attempts}/${maxRetries}: Extracting ${schemaName}...`);

    // Create extraction client with tool
    const extractionClient = new ChatClient({
      provider: config.provider,
      model: config.model,
      apiKey: config.apiKey,
      tools: registry
    });

    try {
      await extractionClient.chat(prompt);

      // Validate extracted data
      if (extractedData) {
        const errors = validateAgainstSchema(extractedData, schema);

        if (errors.length === 0) {
          log('success', `Extraction successful on attempt ${attempts}`);
          return {
            data: extractedData as T,
            attempts,
            raw: extractedData
          };
        }

        lastErrors = errors;
        log('warn', `Validation failed: ${errors.map((e) => e.message).join(', ')}`);
      } else {
        lastErrors = [{ field: 'root', message: 'No data extracted' }];
        log('warn', 'No data was extracted');
      }
    } catch (error) {
      log('error', `Extraction error: ${error instanceof Error ? error.message : String(error)}`);
      lastErrors = [{ field: 'root', message: String(error) }];
    }
  }

  throw new Error(
    `Failed to extract valid ${schemaName} after ${maxRetries} attempts. Last errors: ${lastErrors.map((e) => `${e.field}: ${e.message}`).join(', ')}`
  );
}
// [end:extract-data]

// [start:contact-schema]
export const contactSchema: JSONSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    email: { type: 'string', format: 'email', description: 'Email address' },
    phone: { type: 'string', description: 'Phone number' },
    company: { type: 'string', description: 'Company or organization' },
    role: { type: 'string', description: 'Job title or role' }
  },
  required: ['name', 'email']
};

export interface Contact {
  name: string;
  email: string;
  phone?: string;
  company?: string;
  role?: string;
}
// [end:contact-schema]

// [start:invoice-schema]
export const invoiceSchema: JSONSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string', description: 'Invoice or reference number' },
    date: { type: 'string', format: 'date', description: 'Invoice date (YYYY-MM-DD)' },
    vendor: { type: 'string', description: 'Vendor or seller name' },
    items: {
      type: 'array',
      description: 'Line items on the invoice',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string', description: 'Item description' },
          quantity: { type: 'number', minimum: 0, description: 'Quantity' },
          unitPrice: { type: 'number', minimum: 0, description: 'Price per unit' },
          total: { type: 'number', minimum: 0, description: 'Line total' }
        },
        required: ['description', 'quantity', 'unitPrice']
      }
    },
    subtotal: { type: 'number', minimum: 0, description: 'Subtotal before tax' },
    tax: { type: 'number', minimum: 0, description: 'Tax amount' },
    total: { type: 'number', minimum: 0, description: 'Total amount due' }
  },
  required: ['invoiceNumber', 'date', 'items', 'total']
};

export interface Invoice {
  invoiceNumber: string;
  date: string;
  vendor?: string;
  items: Array<{
    description: string;
    quantity: number;
    unitPrice: number;
    total?: number;
  }>;
  subtotal?: number;
  tax?: number;
  total: number;
}
// [end:invoice-schema]

// [start:event-schema]
export const eventSchema: JSONSchema = {
  type: 'object',
  properties: {
    title: { type: 'string', description: 'Event title or name' },
    date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
    time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
    location: { type: 'string', description: 'Event location or venue' },
    description: { type: 'string', description: 'Event description' },
    organizer: { type: 'string', description: 'Event organizer' },
    attendees: {
      type: 'array',
      description: 'List of attendees',
      items: { type: 'string' }
    }
  },
  required: ['title', 'date']
};

export interface CalendarEvent {
  title: string;
  date: string;
  time?: string;
  location?: string;
  description?: string;
  organizer?: string;
  attendees?: string[];
}
// [end:event-schema]

// [start:sample-texts]
export const sampleTexts = {
  contact: `
    Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
    She's the Director of Engineering at TechCorp Inc. You can reach her
    at sarah.chen@techcorp.io or call her office at (555) 123-4567.
    She's really excited to collaborate on the upcoming project!
  `,
  invoice: `
    INVOICE #INV-2024-0042
    Date: January 3, 2026
    From: Acme Supplies Ltd.

    Items:
    - Widget Pro (x5) @ $29.99 each = $149.95
    - Super Gadget (x2) @ $89.50 each = $179.00
    - Premium Cable (x10) @ $12.00 each = $120.00

    Subtotal: $448.95
    Tax (8%): $35.92
    TOTAL DUE: $484.87
  `,
  event: `
    You're invited to the Annual Tech Summit 2026!

    Join us on March 15, 2026 at 9:00 AM at the Grand Convention Center
    in downtown San Francisco. This year's theme is "AI in Practice".

    Hosted by the Bay Area Tech Association, we'll have speakers from
    Google, Meta, and OpenAI. Expected attendees include John Smith,
    Jane Doe, and Bob Wilson from our team.

    Don't miss this opportunity to network and learn!
  `
};
// [end:sample-texts]

// [start:create-config]
export function createExtractorConfig(): ExtractorConfig {
  return {
    provider: process.env.OPENAI_API_KEY ? 'openai' : 'anthropic',
    model: process.env.OPENAI_API_KEY ? 'gpt-4o-mini' : 'claude-3-haiku-20240307',
    apiKey: process.env.OPENAI_API_KEY || process.env.ANTHROPIC_API_KEY
  };
}
// [end:create-config]

// [start:usage-example]
/**
 * Example usage of the structured data extractor:
 *
 * ```typescript
 * const config = createExtractorConfig();
 *
 * const result = await extractData<Contact>(
 *   config,
 *   'Sarah Chen is Director at TechCorp, sarah@techcorp.io',
 *   contactSchema,
 *   'contact'
 * );
 *
 * console.log(result.data);
 * // { name: "Sarah Chen", email: "sarah@techcorp.io", ... }
 * ```
 */
// [end:usage-example]

Extract Data

The extractData function handles the entire extraction flow:

typescript
/**
 * Structured Data Extractor Library
 *
 * Exported functions for the Structured Data Extractor recipe.
 * Snippet markers allow VitePress to extract code for documentation.
 */

import { ChatClient, ToolRegistry } from '../../../src';

// [start:colors]
export const colors = {
  reset: '\x1b[0m',
  dim: '\x1b[2m',
  bold: '\x1b[1m',
  red: '\x1b[31m',
  yellow: '\x1b[33m',
  green: '\x1b[32m',
  cyan: '\x1b[36m',
  magenta: '\x1b[35m'
};
// [end:colors]

// [start:types]
export interface JSONSchema {
  type: string;
  properties?: Record<string, JSONSchema & { description?: string }>;
  items?: JSONSchema;
  required?: string[];
  description?: string;
  enum?: string[];
  format?: string;
  minimum?: number;
  maximum?: number;
}

export interface ExtractionResult<T> {
  data: T;
  attempts: number;
  raw?: unknown;
}

export interface ValidationError {
  field: string;
  message: string;
}

export interface ExtractorConfig {
  provider: 'openai' | 'anthropic' | 'ollama';
  model: string;
  apiKey?: string;
  baseURL?: string;
}
// [end:types]

// [start:logging]
export function printBanner(): void {
  console.log(`
${colors.cyan}╭─────────────────────────────────────────────╮
│  ${colors.reset}${colors.bold}Structured Data Extractor${colors.reset}${colors.cyan}                  │
│  ${colors.dim}JSON from Unstructured Text${colors.reset}${colors.cyan}                 │
╰─────────────────────────────────────────────╯${colors.reset}
`);
}

export function log(level: 'info' | 'warn' | 'error' | 'success', message: string): void {
  const prefix = {
    info: `${colors.cyan}ℹ${colors.reset}`,
    warn: `${colors.yellow}⚠${colors.reset}`,
    error: `${colors.red}✗${colors.reset}`,
    success: `${colors.green}✓${colors.reset}`
  };
  console.log(`${prefix[level]} ${message}`);
}
// [end:logging]

// [start:validation]
/**
 * Simple JSON Schema validator
 * Returns array of validation errors (empty if valid)
 */
export function validateAgainstSchema(
  data: unknown,
  schema: JSONSchema,
  path = ''
): ValidationError[] {
  const errors: ValidationError[] = [];

  if (data === null || data === undefined) {
    if (schema.required?.length) {
      errors.push({ field: path || 'root', message: 'Value is required' });
    }
    return errors;
  }

  // Type checking
  if (schema.type === 'object' && typeof data === 'object' && !Array.isArray(data)) {
    const obj = data as Record<string, unknown>;

    // Check required fields
    for (const field of schema.required || []) {
      if (!(field in obj) || obj[field] === null || obj[field] === undefined) {
        errors.push({
          field: path ? `${path}.${field}` : field,
          message: 'Required field missing'
        });
      }
    }

    // Validate properties
    if (schema.properties) {
      for (const [key, propSchema] of Object.entries(schema.properties)) {
        if (key in obj) {
          errors.push(
            ...validateAgainstSchema(obj[key], propSchema, path ? `${path}.${key}` : key)
          );
        }
      }
    }
  } else if (schema.type === 'array' && Array.isArray(data)) {
    if (schema.items) {
      data.forEach((item, index) => {
        errors.push(...validateAgainstSchema(item, schema.items!, `${path}[${index}]`));
      });
    }
  } else if (schema.type === 'string') {
    if (typeof data !== 'string') {
      errors.push({ field: path, message: `Expected string, got ${typeof data}` });
    } else if (schema.format === 'email' && !data.includes('@')) {
      errors.push({ field: path, message: 'Invalid email format' });
    } else if (schema.format === 'date' && isNaN(Date.parse(data))) {
      errors.push({ field: path, message: 'Invalid date format' });
    }
  } else if (schema.type === 'number') {
    if (typeof data !== 'number') {
      errors.push({ field: path, message: `Expected number, got ${typeof data}` });
    } else {
      if (schema.minimum !== undefined && data < schema.minimum) {
        errors.push({ field: path, message: `Value must be >= ${schema.minimum}` });
      }
      if (schema.maximum !== undefined && data > schema.maximum) {
        errors.push({ field: path, message: `Value must be <= ${schema.maximum}` });
      }
    }
  } else if (schema.type === 'boolean' && typeof data !== 'boolean') {
    errors.push({ field: path, message: `Expected boolean, got ${typeof data}` });
  }

  // Enum validation
  if (schema.enum && !schema.enum.includes(data as string)) {
    errors.push({ field: path, message: `Value must be one of: ${schema.enum.join(', ')}` });
  }

  return errors;
}
// [end:validation]

// [start:extract-data]
/**
 * Extract structured data from text using LLM tool calling
 */
export async function extractData<T>(
  config: ExtractorConfig,
  text: string,
  schema: JSONSchema,
  schemaName: string,
  maxRetries = 3
): Promise<ExtractionResult<T>> {
  let attempts = 0;
  let lastErrors: ValidationError[] = [];
  let extractedData: unknown = null;

  while (attempts < maxRetries) {
    attempts++;

    // Create tool registry with extraction tool
    const registry = new ToolRegistry();

    registry.registerTool(
      'extract_data',
      async (args: Record<string, unknown>) => {
        extractedData = args;
        return { success: true, data: args };
      },
      {
        description: `Extract ${schemaName} from the provided text. Call this tool with the extracted data.`,
        parameters: schema
      }
    );

    // Build prompt
    let prompt = `Extract the following information from the text below and call the extract_data tool with the result.

Text:
"""
${text}
"""

Extract all ${schemaName} information you can find. For optional fields, only include them if the information is clearly present.`;

    // Add error feedback for retries
    if (lastErrors.length > 0) {
      prompt += `\n\nPrevious extraction had these validation errors, please fix them:
${lastErrors.map((e) => `- ${e.field}: ${e.message}`).join('\n')}`;
    }

    log('info', `Attempt ${attempts}/${maxRetries}: Extracting ${schemaName}...`);

    // Create extraction client with tool
    const extractionClient = new ChatClient({
      provider: config.provider,
      model: config.model,
      apiKey: config.apiKey,
      tools: registry
    });

    try {
      await extractionClient.chat(prompt);

      // Validate extracted data
      if (extractedData) {
        const errors = validateAgainstSchema(extractedData, schema);

        if (errors.length === 0) {
          log('success', `Extraction successful on attempt ${attempts}`);
          return {
            data: extractedData as T,
            attempts,
            raw: extractedData
          };
        }

        lastErrors = errors;
        log('warn', `Validation failed: ${errors.map((e) => e.message).join(', ')}`);
      } else {
        lastErrors = [{ field: 'root', message: 'No data extracted' }];
        log('warn', 'No data was extracted');
      }
    } catch (error) {
      log('error', `Extraction error: ${error instanceof Error ? error.message : String(error)}`);
      lastErrors = [{ field: 'root', message: String(error) }];
    }
  }

  throw new Error(
    `Failed to extract valid ${schemaName} after ${maxRetries} attempts. Last errors: ${lastErrors.map((e) => `${e.field}: ${e.message}`).join(', ')}`
  );
}
// [end:extract-data]

// [start:contact-schema]
export const contactSchema: JSONSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    email: { type: 'string', format: 'email', description: 'Email address' },
    phone: { type: 'string', description: 'Phone number' },
    company: { type: 'string', description: 'Company or organization' },
    role: { type: 'string', description: 'Job title or role' }
  },
  required: ['name', 'email']
};

export interface Contact {
  name: string;
  email: string;
  phone?: string;
  company?: string;
  role?: string;
}
// [end:contact-schema]

// [start:invoice-schema]
export const invoiceSchema: JSONSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string', description: 'Invoice or reference number' },
    date: { type: 'string', format: 'date', description: 'Invoice date (YYYY-MM-DD)' },
    vendor: { type: 'string', description: 'Vendor or seller name' },
    items: {
      type: 'array',
      description: 'Line items on the invoice',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string', description: 'Item description' },
          quantity: { type: 'number', minimum: 0, description: 'Quantity' },
          unitPrice: { type: 'number', minimum: 0, description: 'Price per unit' },
          total: { type: 'number', minimum: 0, description: 'Line total' }
        },
        required: ['description', 'quantity', 'unitPrice']
      }
    },
    subtotal: { type: 'number', minimum: 0, description: 'Subtotal before tax' },
    tax: { type: 'number', minimum: 0, description: 'Tax amount' },
    total: { type: 'number', minimum: 0, description: 'Total amount due' }
  },
  required: ['invoiceNumber', 'date', 'items', 'total']
};

export interface Invoice {
  invoiceNumber: string;
  date: string;
  vendor?: string;
  items: Array<{
    description: string;
    quantity: number;
    unitPrice: number;
    total?: number;
  }>;
  subtotal?: number;
  tax?: number;
  total: number;
}
// [end:invoice-schema]

// [start:event-schema]
export const eventSchema: JSONSchema = {
  type: 'object',
  properties: {
    title: { type: 'string', description: 'Event title or name' },
    date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
    time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
    location: { type: 'string', description: 'Event location or venue' },
    description: { type: 'string', description: 'Event description' },
    organizer: { type: 'string', description: 'Event organizer' },
    attendees: {
      type: 'array',
      description: 'List of attendees',
      items: { type: 'string' }
    }
  },
  required: ['title', 'date']
};

export interface CalendarEvent {
  title: string;
  date: string;
  time?: string;
  location?: string;
  description?: string;
  organizer?: string;
  attendees?: string[];
}
// [end:event-schema]

// [start:sample-texts]
export const sampleTexts = {
  contact: `
    Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
    She's the Director of Engineering at TechCorp Inc. You can reach her
    at sarah.chen@techcorp.io or call her office at (555) 123-4567.
    She's really excited to collaborate on the upcoming project!
  `,
  invoice: `
    INVOICE #INV-2024-0042
    Date: January 3, 2026
    From: Acme Supplies Ltd.

    Items:
    - Widget Pro (x5) @ $29.99 each = $149.95
    - Super Gadget (x2) @ $89.50 each = $179.00
    - Premium Cable (x10) @ $12.00 each = $120.00

    Subtotal: $448.95
    Tax (8%): $35.92
    TOTAL DUE: $484.87
  `,
  event: `
    You're invited to the Annual Tech Summit 2026!

    Join us on March 15, 2026 at 9:00 AM at the Grand Convention Center
    in downtown San Francisco. This year's theme is "AI in Practice".

    Hosted by the Bay Area Tech Association, we'll have speakers from
    Google, Meta, and OpenAI. Expected attendees include John Smith,
    Jane Doe, and Bob Wilson from our team.

    Don't miss this opportunity to network and learn!
  `
};
// [end:sample-texts]

// [start:create-config]
export function createExtractorConfig(): ExtractorConfig {
  return {
    provider: process.env.OPENAI_API_KEY ? 'openai' : 'anthropic',
    model: process.env.OPENAI_API_KEY ? 'gpt-4o-mini' : 'claude-3-haiku-20240307',
    apiKey: process.env.OPENAI_API_KEY || process.env.ANTHROPIC_API_KEY
  };
}
// [end:create-config]

// [start:usage-example]
/**
 * Example usage of the structured data extractor:
 *
 * ```typescript
 * const config = createExtractorConfig();
 *
 * const result = await extractData<Contact>(
 *   config,
 *   'Sarah Chen is Director at TechCorp, sarah@techcorp.io',
 *   contactSchema,
 *   'contact'
 * );
 *
 * console.log(result.data);
 * // { name: "Sarah Chen", email: "sarah@techcorp.io", ... }
 * ```
 */
// [end:usage-example]

Usage

typescript
/**
 * Structured Data Extractor Library
 *
 * Exported functions for the Structured Data Extractor recipe.
 * Snippet markers allow VitePress to extract code for documentation.
 */

import { ChatClient, ToolRegistry } from '../../../src';

// [start:colors]
export const colors = {
  reset: '\x1b[0m',
  dim: '\x1b[2m',
  bold: '\x1b[1m',
  red: '\x1b[31m',
  yellow: '\x1b[33m',
  green: '\x1b[32m',
  cyan: '\x1b[36m',
  magenta: '\x1b[35m'
};
// [end:colors]

// [start:types]
export interface JSONSchema {
  type: string;
  properties?: Record<string, JSONSchema & { description?: string }>;
  items?: JSONSchema;
  required?: string[];
  description?: string;
  enum?: string[];
  format?: string;
  minimum?: number;
  maximum?: number;
}

export interface ExtractionResult<T> {
  data: T;
  attempts: number;
  raw?: unknown;
}

export interface ValidationError {
  field: string;
  message: string;
}

export interface ExtractorConfig {
  provider: 'openai' | 'anthropic' | 'ollama';
  model: string;
  apiKey?: string;
  baseURL?: string;
}
// [end:types]

// [start:logging]
export function printBanner(): void {
  console.log(`
${colors.cyan}╭─────────────────────────────────────────────╮
│  ${colors.reset}${colors.bold}Structured Data Extractor${colors.reset}${colors.cyan}                  │
│  ${colors.dim}JSON from Unstructured Text${colors.reset}${colors.cyan}                 │
╰─────────────────────────────────────────────╯${colors.reset}
`);
}

export function log(level: 'info' | 'warn' | 'error' | 'success', message: string): void {
  const prefix = {
    info: `${colors.cyan}ℹ${colors.reset}`,
    warn: `${colors.yellow}⚠${colors.reset}`,
    error: `${colors.red}✗${colors.reset}`,
    success: `${colors.green}✓${colors.reset}`
  };
  console.log(`${prefix[level]} ${message}`);
}
// [end:logging]

// [start:validation]
/**
 * Simple JSON Schema validator
 * Returns array of validation errors (empty if valid)
 */
export function validateAgainstSchema(
  data: unknown,
  schema: JSONSchema,
  path = ''
): ValidationError[] {
  const errors: ValidationError[] = [];

  if (data === null || data === undefined) {
    if (schema.required?.length) {
      errors.push({ field: path || 'root', message: 'Value is required' });
    }
    return errors;
  }

  // Type checking
  if (schema.type === 'object' && typeof data === 'object' && !Array.isArray(data)) {
    const obj = data as Record<string, unknown>;

    // Check required fields
    for (const field of schema.required || []) {
      if (!(field in obj) || obj[field] === null || obj[field] === undefined) {
        errors.push({
          field: path ? `${path}.${field}` : field,
          message: 'Required field missing'
        });
      }
    }

    // Validate properties
    if (schema.properties) {
      for (const [key, propSchema] of Object.entries(schema.properties)) {
        if (key in obj) {
          errors.push(
            ...validateAgainstSchema(obj[key], propSchema, path ? `${path}.${key}` : key)
          );
        }
      }
    }
  } else if (schema.type === 'array' && Array.isArray(data)) {
    if (schema.items) {
      data.forEach((item, index) => {
        errors.push(...validateAgainstSchema(item, schema.items!, `${path}[${index}]`));
      });
    }
  } else if (schema.type === 'string') {
    if (typeof data !== 'string') {
      errors.push({ field: path, message: `Expected string, got ${typeof data}` });
    } else if (schema.format === 'email' && !data.includes('@')) {
      errors.push({ field: path, message: 'Invalid email format' });
    } else if (schema.format === 'date' && isNaN(Date.parse(data))) {
      errors.push({ field: path, message: 'Invalid date format' });
    }
  } else if (schema.type === 'number') {
    if (typeof data !== 'number') {
      errors.push({ field: path, message: `Expected number, got ${typeof data}` });
    } else {
      if (schema.minimum !== undefined && data < schema.minimum) {
        errors.push({ field: path, message: `Value must be >= ${schema.minimum}` });
      }
      if (schema.maximum !== undefined && data > schema.maximum) {
        errors.push({ field: path, message: `Value must be <= ${schema.maximum}` });
      }
    }
  } else if (schema.type === 'boolean' && typeof data !== 'boolean') {
    errors.push({ field: path, message: `Expected boolean, got ${typeof data}` });
  }

  // Enum validation
  if (schema.enum && !schema.enum.includes(data as string)) {
    errors.push({ field: path, message: `Value must be one of: ${schema.enum.join(', ')}` });
  }

  return errors;
}
// [end:validation]

// [start:extract-data]
/**
 * Extract structured data from text using LLM tool calling
 */
export async function extractData<T>(
  config: ExtractorConfig,
  text: string,
  schema: JSONSchema,
  schemaName: string,
  maxRetries = 3
): Promise<ExtractionResult<T>> {
  let attempts = 0;
  let lastErrors: ValidationError[] = [];
  let extractedData: unknown = null;

  while (attempts < maxRetries) {
    attempts++;

    // Create tool registry with extraction tool
    const registry = new ToolRegistry();

    registry.registerTool(
      'extract_data',
      async (args: Record<string, unknown>) => {
        extractedData = args;
        return { success: true, data: args };
      },
      {
        description: `Extract ${schemaName} from the provided text. Call this tool with the extracted data.`,
        parameters: schema
      }
    );

    // Build prompt
    let prompt = `Extract the following information from the text below and call the extract_data tool with the result.

Text:
"""
${text}
"""

Extract all ${schemaName} information you can find. For optional fields, only include them if the information is clearly present.`;

    // Add error feedback for retries
    if (lastErrors.length > 0) {
      prompt += `\n\nPrevious extraction had these validation errors, please fix them:
${lastErrors.map((e) => `- ${e.field}: ${e.message}`).join('\n')}`;
    }

    log('info', `Attempt ${attempts}/${maxRetries}: Extracting ${schemaName}...`);

    // Create extraction client with tool
    const extractionClient = new ChatClient({
      provider: config.provider,
      model: config.model,
      apiKey: config.apiKey,
      tools: registry
    });

    try {
      await extractionClient.chat(prompt);

      // Validate extracted data
      if (extractedData) {
        const errors = validateAgainstSchema(extractedData, schema);

        if (errors.length === 0) {
          log('success', `Extraction successful on attempt ${attempts}`);
          return {
            data: extractedData as T,
            attempts,
            raw: extractedData
          };
        }

        lastErrors = errors;
        log('warn', `Validation failed: ${errors.map((e) => e.message).join(', ')}`);
      } else {
        lastErrors = [{ field: 'root', message: 'No data extracted' }];
        log('warn', 'No data was extracted');
      }
    } catch (error) {
      log('error', `Extraction error: ${error instanceof Error ? error.message : String(error)}`);
      lastErrors = [{ field: 'root', message: String(error) }];
    }
  }

  throw new Error(
    `Failed to extract valid ${schemaName} after ${maxRetries} attempts. Last errors: ${lastErrors.map((e) => `${e.field}: ${e.message}`).join(', ')}`
  );
}
// [end:extract-data]

// [start:contact-schema]
export const contactSchema: JSONSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    email: { type: 'string', format: 'email', description: 'Email address' },
    phone: { type: 'string', description: 'Phone number' },
    company: { type: 'string', description: 'Company or organization' },
    role: { type: 'string', description: 'Job title or role' }
  },
  required: ['name', 'email']
};

export interface Contact {
  name: string;
  email: string;
  phone?: string;
  company?: string;
  role?: string;
}
// [end:contact-schema]

// [start:invoice-schema]
export const invoiceSchema: JSONSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string', description: 'Invoice or reference number' },
    date: { type: 'string', format: 'date', description: 'Invoice date (YYYY-MM-DD)' },
    vendor: { type: 'string', description: 'Vendor or seller name' },
    items: {
      type: 'array',
      description: 'Line items on the invoice',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string', description: 'Item description' },
          quantity: { type: 'number', minimum: 0, description: 'Quantity' },
          unitPrice: { type: 'number', minimum: 0, description: 'Price per unit' },
          total: { type: 'number', minimum: 0, description: 'Line total' }
        },
        required: ['description', 'quantity', 'unitPrice']
      }
    },
    subtotal: { type: 'number', minimum: 0, description: 'Subtotal before tax' },
    tax: { type: 'number', minimum: 0, description: 'Tax amount' },
    total: { type: 'number', minimum: 0, description: 'Total amount due' }
  },
  required: ['invoiceNumber', 'date', 'items', 'total']
};

export interface Invoice {
  invoiceNumber: string;
  date: string;
  vendor?: string;
  items: Array<{
    description: string;
    quantity: number;
    unitPrice: number;
    total?: number;
  }>;
  subtotal?: number;
  tax?: number;
  total: number;
}
// [end:invoice-schema]

// [start:event-schema]
export const eventSchema: JSONSchema = {
  type: 'object',
  properties: {
    title: { type: 'string', description: 'Event title or name' },
    date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
    time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
    location: { type: 'string', description: 'Event location or venue' },
    description: { type: 'string', description: 'Event description' },
    organizer: { type: 'string', description: 'Event organizer' },
    attendees: {
      type: 'array',
      description: 'List of attendees',
      items: { type: 'string' }
    }
  },
  required: ['title', 'date']
};

export interface CalendarEvent {
  title: string;
  date: string;
  time?: string;
  location?: string;
  description?: string;
  organizer?: string;
  attendees?: string[];
}
// [end:event-schema]

// [start:sample-texts]
export const sampleTexts = {
  contact: `
    Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
    She's the Director of Engineering at TechCorp Inc. You can reach her
    at sarah.chen@techcorp.io or call her office at (555) 123-4567.
    She's really excited to collaborate on the upcoming project!
  `,
  invoice: `
    INVOICE #INV-2024-0042
    Date: January 3, 2026
    From: Acme Supplies Ltd.

    Items:
    - Widget Pro (x5) @ $29.99 each = $149.95
    - Super Gadget (x2) @ $89.50 each = $179.00
    - Premium Cable (x10) @ $12.00 each = $120.00

    Subtotal: $448.95
    Tax (8%): $35.92
    TOTAL DUE: $484.87
  `,
  event: `
    You're invited to the Annual Tech Summit 2026!

    Join us on March 15, 2026 at 9:00 AM at the Grand Convention Center
    in downtown San Francisco. This year's theme is "AI in Practice".

    Hosted by the Bay Area Tech Association, we'll have speakers from
    Google, Meta, and OpenAI. Expected attendees include John Smith,
    Jane Doe, and Bob Wilson from our team.

    Don't miss this opportunity to network and learn!
  `
};
// [end:sample-texts]

// [start:create-config]
export function createExtractorConfig(): ExtractorConfig {
  return {
    provider: process.env.OPENAI_API_KEY ? 'openai' : 'anthropic',
    model: process.env.OPENAI_API_KEY ? 'gpt-4o-mini' : 'claude-3-haiku-20240307',
    apiKey: process.env.OPENAI_API_KEY || process.env.ANTHROPIC_API_KEY
  };
}
// [end:create-config]

// [start:usage-example]
/**
 * Example usage of the structured data extractor:
 *
 * ```typescript
 * const config = createExtractorConfig();
 *
 * const result = await extractData<Contact>(
 *   config,
 *   'Sarah Chen is Director at TechCorp, sarah@techcorp.io',
 *   contactSchema,
 *   'contact'
 * );
 *
 * console.log(result.data);
 * // { name: "Sarah Chen", email: "sarah@techcorp.io", ... }
 * ```
 */
// [end:usage-example]

More Schema Examples

Invoice

typescript
/**
 * Structured Data Extractor Library
 *
 * Exported functions for the Structured Data Extractor recipe.
 * Snippet markers allow VitePress to extract code for documentation.
 */

import { ChatClient, ToolRegistry } from '../../../src';

// [start:colors]
export const colors = {
  reset: '\x1b[0m',
  dim: '\x1b[2m',
  bold: '\x1b[1m',
  red: '\x1b[31m',
  yellow: '\x1b[33m',
  green: '\x1b[32m',
  cyan: '\x1b[36m',
  magenta: '\x1b[35m'
};
// [end:colors]

// [start:types]
export interface JSONSchema {
  type: string;
  properties?: Record<string, JSONSchema & { description?: string }>;
  items?: JSONSchema;
  required?: string[];
  description?: string;
  enum?: string[];
  format?: string;
  minimum?: number;
  maximum?: number;
}

export interface ExtractionResult<T> {
  data: T;
  attempts: number;
  raw?: unknown;
}

export interface ValidationError {
  field: string;
  message: string;
}

export interface ExtractorConfig {
  provider: 'openai' | 'anthropic' | 'ollama';
  model: string;
  apiKey?: string;
  baseURL?: string;
}
// [end:types]

// [start:logging]
export function printBanner(): void {
  console.log(`
${colors.cyan}╭─────────────────────────────────────────────╮
│  ${colors.reset}${colors.bold}Structured Data Extractor${colors.reset}${colors.cyan}                  │
│  ${colors.dim}JSON from Unstructured Text${colors.reset}${colors.cyan}                 │
╰─────────────────────────────────────────────╯${colors.reset}
`);
}

export function log(level: 'info' | 'warn' | 'error' | 'success', message: string): void {
  const prefix = {
    info: `${colors.cyan}ℹ${colors.reset}`,
    warn: `${colors.yellow}⚠${colors.reset}`,
    error: `${colors.red}✗${colors.reset}`,
    success: `${colors.green}✓${colors.reset}`
  };
  console.log(`${prefix[level]} ${message}`);
}
// [end:logging]

// [start:validation]
/**
 * Simple JSON Schema validator
 * Returns array of validation errors (empty if valid)
 */
export function validateAgainstSchema(
  data: unknown,
  schema: JSONSchema,
  path = ''
): ValidationError[] {
  const errors: ValidationError[] = [];

  if (data === null || data === undefined) {
    if (schema.required?.length) {
      errors.push({ field: path || 'root', message: 'Value is required' });
    }
    return errors;
  }

  // Type checking
  if (schema.type === 'object' && typeof data === 'object' && !Array.isArray(data)) {
    const obj = data as Record<string, unknown>;

    // Check required fields
    for (const field of schema.required || []) {
      if (!(field in obj) || obj[field] === null || obj[field] === undefined) {
        errors.push({
          field: path ? `${path}.${field}` : field,
          message: 'Required field missing'
        });
      }
    }

    // Validate properties
    if (schema.properties) {
      for (const [key, propSchema] of Object.entries(schema.properties)) {
        if (key in obj) {
          errors.push(
            ...validateAgainstSchema(obj[key], propSchema, path ? `${path}.${key}` : key)
          );
        }
      }
    }
  } else if (schema.type === 'array' && Array.isArray(data)) {
    if (schema.items) {
      data.forEach((item, index) => {
        errors.push(...validateAgainstSchema(item, schema.items!, `${path}[${index}]`));
      });
    }
  } else if (schema.type === 'string') {
    if (typeof data !== 'string') {
      errors.push({ field: path, message: `Expected string, got ${typeof data}` });
    } else if (schema.format === 'email' && !data.includes('@')) {
      errors.push({ field: path, message: 'Invalid email format' });
    } else if (schema.format === 'date' && isNaN(Date.parse(data))) {
      errors.push({ field: path, message: 'Invalid date format' });
    }
  } else if (schema.type === 'number') {
    if (typeof data !== 'number') {
      errors.push({ field: path, message: `Expected number, got ${typeof data}` });
    } else {
      if (schema.minimum !== undefined && data < schema.minimum) {
        errors.push({ field: path, message: `Value must be >= ${schema.minimum}` });
      }
      if (schema.maximum !== undefined && data > schema.maximum) {
        errors.push({ field: path, message: `Value must be <= ${schema.maximum}` });
      }
    }
  } else if (schema.type === 'boolean' && typeof data !== 'boolean') {
    errors.push({ field: path, message: `Expected boolean, got ${typeof data}` });
  }

  // Enum validation
  if (schema.enum && !schema.enum.includes(data as string)) {
    errors.push({ field: path, message: `Value must be one of: ${schema.enum.join(', ')}` });
  }

  return errors;
}
// [end:validation]

// [start:extract-data]
/**
 * Extract structured data from text using LLM tool calling
 */
export async function extractData<T>(
  config: ExtractorConfig,
  text: string,
  schema: JSONSchema,
  schemaName: string,
  maxRetries = 3
): Promise<ExtractionResult<T>> {
  let attempts = 0;
  let lastErrors: ValidationError[] = [];
  let extractedData: unknown = null;

  while (attempts < maxRetries) {
    attempts++;

    // Create tool registry with extraction tool
    const registry = new ToolRegistry();

    registry.registerTool(
      'extract_data',
      async (args: Record<string, unknown>) => {
        extractedData = args;
        return { success: true, data: args };
      },
      {
        description: `Extract ${schemaName} from the provided text. Call this tool with the extracted data.`,
        parameters: schema
      }
    );

    // Build prompt
    let prompt = `Extract the following information from the text below and call the extract_data tool with the result.

Text:
"""
${text}
"""

Extract all ${schemaName} information you can find. For optional fields, only include them if the information is clearly present.`;

    // Add error feedback for retries
    if (lastErrors.length > 0) {
      prompt += `\n\nPrevious extraction had these validation errors, please fix them:
${lastErrors.map((e) => `- ${e.field}: ${e.message}`).join('\n')}`;
    }

    log('info', `Attempt ${attempts}/${maxRetries}: Extracting ${schemaName}...`);

    // Create extraction client with tool
    const extractionClient = new ChatClient({
      provider: config.provider,
      model: config.model,
      apiKey: config.apiKey,
      tools: registry
    });

    try {
      await extractionClient.chat(prompt);

      // Validate extracted data
      if (extractedData) {
        const errors = validateAgainstSchema(extractedData, schema);

        if (errors.length === 0) {
          log('success', `Extraction successful on attempt ${attempts}`);
          return {
            data: extractedData as T,
            attempts,
            raw: extractedData
          };
        }

        lastErrors = errors;
        log('warn', `Validation failed: ${errors.map((e) => e.message).join(', ')}`);
      } else {
        lastErrors = [{ field: 'root', message: 'No data extracted' }];
        log('warn', 'No data was extracted');
      }
    } catch (error) {
      log('error', `Extraction error: ${error instanceof Error ? error.message : String(error)}`);
      lastErrors = [{ field: 'root', message: String(error) }];
    }
  }

  throw new Error(
    `Failed to extract valid ${schemaName} after ${maxRetries} attempts. Last errors: ${lastErrors.map((e) => `${e.field}: ${e.message}`).join(', ')}`
  );
}
// [end:extract-data]

// [start:contact-schema]
export const contactSchema: JSONSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    email: { type: 'string', format: 'email', description: 'Email address' },
    phone: { type: 'string', description: 'Phone number' },
    company: { type: 'string', description: 'Company or organization' },
    role: { type: 'string', description: 'Job title or role' }
  },
  required: ['name', 'email']
};

export interface Contact {
  name: string;
  email: string;
  phone?: string;
  company?: string;
  role?: string;
}
// [end:contact-schema]

// [start:invoice-schema]
export const invoiceSchema: JSONSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string', description: 'Invoice or reference number' },
    date: { type: 'string', format: 'date', description: 'Invoice date (YYYY-MM-DD)' },
    vendor: { type: 'string', description: 'Vendor or seller name' },
    items: {
      type: 'array',
      description: 'Line items on the invoice',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string', description: 'Item description' },
          quantity: { type: 'number', minimum: 0, description: 'Quantity' },
          unitPrice: { type: 'number', minimum: 0, description: 'Price per unit' },
          total: { type: 'number', minimum: 0, description: 'Line total' }
        },
        required: ['description', 'quantity', 'unitPrice']
      }
    },
    subtotal: { type: 'number', minimum: 0, description: 'Subtotal before tax' },
    tax: { type: 'number', minimum: 0, description: 'Tax amount' },
    total: { type: 'number', minimum: 0, description: 'Total amount due' }
  },
  required: ['invoiceNumber', 'date', 'items', 'total']
};

export interface Invoice {
  invoiceNumber: string;
  date: string;
  vendor?: string;
  items: Array<{
    description: string;
    quantity: number;
    unitPrice: number;
    total?: number;
  }>;
  subtotal?: number;
  tax?: number;
  total: number;
}
// [end:invoice-schema]

// [start:event-schema]
export const eventSchema: JSONSchema = {
  type: 'object',
  properties: {
    title: { type: 'string', description: 'Event title or name' },
    date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
    time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
    location: { type: 'string', description: 'Event location or venue' },
    description: { type: 'string', description: 'Event description' },
    organizer: { type: 'string', description: 'Event organizer' },
    attendees: {
      type: 'array',
      description: 'List of attendees',
      items: { type: 'string' }
    }
  },
  required: ['title', 'date']
};

export interface CalendarEvent {
  title: string;
  date: string;
  time?: string;
  location?: string;
  description?: string;
  organizer?: string;
  attendees?: string[];
}
// [end:event-schema]

// [start:sample-texts]
export const sampleTexts = {
  contact: `
    Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
    She's the Director of Engineering at TechCorp Inc. You can reach her
    at sarah.chen@techcorp.io or call her office at (555) 123-4567.
    She's really excited to collaborate on the upcoming project!
  `,
  invoice: `
    INVOICE #INV-2024-0042
    Date: January 3, 2026
    From: Acme Supplies Ltd.

    Items:
    - Widget Pro (x5) @ $29.99 each = $149.95
    - Super Gadget (x2) @ $89.50 each = $179.00
    - Premium Cable (x10) @ $12.00 each = $120.00

    Subtotal: $448.95
    Tax (8%): $35.92
    TOTAL DUE: $484.87
  `,
  event: `
    You're invited to the Annual Tech Summit 2026!

    Join us on March 15, 2026 at 9:00 AM at the Grand Convention Center
    in downtown San Francisco. This year's theme is "AI in Practice".

    Hosted by the Bay Area Tech Association, we'll have speakers from
    Google, Meta, and OpenAI. Expected attendees include John Smith,
    Jane Doe, and Bob Wilson from our team.

    Don't miss this opportunity to network and learn!
  `
};
// [end:sample-texts]

// [start:create-config]
export function createExtractorConfig(): ExtractorConfig {
  return {
    provider: process.env.OPENAI_API_KEY ? 'openai' : 'anthropic',
    model: process.env.OPENAI_API_KEY ? 'gpt-4o-mini' : 'claude-3-haiku-20240307',
    apiKey: process.env.OPENAI_API_KEY || process.env.ANTHROPIC_API_KEY
  };
}
// [end:create-config]

// [start:usage-example]
/**
 * Example usage of the structured data extractor:
 *
 * ```typescript
 * const config = createExtractorConfig();
 *
 * const result = await extractData<Contact>(
 *   config,
 *   'Sarah Chen is Director at TechCorp, sarah@techcorp.io',
 *   contactSchema,
 *   'contact'
 * );
 *
 * console.log(result.data);
 * // { name: "Sarah Chen", email: "sarah@techcorp.io", ... }
 * ```
 */
// [end:usage-example]

Event

typescript
/**
 * Structured Data Extractor Library
 *
 * Exported functions for the Structured Data Extractor recipe.
 * Snippet markers allow VitePress to extract code for documentation.
 */

import { ChatClient, ToolRegistry } from '../../../src';

// [start:colors]
export const colors = {
  reset: '\x1b[0m',
  dim: '\x1b[2m',
  bold: '\x1b[1m',
  red: '\x1b[31m',
  yellow: '\x1b[33m',
  green: '\x1b[32m',
  cyan: '\x1b[36m',
  magenta: '\x1b[35m'
};
// [end:colors]

// [start:types]
export interface JSONSchema {
  type: string;
  properties?: Record<string, JSONSchema & { description?: string }>;
  items?: JSONSchema;
  required?: string[];
  description?: string;
  enum?: string[];
  format?: string;
  minimum?: number;
  maximum?: number;
}

export interface ExtractionResult<T> {
  data: T;
  attempts: number;
  raw?: unknown;
}

export interface ValidationError {
  field: string;
  message: string;
}

export interface ExtractorConfig {
  provider: 'openai' | 'anthropic' | 'ollama';
  model: string;
  apiKey?: string;
  baseURL?: string;
}
// [end:types]

// [start:logging]
export function printBanner(): void {
  console.log(`
${colors.cyan}╭─────────────────────────────────────────────╮
│  ${colors.reset}${colors.bold}Structured Data Extractor${colors.reset}${colors.cyan}                  │
│  ${colors.dim}JSON from Unstructured Text${colors.reset}${colors.cyan}                 │
╰─────────────────────────────────────────────╯${colors.reset}
`);
}

export function log(level: 'info' | 'warn' | 'error' | 'success', message: string): void {
  const prefix = {
    info: `${colors.cyan}ℹ${colors.reset}`,
    warn: `${colors.yellow}⚠${colors.reset}`,
    error: `${colors.red}✗${colors.reset}`,
    success: `${colors.green}✓${colors.reset}`
  };
  console.log(`${prefix[level]} ${message}`);
}
// [end:logging]

// [start:validation]
/**
 * Simple JSON Schema validator
 * Returns array of validation errors (empty if valid)
 */
export function validateAgainstSchema(
  data: unknown,
  schema: JSONSchema,
  path = ''
): ValidationError[] {
  const errors: ValidationError[] = [];

  if (data === null || data === undefined) {
    if (schema.required?.length) {
      errors.push({ field: path || 'root', message: 'Value is required' });
    }
    return errors;
  }

  // Type checking
  if (schema.type === 'object' && typeof data === 'object' && !Array.isArray(data)) {
    const obj = data as Record<string, unknown>;

    // Check required fields
    for (const field of schema.required || []) {
      if (!(field in obj) || obj[field] === null || obj[field] === undefined) {
        errors.push({
          field: path ? `${path}.${field}` : field,
          message: 'Required field missing'
        });
      }
    }

    // Validate properties
    if (schema.properties) {
      for (const [key, propSchema] of Object.entries(schema.properties)) {
        if (key in obj) {
          errors.push(
            ...validateAgainstSchema(obj[key], propSchema, path ? `${path}.${key}` : key)
          );
        }
      }
    }
  } else if (schema.type === 'array' && Array.isArray(data)) {
    if (schema.items) {
      data.forEach((item, index) => {
        errors.push(...validateAgainstSchema(item, schema.items!, `${path}[${index}]`));
      });
    }
  } else if (schema.type === 'string') {
    if (typeof data !== 'string') {
      errors.push({ field: path, message: `Expected string, got ${typeof data}` });
    } else if (schema.format === 'email' && !data.includes('@')) {
      errors.push({ field: path, message: 'Invalid email format' });
    } else if (schema.format === 'date' && isNaN(Date.parse(data))) {
      errors.push({ field: path, message: 'Invalid date format' });
    }
  } else if (schema.type === 'number') {
    if (typeof data !== 'number') {
      errors.push({ field: path, message: `Expected number, got ${typeof data}` });
    } else {
      if (schema.minimum !== undefined && data < schema.minimum) {
        errors.push({ field: path, message: `Value must be >= ${schema.minimum}` });
      }
      if (schema.maximum !== undefined && data > schema.maximum) {
        errors.push({ field: path, message: `Value must be <= ${schema.maximum}` });
      }
    }
  } else if (schema.type === 'boolean' && typeof data !== 'boolean') {
    errors.push({ field: path, message: `Expected boolean, got ${typeof data}` });
  }

  // Enum validation
  if (schema.enum && !schema.enum.includes(data as string)) {
    errors.push({ field: path, message: `Value must be one of: ${schema.enum.join(', ')}` });
  }

  return errors;
}
// [end:validation]

// [start:extract-data]
/**
 * Extract structured data from text using LLM tool calling
 */
export async function extractData<T>(
  config: ExtractorConfig,
  text: string,
  schema: JSONSchema,
  schemaName: string,
  maxRetries = 3
): Promise<ExtractionResult<T>> {
  let attempts = 0;
  let lastErrors: ValidationError[] = [];
  let extractedData: unknown = null;

  while (attempts < maxRetries) {
    attempts++;

    // Create tool registry with extraction tool
    const registry = new ToolRegistry();

    registry.registerTool(
      'extract_data',
      async (args: Record<string, unknown>) => {
        extractedData = args;
        return { success: true, data: args };
      },
      {
        description: `Extract ${schemaName} from the provided text. Call this tool with the extracted data.`,
        parameters: schema
      }
    );

    // Build prompt
    let prompt = `Extract the following information from the text below and call the extract_data tool with the result.

Text:
"""
${text}
"""

Extract all ${schemaName} information you can find. For optional fields, only include them if the information is clearly present.`;

    // Add error feedback for retries
    if (lastErrors.length > 0) {
      prompt += `\n\nPrevious extraction had these validation errors, please fix them:
${lastErrors.map((e) => `- ${e.field}: ${e.message}`).join('\n')}`;
    }

    log('info', `Attempt ${attempts}/${maxRetries}: Extracting ${schemaName}...`);

    // Create extraction client with tool
    const extractionClient = new ChatClient({
      provider: config.provider,
      model: config.model,
      apiKey: config.apiKey,
      tools: registry
    });

    try {
      await extractionClient.chat(prompt);

      // Validate extracted data
      if (extractedData) {
        const errors = validateAgainstSchema(extractedData, schema);

        if (errors.length === 0) {
          log('success', `Extraction successful on attempt ${attempts}`);
          return {
            data: extractedData as T,
            attempts,
            raw: extractedData
          };
        }

        lastErrors = errors;
        log('warn', `Validation failed: ${errors.map((e) => e.message).join(', ')}`);
      } else {
        lastErrors = [{ field: 'root', message: 'No data extracted' }];
        log('warn', 'No data was extracted');
      }
    } catch (error) {
      log('error', `Extraction error: ${error instanceof Error ? error.message : String(error)}`);
      lastErrors = [{ field: 'root', message: String(error) }];
    }
  }

  throw new Error(
    `Failed to extract valid ${schemaName} after ${maxRetries} attempts. Last errors: ${lastErrors.map((e) => `${e.field}: ${e.message}`).join(', ')}`
  );
}
// [end:extract-data]

// [start:contact-schema]
export const contactSchema: JSONSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    email: { type: 'string', format: 'email', description: 'Email address' },
    phone: { type: 'string', description: 'Phone number' },
    company: { type: 'string', description: 'Company or organization' },
    role: { type: 'string', description: 'Job title or role' }
  },
  required: ['name', 'email']
};

export interface Contact {
  name: string;
  email: string;
  phone?: string;
  company?: string;
  role?: string;
}
// [end:contact-schema]

// [start:invoice-schema]
export const invoiceSchema: JSONSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string', description: 'Invoice or reference number' },
    date: { type: 'string', format: 'date', description: 'Invoice date (YYYY-MM-DD)' },
    vendor: { type: 'string', description: 'Vendor or seller name' },
    items: {
      type: 'array',
      description: 'Line items on the invoice',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string', description: 'Item description' },
          quantity: { type: 'number', minimum: 0, description: 'Quantity' },
          unitPrice: { type: 'number', minimum: 0, description: 'Price per unit' },
          total: { type: 'number', minimum: 0, description: 'Line total' }
        },
        required: ['description', 'quantity', 'unitPrice']
      }
    },
    subtotal: { type: 'number', minimum: 0, description: 'Subtotal before tax' },
    tax: { type: 'number', minimum: 0, description: 'Tax amount' },
    total: { type: 'number', minimum: 0, description: 'Total amount due' }
  },
  required: ['invoiceNumber', 'date', 'items', 'total']
};

export interface Invoice {
  invoiceNumber: string;
  date: string;
  vendor?: string;
  items: Array<{
    description: string;
    quantity: number;
    unitPrice: number;
    total?: number;
  }>;
  subtotal?: number;
  tax?: number;
  total: number;
}
// [end:invoice-schema]

// [start:event-schema]
export const eventSchema: JSONSchema = {
  type: 'object',
  properties: {
    title: { type: 'string', description: 'Event title or name' },
    date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
    time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
    location: { type: 'string', description: 'Event location or venue' },
    description: { type: 'string', description: 'Event description' },
    organizer: { type: 'string', description: 'Event organizer' },
    attendees: {
      type: 'array',
      description: 'List of attendees',
      items: { type: 'string' }
    }
  },
  required: ['title', 'date']
};

export interface CalendarEvent {
  title: string;
  date: string;
  time?: string;
  location?: string;
  description?: string;
  organizer?: string;
  attendees?: string[];
}
// [end:event-schema]

// [start:sample-texts]
export const sampleTexts = {
  contact: `
    Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
    She's the Director of Engineering at TechCorp Inc. You can reach her
    at sarah.chen@techcorp.io or call her office at (555) 123-4567.
    She's really excited to collaborate on the upcoming project!
  `,
  invoice: `
    INVOICE #INV-2024-0042
    Date: January 3, 2026
    From: Acme Supplies Ltd.

    Items:
    - Widget Pro (x5) @ $29.99 each = $149.95
    - Super Gadget (x2) @ $89.50 each = $179.00
    - Premium Cable (x10) @ $12.00 each = $120.00

    Subtotal: $448.95
    Tax (8%): $35.92
    TOTAL DUE: $484.87
  `,
  event: `
    You're invited to the Annual Tech Summit 2026!

    Join us on March 15, 2026 at 9:00 AM at the Grand Convention Center
    in downtown San Francisco. This year's theme is "AI in Practice".

    Hosted by the Bay Area Tech Association, we'll have speakers from
    Google, Meta, and OpenAI. Expected attendees include John Smith,
    Jane Doe, and Bob Wilson from our team.

    Don't miss this opportunity to network and learn!
  `
};
// [end:sample-texts]

// [start:create-config]
export function createExtractorConfig(): ExtractorConfig {
  return {
    provider: process.env.OPENAI_API_KEY ? 'openai' : 'anthropic',
    model: process.env.OPENAI_API_KEY ? 'gpt-4o-mini' : 'claude-3-haiku-20240307',
    apiKey: process.env.OPENAI_API_KEY || process.env.ANTHROPIC_API_KEY
  };
}
// [end:create-config]

// [start:usage-example]
/**
 * Example usage of the structured data extractor:
 *
 * ```typescript
 * const config = createExtractorConfig();
 *
 * const result = await extractData<Contact>(
 *   config,
 *   'Sarah Chen is Director at TechCorp, sarah@techcorp.io',
 *   contactSchema,
 *   'contact'
 * );
 *
 * console.log(result.data);
 * // { name: "Sarah Chen", email: "sarah@techcorp.io", ... }
 * ```
 */
// [end:usage-example]

Validation

The built-in validator checks:

CheckDescription
Required fieldsAll required fields must be present
Type checkingValues must match declared type
Format validationemail, date formats are validated
Range validationminimum, maximum for numbers
Enum validationValues must be in enum list

Retry with Feedback

When validation fails, errors are fed back to the LLM:

Previous extraction had these validation errors, please fix them:
- email: Invalid email format
- date: Value must be in format YYYY-MM-DD

This helps the LLM correct mistakes on subsequent attempts.

Environment Variables

VariableRequiredDescription
OPENAI_API_KEYOne of these requiredOpenAI API key
ANTHROPIC_API_KEYAnthropic API key

At least one API key is required. OpenAI is recommended for tool calling.

Use Cases

Data Entry Automation

typescript
// Extract contact info from business cards, emails, signatures
const contact = await extractData<Contact>(config, emailSignature, contactSchema, 'contact');

Document Processing

typescript
// Extract invoice data from scanned documents
const invoice = await extractData<Invoice>(config, ocrText, invoiceSchema, 'invoice');

Event Parsing

typescript
// Extract event details from meeting invitations
const event = await extractData<Event>(config, inviteText, eventSchema, 'event');

Custom Schemas

typescript
// Define any schema for your domain
const productSchema = {
  type: 'object',
  properties: {
    name: { type: 'string' },
    price: { type: 'number', minimum: 0 },
    category: { type: 'string', enum: ['electronics', 'clothing', 'food'] }
  },
  required: ['name', 'price']
};

const product = await extractData<Product>(config, description, productSchema, 'product');

Full Source

View on GitHub

Released under the MIT License.