Structured Data Extractor
Extract structured JSON from unstructured text using LLM tool calling.
Features
- Schema Definition - Define extraction schemas using JSON Schema format
- Tool-Based Extraction - LLM extracts data via tool calls (not string parsing)
- Validation with Retry - Automatic retry on validation failure with error feedback
- Multiple Data Types - Works with contacts, invoices, events, or any schema
- Type-Safe Results - TypeScript generics for type-safe extraction
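To make the "tool calls, not string parsing" point concrete, here is a minimal sketch of how a JSON Schema can be wrapped as a tool definition and how the arguments of a tool call are read back. It assumes the OpenAI Chat Completions function-tool wire format, where arguments arrive as a JSON string. The recipe's own ToolRegistry and ChatClient hide these details, so `SimpleSchema`, `buildToolDefinition`, `readToolArguments`, and the mocked tool call are hypothetical names used only for illustration.

```typescript
// Hypothetical helper types for this sketch (not exported by the recipe).
interface SimpleSchema {
  type: string;
  properties?: Record<string, SimpleSchema>;
  required?: string[];
  description?: string;
}

interface FunctionTool {
  type: 'function';
  function: { name: string; description: string; parameters: SimpleSchema };
}

interface ToolCall {
  function: { name: string; arguments: string };
}

// Wrap a JSON Schema as a function tool: the schema becomes the tool's parameters.
function buildToolDefinition(
  name: string,
  description: string,
  schema: SimpleSchema
): FunctionTool {
  return { type: 'function', function: { name, description, parameters: schema } };
}

// Tool-call arguments are a JSON string, so one JSON.parse replaces any
// ad hoc scraping of free-form model output.
function readToolArguments(call: ToolCall, expectedName: string): Record<string, unknown> {
  if (call.function.name !== expectedName) {
    throw new Error(`Unexpected tool: ${call.function.name}`);
  }
  const parsed: unknown = JSON.parse(call.function.arguments);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('Tool arguments must be a JSON object');
  }
  return parsed as Record<string, unknown>;
}

const tool = buildToolDefinition('extract_data', 'Extract a contact from the text.', {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    email: { type: 'string', description: 'Email address' }
  },
  required: ['name', 'email']
});

// Mocked tool call standing in for a real model response.
const mockedCall: ToolCall = {
  function: {
    name: 'extract_data',
    arguments: JSON.stringify({ name: 'Sarah Chen', email: 'sarah.chen@techcorp.io' })
  }
};

const args = readToolArguments(mockedCall, tool.function.name);
console.log(args['name']); // prints "Sarah Chen"
```

Because the model is constrained to the tool's parameter schema, the result is already an object with named fields, which is what the validation step then checks.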
Quick Start
bash
export OPENAI_API_KEY=your_key
npm run recipe:structured-extractor
How It Works
This recipe extracts structured JSON data from unstructured text using LLM tool calling, with automatic validation and retry on failure.
Supported extractions:
- Contacts (name, email, phone, company)
- Invoices (items, totals, dates)
- Events (title, date, location, attendees)
- Any custom JSON Schema
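As an illustration of the last bullet, a custom schema might look like the sketch below. The `JSONSchema` interface here is a trimmed copy of the shape used by the recipe; `productReviewSchema`, `ProductReview`, and `missingRequired` are hypothetical names introduced only for this example.

```typescript
// Trimmed copy of the recipe's schema shape, repeated so the sketch is self-contained.
interface JSONSchema {
  type: string;
  properties?: Record<string, JSONSchema>;
  items?: JSONSchema;
  required?: string[];
  description?: string;
  enum?: string[];
  minimum?: number;
  maximum?: number;
}

// Hypothetical custom schema: a product review with a bounded rating.
const productReviewSchema: JSONSchema = {
  type: 'object',
  properties: {
    product: { type: 'string', description: 'Name of the product being reviewed' },
    rating: { type: 'number', minimum: 1, maximum: 5, description: 'Star rating from 1 to 5' },
    sentiment: {
      type: 'string',
      enum: ['positive', 'neutral', 'negative'],
      description: 'Overall tone of the review'
    },
    pros: { type: 'array', items: { type: 'string' }, description: 'Things the reviewer liked' }
  },
  required: ['product', 'rating']
};

// Matching TypeScript type: required schema fields are non-optional here.
interface ProductReview {
  product: string;
  rating: number;
  sentiment?: 'positive' | 'neutral' | 'negative';
  pros?: string[];
}

// List the required fields that are absent or null in a candidate object.
function missingRequired(data: object, schema: JSONSchema): string[] {
  const record = data as Record<string, unknown>;
  return (schema.required ?? []).filter(
    (field) => record[field] === undefined || record[field] === null
  );
}

const complete: ProductReview = { product: 'Widget Pro', rating: 4 };
const partial = { product: 'Widget Pro' };

console.log(missingRequired(complete, productReviewSchema)); // []
console.log(missingRequired(partial, productReviewSchema)); // [ 'rating' ]
```

Keeping the interface and the schema side by side makes it easier to notice when a field is required in one but optional in the other.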
Flow:
- Define a JSON Schema for the data you want to extract
- Register a tool with that schema
- Send unstructured text to the LLM
- LLM calls the tool with extracted data
- Validate the result against the schema
- If invalid, retry with error feedback
- Return typed, validated data
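The steps above can be sketched without any network access by replacing the LLM with a stub. This is a simplified illustration, not the recipe's implementation: `Model`, `checkContact`, `extractWithRetry`, and `fakeModel` are hypothetical names, the stub returns its tool-call arguments directly, and validation is hard-coded for a contact rather than driven by a schema.

```typescript
interface Contact {
  name: string;
  email: string;
}

// Stand-in for the LLM: resolves to the arguments of its tool call,
// or null if it answered without calling the tool.
type Model = (prompt: string) => Promise<Record<string, unknown> | null>;

// Validate the tool arguments and build a typed Contact when they pass.
function checkContact(args: Record<string, unknown>): { contact: Contact | null; errors: string[] } {
  const errors: string[] = [];
  const name = args['name'];
  const email = args['email'];
  if (typeof name !== 'string') errors.push('name: Required field missing');
  if (typeof email !== 'string') errors.push('email: Required field missing');
  else if (!email.includes('@')) errors.push('email: Invalid email format');
  if (errors.length > 0 || typeof name !== 'string' || typeof email !== 'string') {
    return { contact: null, errors };
  }
  return { contact: { name, email }, errors };
}

async function extractWithRetry(
  model: Model,
  text: string,
  maxRetries = 3
): Promise<{ data: Contact; attempts: number }> {
  let lastErrors: string[] = [];
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    let prompt = `Extract the contact from the text below.\n"""\n${text}\n"""`;
    // On retries, feed the previous validation errors back to the model.
    if (lastErrors.length > 0) {
      prompt += `\nPrevious extraction had these validation errors:\n${lastErrors.join('\n')}`;
    }
    const args = await model(prompt);
    if (args === null) {
      lastErrors = ['root: No data extracted'];
      continue;
    }
    const { contact, errors } = checkContact(args);
    if (contact !== null) return { data: contact, attempts: attempt };
    lastErrors = errors;
  }
  throw new Error(`Failed after ${maxRetries} attempts: ${lastErrors.join(', ')}`);
}

// Stub model: returns a malformed email first, then a corrected one
// once the prompt contains validation feedback.
const prompts: string[] = [];
const fakeModel: Model = async (prompt) => {
  prompts.push(prompt);
  return prompt.includes('validation errors')
    ? { name: 'Sarah Chen', email: 'sarah.chen@techcorp.io' }
    : { name: 'Sarah Chen', email: 'sarah.chen at techcorp.io' };
};

const resultPromise = extractWithRetry(fakeModel, 'Reach Sarah Chen at sarah.chen@techcorp.io');
resultPromise.then((result) => {
  console.log(result.attempts, result.data.email); // prints "2 sarah.chen@techcorp.io"
});
```

The second prompt differs from the first only by the appended error list, which is what gives the model a chance to correct a specific field instead of repeating the same mistake.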
Example Output
╭─────────────────────────────────────────────╮
│ Structured Data Extractor │
│ JSON from Unstructured Text │
╰─────────────────────────────────────────────╯
Using: openai/gpt-4o-mini
━━━ Contact Extraction ━━━
Input text:
Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
She's the Director of Engineering at TechCorp Inc. You can reach her
at sarah.chen@techcorp.io or call her office at (555) 123-4567.
ℹ Attempt 1/3: Extracting contact...
✓ Extraction successful on attempt 1
Extracted Contact:
{
"name": "Sarah Chen",
"email": "sarah.chen@techcorp.io",
"phone": "(555) 123-4567",
"company": "TechCorp Inc.",
"role": "Director of Engineering"
}
Attempts: 1
━━━ Invoice Extraction ━━━
Input text:
INVOICE #INV-2024-0042
Date: January 3, 2026
From: Acme Supplies Ltd.
Items:
- Widget Pro (x5) @ $29.99 each = $149.95
- Super Gadget (x2) @ $89.50 each = $179.00
- Premium Cable (x10) @ $12.00 each = $120.00
Subtotal: $448.95
Tax (8%): $35.92
TOTAL DUE: $484.87
ℹ Attempt 1/3: Extracting invoice...
✓ Extraction successful on attempt 1
Extracted Invoice:
{
"invoiceNumber": "INV-2024-0042",
"date": "2026-01-03",
"vendor": "Acme Supplies Ltd.",
"items": [
{ "description": "Widget Pro", "quantity": 5, "unitPrice": 29.99, "total": 149.95 },
{ "description": "Super Gadget", "quantity": 2, "unitPrice": 89.5, "total": 179 },
{ "description": "Premium Cable", "quantity": 10, "unitPrice": 12, "total": 120 }
],
"subtotal": 448.95,
"tax": 35.92,
"total": 484.87
}
Attempts: 1
Code Walkthrough
Define Your Schema
Use JSON Schema format to define what data you want to extract:
typescript
/**
* Structured Data Extractor Library
*
* Exported functions for the Structured Data Extractor recipe.
* Snippet markers allow VitePress to extract code for documentation.
*/
import { ChatClient, ToolRegistry } from '../../../src';
// [start:colors]
export const colors = {
reset: '\x1b[0m',
dim: '\x1b[2m',
bold: '\x1b[1m',
red: '\x1b[31m',
yellow: '\x1b[33m',
green: '\x1b[32m',
cyan: '\x1b[36m',
magenta: '\x1b[35m'
};
// [end:colors]
// [start:types]
export interface JSONSchema {
type: string;
properties?: Record<string, JSONSchema & { description?: string }>;
items?: JSONSchema;
required?: string[];
description?: string;
enum?: string[];
format?: string;
minimum?: number;
maximum?: number;
}
export interface ExtractionResult<T> {
data: T;
attempts: number;
raw?: unknown;
}
export interface ValidationError {
field: string;
message: string;
}
export interface ExtractorConfig {
provider: 'openai' | 'anthropic' | 'ollama';
model: string;
apiKey?: string;
baseURL?: string;
}
// [end:types]
// [start:logging]
export function printBanner(): void {
console.log(`
${colors.cyan}╭─────────────────────────────────────────────╮
│ ${colors.reset}${colors.bold}Structured Data Extractor${colors.reset}${colors.cyan} │
│ ${colors.dim}JSON from Unstructured Text${colors.reset}${colors.cyan} │
╰─────────────────────────────────────────────╯${colors.reset}
`);
}
export function log(level: 'info' | 'warn' | 'error' | 'success', message: string): void {
const prefix = {
info: `${colors.cyan}ℹ${colors.reset}`,
warn: `${colors.yellow}⚠${colors.reset}`,
error: `${colors.red}✗${colors.reset}`,
success: `${colors.green}✓${colors.reset}`
};
console.log(`${prefix[level]} ${message}`);
}
// [end:logging]
// [start:validation]
/**
* Simple JSON Schema validator
* Returns array of validation errors (empty if valid)
*/
export function validateAgainstSchema(
data: unknown,
schema: JSONSchema,
path = ''
): ValidationError[] {
const errors: ValidationError[] = [];
if (data === null || data === undefined) {
if (schema.required?.length) {
errors.push({ field: path || 'root', message: 'Value is required' });
}
return errors;
}
// Type checking
if (schema.type === 'object' && typeof data === 'object' && !Array.isArray(data)) {
const obj = data as Record<string, unknown>;
// Check required fields
for (const field of schema.required || []) {
if (!(field in obj) || obj[field] === null || obj[field] === undefined) {
errors.push({
field: path ? `${path}.${field}` : field,
message: 'Required field missing'
});
}
}
// Validate properties
if (schema.properties) {
for (const [key, propSchema] of Object.entries(schema.properties)) {
if (key in obj) {
errors.push(
...validateAgainstSchema(obj[key], propSchema, path ? `${path}.${key}` : key)
);
}
}
}
} else if (schema.type === 'array' && Array.isArray(data)) {
if (schema.items) {
data.forEach((item, index) => {
errors.push(...validateAgainstSchema(item, schema.items!, `${path}[${index}]`));
});
}
} else if (schema.type === 'string') {
if (typeof data !== 'string') {
errors.push({ field: path, message: `Expected string, got ${typeof data}` });
} else if (schema.format === 'email' && !data.includes('@')) {
errors.push({ field: path, message: 'Invalid email format' });
} else if (schema.format === 'date' && isNaN(Date.parse(data))) {
errors.push({ field: path, message: 'Invalid date format' });
}
} else if (schema.type === 'number') {
if (typeof data !== 'number') {
errors.push({ field: path, message: `Expected number, got ${typeof data}` });
} else {
if (schema.minimum !== undefined && data < schema.minimum) {
errors.push({ field: path, message: `Value must be >= ${schema.minimum}` });
}
if (schema.maximum !== undefined && data > schema.maximum) {
errors.push({ field: path, message: `Value must be <= ${schema.maximum}` });
}
}
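} else if (schema.type === 'boolean' && typeof data !== 'boolean') {
errors.push({ field: path, message: `Expected boolean, got ${typeof data}` });
} else if (schema.type === 'object' || schema.type === 'array') {
// Reached only when the object/array branches above did not match the data
errors.push({ field: path || 'root', message: `Expected ${schema.type}, got ${Array.isArray(data) ? 'array' : typeof data}` });
}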
// Enum validation
if (schema.enum && !schema.enum.includes(data as string)) {
errors.push({ field: path, message: `Value must be one of: ${schema.enum.join(', ')}` });
}
return errors;
}
// [end:validation]
// [start:extract-data]
/**
* Extract structured data from text using LLM tool calling
*/
export async function extractData<T>(
config: ExtractorConfig,
text: string,
schema: JSONSchema,
schemaName: string,
maxRetries = 3
): Promise<ExtractionResult<T>> {
let attempts = 0;
let lastErrors: ValidationError[] = [];
let extractedData: unknown = null;
while (attempts < maxRetries) {
attempts++;
// Reset so a stale result from an earlier attempt is never re-validated
extractedData = null;
// Create tool registry with extraction tool
const registry = new ToolRegistry();
registry.registerTool(
'extract_data',
async (args: Record<string, unknown>) => {
extractedData = args;
return { success: true, data: args };
},
{
description: `Extract ${schemaName} from the provided text. Call this tool with the extracted data.`,
parameters: schema
}
);
// Build prompt
let prompt = `Extract the following information from the text below and call the extract_data tool with the result.
Text:
"""
${text}
"""
Extract all ${schemaName} information you can find. For optional fields, only include them if the information is clearly present.`;
// Add error feedback for retries
if (lastErrors.length > 0) {
prompt += `\n\nPrevious extraction had these validation errors, please fix them:
${lastErrors.map((e) => `- ${e.field}: ${e.message}`).join('\n')}`;
}
log('info', `Attempt ${attempts}/${maxRetries}: Extracting ${schemaName}...`);
// Create extraction client with tool
const extractionClient = new ChatClient({
provider: config.provider,
model: config.model,
apiKey: config.apiKey,
tools: registry
});
try {
await extractionClient.chat(prompt);
// Validate extracted data
if (extractedData) {
const errors = validateAgainstSchema(extractedData, schema);
if (errors.length === 0) {
log('success', `Extraction successful on attempt ${attempts}`);
return {
data: extractedData as T,
attempts,
raw: extractedData
};
}
lastErrors = errors;
log('warn', `Validation failed: ${errors.map((e) => e.message).join(', ')}`);
} else {
lastErrors = [{ field: 'root', message: 'No data extracted' }];
log('warn', 'No data was extracted');
}
} catch (error) {
log('error', `Extraction error: ${error instanceof Error ? error.message : String(error)}`);
lastErrors = [{ field: 'root', message: String(error) }];
}
}
throw new Error(
`Failed to extract valid ${schemaName} after ${maxRetries} attempts. Last errors: ${lastErrors.map((e) => `${e.field}: ${e.message}`).join(', ')}`
);
}
// [end:extract-data]
// [start:contact-schema]
export const contactSchema: JSONSchema = {
type: 'object',
properties: {
name: { type: 'string', description: 'Full name of the person' },
email: { type: 'string', format: 'email', description: 'Email address' },
phone: { type: 'string', description: 'Phone number' },
company: { type: 'string', description: 'Company or organization' },
role: { type: 'string', description: 'Job title or role' }
},
required: ['name', 'email']
};
export interface Contact {
name: string;
email: string;
phone?: string;
company?: string;
role?: string;
}
// [end:contact-schema]
// [start:invoice-schema]
export const invoiceSchema: JSONSchema = {
type: 'object',
properties: {
invoiceNumber: { type: 'string', description: 'Invoice or reference number' },
date: { type: 'string', format: 'date', description: 'Invoice date (YYYY-MM-DD)' },
vendor: { type: 'string', description: 'Vendor or seller name' },
items: {
type: 'array',
description: 'Line items on the invoice',
items: {
type: 'object',
properties: {
description: { type: 'string', description: 'Item description' },
quantity: { type: 'number', minimum: 0, description: 'Quantity' },
unitPrice: { type: 'number', minimum: 0, description: 'Price per unit' },
total: { type: 'number', minimum: 0, description: 'Line total' }
},
required: ['description', 'quantity', 'unitPrice']
}
},
subtotal: { type: 'number', minimum: 0, description: 'Subtotal before tax' },
tax: { type: 'number', minimum: 0, description: 'Tax amount' },
total: { type: 'number', minimum: 0, description: 'Total amount due' }
},
required: ['invoiceNumber', 'date', 'items', 'total']
};
export interface Invoice {
invoiceNumber: string;
date: string;
vendor?: string;
items: Array<{
description: string;
quantity: number;
unitPrice: number;
total?: number;
}>;
subtotal?: number;
tax?: number;
total: number;
}
// [end:invoice-schema]
// [start:event-schema]
export const eventSchema: JSONSchema = {
type: 'object',
properties: {
title: { type: 'string', description: 'Event title or name' },
date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
location: { type: 'string', description: 'Event location or venue' },
description: { type: 'string', description: 'Event description' },
organizer: { type: 'string', description: 'Event organizer' },
attendees: {
type: 'array',
description: 'List of attendees',
items: { type: 'string' }
}
},
required: ['title', 'date']
};
export interface CalendarEvent {
title: string;
date: string;
time?: string;
location?: string;
description?: string;
organizer?: string;
attendees?: string[];
}
// [end:event-schema]
// [start:sample-texts]
export const sampleTexts = {
contact: `
Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
She's the Director of Engineering at TechCorp Inc. You can reach her
at sarah.chen@techcorp.io or call her office at (555) 123-4567.
She's really excited to collaborate on the upcoming project!
`,
invoice: `
INVOICE #INV-2024-0042
Date: January 3, 2026
From: Acme Supplies Ltd.
Items:
- Widget Pro (x5) @ $29.99 each = $149.95
- Super Gadget (x2) @ $89.50 each = $179.00
- Premium Cable (x10) @ $12.00 each = $120.00
Subtotal: $448.95
Tax (8%): $35.92
TOTAL DUE: $484.87
`,
event: `
You're invited to the Annual Tech Summit 2026!
Join us on March 15, 2026 at 9:00 AM at the Grand Convention Center
in downtown San Francisco. This year's theme is "AI in Practice".
Hosted by the Bay Area Tech Association, we'll have speakers from
Google, Meta, and OpenAI. Expected attendees include John Smith,
Jane Doe, and Bob Wilson from our team.
Don't miss this opportunity to network and learn!
`
};
// [end:sample-texts]
// [start:create-config]
export function createExtractorConfig(): ExtractorConfig {
return {
provider: process.env.OPENAI_API_KEY ? 'openai' : 'anthropic',
model: process.env.OPENAI_API_KEY ? 'gpt-4o-mini' : 'claude-3-haiku-20240307',
apiKey: process.env.OPENAI_API_KEY || process.env.ANTHROPIC_API_KEY
};
}
// [end:create-config]
// [start:usage-example]
/**
* Example usage of the structured data extractor:
*
* ```typescript
* const config = createExtractorConfig();
*
* const result = await extractData<Contact>(
* config,
* 'Sarah Chen is Director at TechCorp, sarah@techcorp.io',
* contactSchema,
* 'contact'
* );
*
* console.log(result.data);
* // { name: "Sarah Chen", email: "sarah@techcorp.io", ... }
* ```
*/
// [end:usage-example]
Extract Data
The extractData function handles the entire extraction flow:
typescript
/**
* Extract structured data from text using LLM tool calling
*/
export async function extractData<T>(
config: ExtractorConfig,
text: string,
schema: JSONSchema,
schemaName: string,
maxRetries = 3
): Promise<ExtractionResult<T>> {
let attempts = 0;
let lastErrors: ValidationError[] = [];
let extractedData: unknown = null;
while (attempts < maxRetries) {
attempts++;
// Reset so a stale result from an earlier attempt is never re-validated
extractedData = null;
// Create tool registry with extraction tool
const registry = new ToolRegistry();
registry.registerTool(
'extract_data',
async (args: Record<string, unknown>) => {
extractedData = args;
return { success: true, data: args };
},
{
description: `Extract ${schemaName} from the provided text. Call this tool with the extracted data.`,
parameters: schema
}
);
// Build prompt
let prompt = `Extract the following information from the text below and call the extract_data tool with the result.
Text:
"""
${text}
"""
Extract all ${schemaName} information you can find. For optional fields, only include them if the information is clearly present.`;
// Add error feedback for retries
if (lastErrors.length > 0) {
prompt += `\n\nPrevious extraction had these validation errors, please fix them:
${lastErrors.map((e) => `- ${e.field}: ${e.message}`).join('\n')}`;
}
log('info', `Attempt ${attempts}/${maxRetries}: Extracting ${schemaName}...`);
// Create extraction client with tool
const extractionClient = new ChatClient({
provider: config.provider,
model: config.model,
apiKey: config.apiKey,
tools: registry
});
try {
await extractionClient.chat(prompt);
// Validate extracted data
if (extractedData) {
const errors = validateAgainstSchema(extractedData, schema);
if (errors.length === 0) {
log('success', `Extraction successful on attempt ${attempts}`);
return {
data: extractedData as T,
attempts,
raw: extractedData
};
}
lastErrors = errors;
log('warn', `Validation failed: ${errors.map((e) => e.message).join(', ')}`);
} else {
lastErrors = [{ field: 'root', message: 'No data extracted' }];
log('warn', 'No data was extracted');
}
} catch (error) {
log('error', `Extraction error: ${error instanceof Error ? error.message : String(error)}`);
lastErrors = [{ field: 'root', message: String(error) }];
}
}
throw new Error(
`Failed to extract valid ${schemaName} after ${maxRetries} attempts. Last errors: ${lastErrors.map((e) => `${e.field}: ${e.message}`).join(', ')}`
);
}
Usage
typescript
// [start:event-schema]
export const eventSchema: JSONSchema = {
type: 'object',
properties: {
title: { type: 'string', description: 'Event title or name' },
date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
location: { type: 'string', description: 'Event location or venue' },
description: { type: 'string', description: 'Event description' },
organizer: { type: 'string', description: 'Event organizer' },
attendees: {
type: 'array',
description: 'List of attendees',
items: { type: 'string' }
}
},
required: ['title', 'date']
};
export interface CalendarEvent {
title: string;
date: string;
time?: string;
location?: string;
description?: string;
organizer?: string;
attendees?: string[];
}
// [end:event-schema]
// [start:sample-texts]
export const sampleTexts = {
contact: `
Hey there! I wanted to introduce you to our new team lead, Sarah Chen.
She's the Director of Engineering at TechCorp Inc. You can reach her
at sarah.chen@techcorp.io or call her office at (555) 123-4567.
She's really excited to collaborate on the upcoming project!
`,
invoice: `
INVOICE #INV-2024-0042
Date: January 3, 2026
From: Acme Supplies Ltd.
Items:
- Widget Pro (x5) @ $29.99 each = $149.95
- Super Gadget (x2) @ $89.50 each = $179.00
- Premium Cable (x10) @ $12.00 each = $120.00
Subtotal: $448.95
Tax (8%): $35.92
TOTAL DUE: $484.87
`,
event: `
You're invited to the Annual Tech Summit 2026!
Join us on March 15, 2026 at 9:00 AM at the Grand Convention Center
in downtown San Francisco. This year's theme is "AI in Practice".
Hosted by the Bay Area Tech Association, we'll have speakers from
Google, Meta, and OpenAI. Expected attendees include John Smith,
Jane Doe, and Bob Wilson from our team.
Don't miss this opportunity to network and learn!
`
};
// [end:sample-texts]
// [start:create-config]
export function createExtractorConfig(): ExtractorConfig {
return {
provider: process.env.OPENAI_API_KEY ? 'openai' : 'anthropic',
model: process.env.OPENAI_API_KEY ? 'gpt-4o-mini' : 'claude-3-haiku-20240307',
apiKey: process.env.OPENAI_API_KEY || process.env.ANTHROPIC_API_KEY
};
}
// [end:create-config]
// [start:usage-example]
/**
* Example usage of the structured data extractor:
*
* ```typescript
* const config = createExtractorConfig();
*
* const result = await extractData<Contact>(
* config,
* 'Sarah Chen is Director at TechCorp, sarah@techcorp.io',
* contactSchema,
* 'contact'
* );
*
* console.log(result.data);
* // { name: "Sarah Chen", email: "sarah@techcorp.io", ... }
* ```
*/
// [end:usage-example]
More Schema Examples
Invoice
typescript
export const invoiceSchema: JSONSchema = {
type: 'object',
properties: {
invoiceNumber: { type: 'string', description: 'Invoice or reference number' },
date: { type: 'string', format: 'date', description: 'Invoice date (YYYY-MM-DD)' },
vendor: { type: 'string', description: 'Vendor or seller name' },
items: {
type: 'array',
description: 'Line items on the invoice',
items: {
type: 'object',
properties: {
description: { type: 'string', description: 'Item description' },
quantity: { type: 'number', minimum: 0, description: 'Quantity' },
unitPrice: { type: 'number', minimum: 0, description: 'Price per unit' },
total: { type: 'number', minimum: 0, description: 'Line total' }
},
required: ['description', 'quantity', 'unitPrice']
}
},
subtotal: { type: 'number', minimum: 0, description: 'Subtotal before tax' },
tax: { type: 'number', minimum: 0, description: 'Tax amount' },
total: { type: 'number', minimum: 0, description: 'Total amount due' }
},
required: ['invoiceNumber', 'date', 'items', 'total']
};
export interface Invoice {
invoiceNumber: string;
date: string;
vendor?: string;
items: Array<{
description: string;
quantity: number;
unitPrice: number;
total?: number;
}>;
subtotal?: number;
tax?: number;
total: number;
}
Event
typescript
export const eventSchema: JSONSchema = {
type: 'object',
properties: {
title: { type: 'string', description: 'Event title or name' },
date: { type: 'string', format: 'date', description: 'Event date (YYYY-MM-DD)' },
time: { type: 'string', description: 'Event time (e.g., 2:00 PM)' },
location: { type: 'string', description: 'Event location or venue' },
description: { type: 'string', description: 'Event description' },
organizer: { type: 'string', description: 'Event organizer' },
attendees: {
type: 'array',
description: 'List of attendees',
items: { type: 'string' }
}
},
required: ['title', 'date']
};
export interface CalendarEvent {
title: string;
date: string;
time?: string;
location?: string;
description?: string;
organizer?: string;
attendees?: string[];
}
Validation
The built-in validator checks:
| Check | Description |
|---|---|
| Required fields | Every field listed in required must be present and non-null |
| Type checking | string, number, and boolean values must match the declared type |
| Format validation | email must contain @; date must be parseable by Date.parse |
| Range validation | minimum and maximum bounds for numbers |
| Enum validation | Value must appear in the enum list |
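The date check relies on Date.parse, which is lenient and in most JavaScript engines accepts strings such as "January 3, 2026", so it does not enforce the YYYY-MM-DD form the schema descriptions ask for. A minimal sketch of a stricter check follows; `isIsoDate` is a hypothetical helper, not part of the recipe.

```typescript
// Hypothetical helper (not part of the recipe): accepts only real
// calendar dates written as YYYY-MM-DD.
export function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Build the date in UTC and confirm no component rolled over
  // (for example, February 30 would become March 2).
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

One way to use it would be to replace the Date.parse test in the validator's date branch with a call to this helper, so that a loosely formatted date triggers a retry.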
Retry with Feedback
When validation fails, the errors are appended to the next prompt:
Previous extraction had these validation errors, please fix them:
- email: Invalid email format
- date: Invalid date format
This feedback helps the LLM correct its mistakes on subsequent attempts.
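The feedback step can be sketched in isolation. The function below mirrors the prompt-building logic in extractData; the name `appendErrorFeedback` is an assumption made for illustration, since the recipe builds the string inline.

```typescript
interface ValidationError {
  field: string;
  message: string;
}

// Hypothetical standalone version of the inline retry-feedback logic:
// returns the prompt unchanged when there are no errors, otherwise
// appends one bullet per validation error.
export function appendErrorFeedback(prompt: string, errors: ValidationError[]): string {
  if (errors.length === 0) return prompt;
  const lines = errors.map((e) => `- ${e.field}: ${e.message}`).join('\n');
  return `${prompt}\n\nPrevious extraction had these validation errors, please fix them:\n${lines}`;
}
```

Keeping the field path in each bullet matters for nested schemas, because the validator reports paths such as items[0].quantity that tell the LLM which value to correct.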
Environment Variables
| Variable | Required | Description |
|---|---|---|
| OPENAI_API_KEY | One of these required | OpenAI API key |
| ANTHROPIC_API_KEY | One of these required | Anthropic API key |
At least one API key is required. If both are set, createExtractorConfig uses OpenAI, which is recommended for tool calling.
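As written, createExtractorConfig falls back to the Anthropic provider with an undefined key when neither variable is set, so the failure only surfaces on the first API call. A minimal sketch of a fail-fast alternative follows; `resolveProvider` and its return type are assumptions for illustration, and the model names are the ones the recipe already uses.

```typescript
type Env = Record<string, string | undefined>;

interface ResolvedProvider {
  provider: 'openai' | 'anthropic';
  model: string;
  apiKey: string;
}

// Hypothetical helper: picks a provider from the environment and
// throws immediately when no API key is available.
export function resolveProvider(env: Env): ResolvedProvider {
  const openaiKey = env['OPENAI_API_KEY'];
  if (openaiKey) {
    return { provider: 'openai', model: 'gpt-4o-mini', apiKey: openaiKey };
  }
  const anthropicKey = env['ANTHROPIC_API_KEY'];
  if (anthropicKey) {
    return { provider: 'anthropic', model: 'claude-3-haiku-20240307', apiKey: anthropicKey };
  }
  throw new Error('Set OPENAI_API_KEY or ANTHROPIC_API_KEY before running the extractor.');
}
```

In a Node.js script this would typically be called as resolveProvider(process.env); taking the environment as a parameter keeps the function easy to test.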
Use Cases
Data Entry Automation
typescript
// Extract contact info from business cards, emails, signatures
const contact = await extractData<Contact>(config, emailSignature, contactSchema, 'contact');
Document Processing
typescript
// Extract invoice data from the OCR text of scanned documents
const invoice = await extractData<Invoice>(config, ocrText, invoiceSchema, 'invoice');
Event Parsing
typescript
// Extract event details from meeting invitations
const event = await extractData<CalendarEvent>(config, inviteText, eventSchema, 'event');
Custom Schemas
typescript
// Define any schema for your domain
const productSchema: JSONSchema = {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number', minimum: 0 },
category: { type: 'string', enum: ['electronics', 'clothing', 'food'] }
},
required: ['name', 'price']
};
// Matching TypeScript type for the extracted result
interface Product {
name: string;
price: number;
category?: 'electronics' | 'clothing' | 'food';
}
const product = await extractData<Product>(config, description, productSchema, 'product');