Intelligent table analysis and JSON extraction with automatic domain detection
Transform complex document tables into structured, semantic JSON data with zero configuration. This library automatically understands financial statements, healthcare records, manufacturing inventories, and more.
- Overview
- Key Concepts
- Architecture
- The Pipeline
- Quick Start
- Domain Configurations
- Sample Usage Scenarios
- Advanced Features
- Component Reference
- Best Practices
- Troubleshooting
Document tables are everywhere—financial reports, patient records, inventory sheets—but extracting their data programmatically is challenging:
- Headers are ambiguous: "Patient ID" vs "MRN" vs "Medical Record Number"
- Tables vary by industry: Financial tables look nothing like manufacturing inventories
- Structure is inconsistent: Merged cells, multi-row headers, title rows
- Manual mapping is tedious: Maintaining header-to-field mappings for every variation
This library provides semantic table understanding:
// Before: Complex parsing logic for each table type
// After: Automatic understanding
var analyzer = new TableSemanticAnalyzer();
var result = analyzer.Analyze(table);
// Automatically knows: "MRN" → patient_id, "SKU" → product_id, etc.
Key Innovation: The system automatically detects the table's domain (Financial, Healthcare, Manufacturing, Generic) and applies industry-specific knowledge to understand headers, identify key columns, and structure the data correctly.
Tables are analyzed within domain contexts:
| Domain | Examples | Keywords |
|---|---|---|
| Financial | P&L, Balance Sheet, Credit Risk | counterparty, EBITDA, assets, revenue |
| Healthcare | Patient Records, Lab Results | patient, MRN, diagnosis, ICD |
| Manufacturing | Inventory, BOM, Production | SKU, quantity, warehouse, batch |
| Generic | Any other table | name, date, amount, status |
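The domain contexts above drive a keyword-scoring detection step. As a language-agnostic illustration, the idea can be sketched in Python; the keyword lists, weights, and confidence formula here are simplified assumptions, not the library's actual configuration:

```python
# Simplified sketch of keyword-based domain detection.
# Real keyword lists are much larger (150+ for Financial, etc.).
DOMAIN_KEYWORDS = {
    "Financial": {"counterparty", "ebitda", "assets", "revenue", "exposure"},
    "Healthcare": {"patient", "mrn", "diagnosis", "icd"},
    "Manufacturing": {"sku", "quantity", "warehouse", "batch"},
    "Generic": {"name", "date", "amount", "status"},
}

def detect_domain(headers):
    """Score each domain by keyword hits and return the best match."""
    text = " ".join(h.lower() for h in headers)
    scores = {
        domain: sum(1 for kw in keywords if kw in text)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "Generic", 0.0  # no keyword matched anywhere
    # Confidence: share of this domain's keywords that matched.
    return best, scores[best] / len(DOMAIN_KEYWORDS[best])

best, conf = detect_domain(["Counterparty", "Exposure EUR", "Rating"])
# best -> "Financial"
```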
Headers are automatically normalized:
"Exposure EUR" → exposure_amount_eur
"Patient ID" → patient_id
"Part Number" → product_id
"Util %" → utilization_percent
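The normalization pipeline behind these examples (lowercase, symbol replacement, snake_case, domain synonyms) can be sketched in Python; the symbol map and synonym table below are illustrative stand-ins for the library's domain-specific tables:

```python
import re

# Illustrative symbol and synonym tables (the library's are per-domain).
SYMBOLS = {"€": "eur", "%": "percent", "$": "usd"}
SYNONYMS = {
    "exposure_eur": "exposure_amount_eur",
    "part_number": "product_id",
    "util_percent": "utilization_percent",
}

def canonicalize(header):
    """Lowercase, expand symbols, snake_case, then apply synonyms."""
    text = header.lower()
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    text = re.sub(r"[^a-z0-9]+", "_", text).strip("_")
    return SYNONYMS.get(text, text)

# canonicalize("Exposure EUR") -> "exposure_amount_eur"
# canonicalize("Patient ID")   -> "patient_id"
```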
Columns are automatically typed:
- identifier: Patient ID, SKU, Counterparty
- amount: Monetary values
- percentage: Utilization %, completion rates
- date: Dates, maturity dates
- text: Descriptions, names
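Type inference combines header hints with data sampling. A simplified Python sketch (keyword lists and the numeric threshold are assumptions; the library samples the first 10 data rows in a similar spirit):

```python
import re

def infer_type(header, samples):
    """Infer a semantic type from the header text and sampled values."""
    h = header.lower()
    if "%" in header or "percent" in h:
        return "percentage"
    if "date" in h:
        return "date"
    if any(k in h for k in ("id", "sku", "mrn", "counterparty")):
        return "identifier"
    # Fall back to sampling: mostly numeric values suggest an amount.
    numeric = sum(bool(re.fullmatch(r"[\d.,]+", s.strip())) for s in samples)
    if samples and numeric / len(samples) > 0.8:
        return "amount"
    return "text"

# infer_type("Exposure EUR", ["1,250,000", "900,000"]) -> "amount"
```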
Summary rows are automatically identified and separated:
{
"Rows": [ /* data rows */ ],
"TotalRows": {
"subtotal": { ... },
"total": { ... }
}
}
┌─────────────────────────────────────────────────────────────────┐
│ TX Text Control │
│ Document/Table │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TableSemanticAnalyzer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 1. Domain Detection (DomainDetector) │ │
│ │ • Analyzes headers │ │
│ │ • Scores each domain │ │
│ │ • Selects best match │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 2. Header Row Detection │ │
│ │ • Identifies header rows │ │
│ │ • Distinguishes title rows from headers │ │
│ │ • Handles merged cells │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 3. Column Analysis │ │
│ │ • Extracts header text │ │
│ │ • Canonicalizes names (HeaderCanonicalizer) │ │
│ │ • Infers semantic types │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 4. Key Column Detection │ │
│ │ • Identifies identifier columns │ │
│ │ • Analyzes uniqueness and data types │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
▼ TableSemanticAnalysisResult
│ • DetectedDomain
│ • HeaderRowNumbers
│ • Columns (with canonical names)
│ • KeyColumnNumber
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TableJsonExtractor │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 1. Row Processing │ │
│ │ • Iterates data rows │ │
│ │ • Extracts cell values (handles merged cells) │ │
│ │ • Identifies total rows │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 2. Value Conversion │ │
│ │ • Parses amounts (removes currency symbols) │ │
│ │ • Parses percentages │ │
│ │ • Formats dates │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 3. JSON Structure Building │ │
│ │ • Maps to canonical column names │ │
│ │ • Groups regular rows vs. total rows │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
▼ ExtractedTableData (JSON)
Analyzes table content and selects the appropriate domain configuration.
Input: TX Text Control Table object
Output: Domain name, configuration, and confidence score
Define industry-specific knowledge:
- Header keywords
- Synonym mappings
- Identifier patterns
- Total row keywords
Converts header text to standardized names:
- Removes special characters
- Converts to snake_case
- Applies domain synonyms
Core analysis engine that:
- Detects headers
- Identifies structure
- Infers semantics
- Applies domain knowledge
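Part of this structural analysis is picking the key column. The heuristic (text ratio, uniqueness, left-position bias, identifier-header boost) can be sketched in Python; all weights below are illustrative assumptions:

```python
def score_key_column(position, header, values):
    """Score a column's likelihood of being the primary identifier."""
    def is_text(v):
        return not v.replace(",", "").replace(".", "").isdigit()
    text_ratio = sum(is_text(v) for v in values) / len(values)
    uniqueness = len(set(values)) / len(values)
    score = 40 * text_ratio + 40 * uniqueness
    if position <= 2:
        score += 10  # left-position bias
    if any(k in header.lower() for k in ("id", "sku", "name")):
        score += 20  # identifier-header boost
    return score

cols = {
    1: ("Counterparty", ["Alpha AG", "Beta SA", "Gamma Ltd"]),
    2: ("Exposure EUR", ["1250000", "900000", "900000"]),
}
key = max(cols, key=lambda c: score_key_column(c, *cols[c]))
# key -> 1 (the Counterparty column)
```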
Converts analyzed tables to JSON:
- Extracts row data
- Handles merged cells
- Separates totals
- Formats values
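The "separates totals" step above can be sketched as a simple keyword scan over each row's cells; the keyword list and row shape here are simplified assumptions:

```python
# Any cell containing a total keyword routes the row into TotalRows,
# keyed by the keyword found; everything else stays a data row.
TOTAL_KEYWORDS = ("subtotal", "total", "average")

def split_rows(rows):
    data, totals = [], {}
    for row in rows:
        cells = " ".join(str(v).lower() for v in row.values())
        kind = next((k for k in TOTAL_KEYWORDS if k in cells), None)
        if kind:
            totals[kind] = row
        else:
            data.append(row)
    return data, totals

rows = [
    {"region": "EMEA", "amount": 1250000},
    {"region": "Grand Total", "amount": 5000000},
]
data, totals = split_rows(rows)
# data keeps the EMEA row; totals -> {"total": {...Grand Total...}}
```

Note that "subtotal" is checked before "total" so a subtotal row is not misfiled under "total".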
┌─────────────────────┐
│ Load Document │
│ (TX Text Control) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ For Each Table │
└──────────┬──────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 1: DOMAIN DETECTION │
│ │
│ Input: Table cells (first 3 rows) │
│ │
│ Process: │
│ 1. Extract header cell text │
│ 2. Score against each domain's keywords │
│ • Financial: "Counterparty" → +15 points │
│ • Healthcare: "Patient" → +15 points │
│ • Manufacturing: "SKU" → +15 points │
│ 3. Select highest scoring domain │
│ │
│ Output: DetectedDomain = "Financial" (0.85 confidence) │
└──────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 2: HEADER DETECTION │
│ │
│ Process: │
│ 1. Check for explicit header rows (IsHeader property) │
│ 2. If not found, score rows heuristically: │
│ • Contains domain keywords (+points) │
│ • Mostly text, not numbers (+points) │
│ • Has formatting (shaded, bordered) (+points) │
│ • Row 2 preferred over row 1 (+bonus) │
│ • Single merged cell = title row (-penalty) │
│ 3. Select best scoring row │
│ │
│ Output: HeaderRowNumbers = [2] │
└──────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 3: COLUMN ANALYSIS │
│ │
│ For each column: │
│ 1. Extract header text from detected header row │
│ "Exposure EUR" │
│ │
│ 2. Auto-canonicalize: │
│ • Convert to lowercase │
│ • Replace special chars (€→eur, %→percent) │
│ • Convert to snake_case: "exposure_eur" │
│ │
│ 3. Apply domain synonyms: │
│ "exposure_eur" → "exposure_amount_eur" │
│ │
│ 4. Infer semantic type: │
│ • Check header keywords (% → percentage) │
│ • Sample first 10 data rows │
│ • Detect patterns (numbers → amount, dates → date) │
│ Type = "amount" │
│ │
│ Output: Column { │
│ HeaderText: "Exposure EUR", │
│ CanonicalName: "exposure_amount_eur", │
│ SemanticType: "amount" │
│ } │
└──────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 4: KEY COLUMN DETECTION │
│ │
│ Process: │
│ 1. For each column, analyze data rows: │
│ • Text ratio (non-numeric values) │
│ • Uniqueness ratio (distinct values) │
│ • Left position bias (column 1 or 2) │
│ • Semantic boost (identifier headers) │
│ 2. Select column with highest score │
│ │
│ Output: KeyColumnNumber = 1 (Counterparty) │
└──────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 5: DATA EXTRACTION │
│ │
│ For each row (starting after header): │
│ 1. Extract cell values: │
│ • Direct cell → use value │
│ • Merged cell → search for anchor │
│ • Empty cell → null │
│ │
│ 2. Check if total row: │
│ • Scan all columns for total keywords │
│ • "Total", "Subtotal", "Average", etc. │
│ │
│ 3. Convert values by semantic type: │
│ • amount: Parse decimal, remove €/$ │
│ • percentage: Parse decimal, remove % │
│ • date: Format as yyyy-MM-dd │
│ • text: Keep as-is │
│ │
│ 4. Map to canonical column names: │
│ { │
│ "counterparty": "Alpha AG", │
│ "exposure_amount_eur": 1250000, │
│ "rating": "A" │
│ } │
│ │
│ Output: Rows[] or TotalRows{} │
└──────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ FINAL OUTPUT: JSON │
│ │
│ { │
│ "DetectedDomain": "Financial", │
│ "DomainConfidence": 0.85, │
│ "HeaderRow": ["Counterparty", "Exposure EUR", ...], │
│ "Columns": [ ... ], │
│ "Rows": [ ... ], │
│ "TotalRows": { ... } │
│ } │
└───────────────────────────────────────────────────────────┘
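The Phase 5 value conversion above can be sketched in Python. The currency symbols stripped and the input date format are illustrative assumptions, not the library's actual parsing rules:

```python
import re
from datetime import datetime

def convert(raw, semantic_type):
    """Convert a raw cell string according to its semantic type."""
    raw = raw.strip()
    if raw == "":
        return None  # empty cell -> null
    if semantic_type == "amount":
        # Strip currency symbols and thousands separators.
        return float(re.sub(r"[€$,\s]", "", raw))
    if semantic_type == "percentage":
        return float(raw.replace("%", "").strip())
    if semantic_type == "date":
        # Assumes day.month.year input; emits ISO yyyy-MM-dd.
        return datetime.strptime(raw, "%d.%m.%Y").strftime("%Y-%m-%d")
    return raw  # text: keep as-is

# convert("€ 1,250,000", "amount") -> 1250000.0
# convert("85 %", "percentage")    -> 85.0
```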
# Add TX Text Control reference (prerequisite)
# Add this project to your solution
using System;
using TextControl.TableDetection;
using TextControl.TableDetection.Domain;
using TXTextControl;
// Load document
using var tx = new ServerTextControl();
tx.Create();
tx.Load("financial-report.docx", StreamType.WordprocessingML);
Console.WriteLine($"Tables detected: {tx.Tables.Count}");
if (tx.Tables.Count == 0)
{
Console.WriteLine("No tables found.");
return;
}
// Mode 1: Auto-detect domain (recommended)
var options = new TableSemanticAnalysisOptions
{
SkipNestedTables = true,
MaxHeaderRowsToInspect = 3,
AutoDetectDomain = true,
Domain = null // Let it auto-detect
};
// Mode 2: Explicit domain (uncomment to use)
// var options = new TableSemanticAnalysisOptions
// {
// AutoDetectDomain = false,
// Domain = new FinancialDomainConfiguration()
// };
var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();
foreach (Table table in tx.Tables)
{
// Analyze table structure and semantics
var analysis = analyzer.Analyze(table);
// Extract data to JSON
var data = extractor.Extract(table, analysis, options);
// Output results
Console.WriteLine(new string('=', 80));
Console.WriteLine($"Table ID: {data.TableId}");
Console.WriteLine($"Detected Domain: {analysis.DetectedDomain} (Confidence: {analysis.DomainConfidence:P0})");
Console.WriteLine();
Console.WriteLine(data.ToJson());
Console.WriteLine();
}
Output:
================================================================================
Table ID: 0
Detected Domain: Financial (Confidence: 85%)
{
"TableId": 0,
"DetectedDomain": "Financial",
"TableName": "Credit Exposure Summary",
"HeaderRow": ["Counterparty", "Exposure EUR", "Rating"],
"Rows": [
{ "counterparty": "Alpha AG", "exposure_amount_eur": 1250000, "rating": "A" }
]
}
Best for: P&L statements, balance sheets, credit risk tables, financial reports
Keywords (150+):
- Credit/Risk: counterparty, exposure, rating, limit, utilization
- P&L: revenue, COGS, EBITDA, operating profit, net income
- Balance Sheet: assets, liabilities, equity, cash, inventory
- Periods: YTD, QTD, actual, budget, variance
Example:
| Counterparty | Exposure EUR | Rating | Limit |
|--------------|--------------|--------|-----------|
| Alpha AG | 1,250,000 | A | 2,000,000 |
→ Detects as Financial (85% confidence)
Best for: Patient records, clinical data, diagnoses, medications, lab results
Keywords (80+):
- Identifiers: patient, MRN, medical record number
- Clinical: diagnosis, ICD, CPT, procedure, medication
- Visits: encounter, admission, discharge, LOS
- Providers: physician, doctor, department
Example:
| Patient ID | MRN | Diagnosis | Physician |
|------------|--------|---------------------|-------------|
| P001 | 123456 | Type 2 Diabetes | Dr. Smith |
→ Detects as Healthcare (92% confidence)
Best for: Inventory, supply chain, production, parts management
Keywords (90+):
- Products: SKU, part number, item
- Inventory: quantity, stock, warehouse
- Supply Chain: supplier, vendor, manufacturer
- Production: batch, lot, production date
📖 Complete Manufacturing Guide →
Example:
| SKU | Description | Qty | Warehouse | Supplier |
|------------|----------------|------|-----------|------------|
| P-001-2024 | Steel Bolt M8 | 5000 | WH-01 | FastCo |
→ Detects as Manufacturing (88% confidence)
Best for: General-purpose tables, mixed data, custom business tables
Keywords (30+):
- Common: name, id, description, date, amount
- Status: status, type, category
- Numeric: quantity, value, count, total
Example:
| Employee ID | Name | Department | Status |
|-------------|-------------|------------|---------|
| E001 | John Smith | IT | Active |
→ Detects as Generic (65% confidence)
Use Case: Extract P&L data from Word document for analysis
using var tx = new ServerTextControl();
tx.Load("Q4-2024-Financial-Report.docx", StreamType.WordprocessingML);
var options = new TableSemanticAnalysisOptions
{
AutoDetectDomain = true
};
var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();
foreach (Table table in tx.Tables)
{
var analysis = analyzer.Analyze(table);
// Only process financial tables
if (analysis.DetectedDomain == "Financial" &&
analysis.DomainConfidence > 0.7)
{
var data = extractor.Extract(table, analysis, options);
// Extract specific metrics
foreach (var row in data.Rows)
{
if (row.TryGetValue("revenue", out var revenue))
{
Console.WriteLine($"Revenue: {revenue}");
}
if (row.TryGetValue("ebitda", out var ebitda))
{
Console.WriteLine($"EBITDA: {ebitda}");
}
}
// Check totals
if (data.TotalRows?.TryGetValue("total", out var totals) == true)
{
Console.WriteLine("Total Row:");
Console.WriteLine(JsonSerializer.Serialize(totals, new JsonSerializerOptions
{
WriteIndented = true
}));
}
}
}
Use Case: Extract patient records for system migration
using var tx = new ServerTextControl();
tx.Load("patient-census.tx", StreamType.InternalUnicodeFormat);
// Force Healthcare domain for consistency
var options = new TableSemanticAnalysisOptions
{
AutoDetectDomain = false,
Domain = new HealthcareDomainConfiguration()
};
var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();
var patients = new List<PatientRecord>();
foreach (Table table in tx.Tables)
{
var analysis = analyzer.Analyze(table);
var data = extractor.Extract(table, analysis, options);
// Map to domain model
foreach (var row in data.Rows)
{
patients.Add(new PatientRecord
{
PatientId = row.GetValueOrDefault("patient_id")?.ToString(),
MRN = row.GetValueOrDefault("patient_id")?.ToString(), // Synonym mapped
DateOfBirth = row.GetValueOrDefault("date_of_birth")?.ToString(),
Physician = row.GetValueOrDefault("physician")?.ToString()
});
}
}
// Export to database
await SavePatientsAsync(patients);
Use Case: Extract inventory data for warehouse system
using var tx = new ServerTextControl();
tx.Load("inventory-report.docx", StreamType.WordprocessingML);
var analyzer = new TableSemanticAnalyzer(); // Auto-detect
var extractor = new TableJsonExtractor();
foreach (Table table in tx.Tables)
{
var analysis = analyzer.Analyze(table);
if (analysis.DetectedDomain == "Manufacturing")
{
var data = extractor.Extract(table, analysis, analyzer.Options);
// Process inventory items
foreach (var row in data.Rows)
{
var item = new InventoryItem
{
SKU = row.GetValueOrDefault("product_id")?.ToString(),
Quantity = Convert.ToInt32(row.GetValueOrDefault("quantity")),
Warehouse = row.GetValueOrDefault("warehouse")?.ToString(),
Supplier = row.GetValueOrDefault("supplier")?.ToString()
};
// Update inventory system
await UpdateInventoryAsync(item);
}
}
}
Use Case: Process document with tables from multiple domains
using var tx = new ServerTextControl();
tx.Load("annual-report.docx", StreamType.WordprocessingML);
var analyzer = new TableSemanticAnalyzer();
var extractor = new TableJsonExtractor();
var results = new Dictionary<string, List<ExtractedTableData>>();
foreach (Table table in tx.Tables)
{
var analysis = analyzer.Analyze(table);
var data = extractor.Extract(table, analysis, analyzer.Options);
// Group by detected domain
if (!results.ContainsKey(analysis.DetectedDomain))
{
results[analysis.DetectedDomain] = new List<ExtractedTableData>();
}
results[analysis.DetectedDomain].Add(data);
// Log detection
Console.WriteLine($"Table {table.ID}: {analysis.DetectedDomain} " +
$"({analysis.DomainConfidence:P0})");
}
// Process by domain
if (results.ContainsKey("Financial"))
{
ProcessFinancialTables(results["Financial"]);
}
if (results.ContainsKey("Manufacturing"))
{
ProcessInventoryTables(results["Manufacturing"]);
}
Use Case: Create domain for retail industry
// Define custom retail domain
public class RetailDomainConfiguration : IDomainConfiguration
{
public ISet<string> HeaderKeywords { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
"product", "sku", "barcode", "upc",
"price", "cost", "margin", "discount",
"store", "location", "sales", "units sold",
"category", "brand", "supplier"
};
public ISet<string> TotalRowKeywords { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
"total", "subtotal", "grand total"
};
public IDictionary<string, string> HeaderSynonyms { get; } = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
["upc"] = "barcode",
["product code"] = "sku",
["retail price"] = "price",
["units"] = "quantity_sold"
};
public ISet<string> IdentifierHeaders { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
"product", "sku", "barcode", "upc"
};
}
// Use custom domain
var options = new TableSemanticAnalysisOptions
{
Domain = new RetailDomainConfiguration()
};
var analyzer = new TableSemanticAnalyzer(options);
Use Case: Process multiple documents with error handling
public async Task ProcessDocumentsAsync(string[] filePaths)
{
var analyzer = new TableSemanticAnalyzer();
var extractor = new TableJsonExtractor();
foreach (var filePath in filePaths)
{
try
{
using var tx = new ServerTextControl();
tx.Create();
tx.Load(filePath, StreamType.WordprocessingML);
Console.WriteLine($"\nProcessing: {Path.GetFileName(filePath)}");
Console.WriteLine($"Tables found: {tx.Tables.Count}");
foreach (Table table in tx.Tables)
{
try
{
var analysis = analyzer.Analyze(table);
var data = extractor.Extract(table, analysis, analyzer.Options);
// Save to database or file
await SaveTableDataAsync(filePath, table.ID, data);
Console.WriteLine($" ✓ Table {table.ID}: {analysis.DetectedDomain} " +
$"({data.Rows.Count} rows)");
}
catch (Exception ex)
{
Console.WriteLine($" ✗ Table {table.ID}: {ex.Message}");
// Log error but continue processing
await LogErrorAsync(filePath, table.ID, ex);
}
}
}
catch (Exception ex)
{
Console.WriteLine($"Failed to process {filePath}: {ex.Message}");
}
}
}
Use Case: Understand detection decisions for troubleshooting
var options = new TableSemanticAnalysisOptions
{
AutoDetectDomain = true
};
var analyzer = new TableSemanticAnalyzer(options);
var analysis = analyzer.Analyze(table);
// Review diagnostics
Console.WriteLine("=== Analysis Diagnostics ===");
foreach (var diagnostic in analysis.Diagnostics)
{
Console.WriteLine($" {diagnostic}");
}
// Output domain detection
Console.WriteLine($"\nDetected Domain: {analysis.DetectedDomain}");
Console.WriteLine($"Confidence: {analysis.DomainConfidence:P2}");
// Review detected structure
Console.WriteLine($"\nTable Structure:");
Console.WriteLine($" Name: {analysis.DetectedTableName}");
Console.WriteLine($" Header Rows: {string.Join(", ", analysis.HeaderRowNumbers)}");
Console.WriteLine($" Key Column: {analysis.KeyColumnNumber}");
Console.WriteLine($" Columns: {analysis.Columns.Count}");
// Review each column
Console.WriteLine($"\nColumn Details:");
foreach (var col in analysis.Columns)
{
Console.WriteLine($" {col.ColumnNumber}. {col.HeaderText}");
Console.WriteLine($" Canonical: {col.CanonicalName}");
Console.WriteLine($" Type: {col.SemanticType}");
Console.WriteLine($" Confidence: {col.Confidence:P0}");
}
Example Output:
=== Analysis Diagnostics ===
Auto-detected domain: Financial (confidence: 0.85)
Detected header row heuristically: 2
Detected key column: 1
Detected 5 columns
Detected Domain: Financial
Confidence: 85.00%
Table Structure:
Name: Credit Exposure Summary
Header Rows: 2
Key Column: 1
Columns: 5
Column Details:
1. Counterparty
Canonical: counterparty
Type: identifier
Confidence: 85%
2. Exposure EUR
Canonical: exposure_amount_eur
Type: amount
Confidence: 90%
...
The library safely handles merged cells without exceptions:
// Table with merged title row:
// ┌─────────────────────────────┐
// │ Credit Exposure Summary │ ← Single merged cell
// ├──────────┬─────────┬────────┤
// │ Name │ Amount │ Rating │ ← Actual headers
// └──────────┴─────────┴────────┘
var analysis = analyzer.Analyze(table);
// Correctly identifies row 2 as headers, not row 1
// InternalTitle = "Credit Exposure Summary"
// HeaderRowNumbers = [2]
How it works:
- Detects merged cells by catching TextEditorException on .Text access
- Penalizes single-cell rows in header detection
- Automatically identifies title rows vs. header rows
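The title-vs-header decision can be sketched as a scoring function over candidate rows; the weights and keyword list below are illustrative assumptions, not the library's actual values:

```python
def score_header_row(cells, keywords, row_number):
    """Score a row's likelihood of being the header row."""
    score = 0
    # Domain keyword hits in cell text.
    score += 15 * sum(any(k in c.lower() for k in keywords) for c in cells)
    # Mostly text, not numbers.
    score += 10 * sum(not c.replace(",", "").isdigit() for c in cells if c)
    if len(cells) == 1:
        score -= 50  # single merged cell: likely a title row
    if row_number == 2:
        score += 5   # row 2 preferred when row 1 is a title
    return score

kw = {"counterparty", "rating", "exposure"}
title = score_header_row(["Credit Exposure Summary"], kw, 1)
headers = score_header_row(["Counterparty", "Exposure EUR", "Rating"], kw, 2)
# headers scores higher than title, so row 2 is chosen
```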
Handles headers spanning multiple rows:
// Table with multi-row header:
// ┌──────────┬─────────────────┬─────────┐
// │ │ Financial │ │
// │ Account ├──────────┬──────┤ Status │
// │ │ Budget │Actual│ │
// └──────────┴──────────┴──────┴─────────┘
var options = new TableSemanticAnalysisOptions
{
MaxHeaderRowsToInspect = 3 // Check first 3 rows
};
// Composite headers: "Financial Budget", "Financial Actual"
Supports subtotals, totals, averages:
{
"TotalRows": {
"subtotal": {
"region": "EMEA Subtotal",
"amount": 1250000
},
"total": {
"region": "Grand Total",
"amount": 5000000
},
"average": {
"region": "Average",
"amount": 625000
}
}
}
Every detection includes a confidence score:
var analysis = analyzer.Analyze(table);
if (analysis.DomainConfidence < 0.6)
{
Console.WriteLine("⚠️ Low confidence detection - review results");
}
foreach (var col in analysis.Columns)
{
if (col.Confidence < 0.5)
{
Console.WriteLine($"⚠️ Column '{col.HeaderText}' has low confidence");
}
}
Override value conversion for specific types:
// Custom extractor with specialized conversion
public class CustomExtractor : TableJsonExtractor
{
protected override object? ConvertValue(string raw, string semanticType)
{
return semanticType switch
{
"amount" when raw.Contains("€") => ParseEuroAmount(raw),
"date" when raw.Contains("/") => ParseCustomDate(raw),
_ => base.ConvertValue(raw, semanticType)
};
}
}
Purpose: Analyzes table structure and applies semantic understanding
Key Methods:
public TableSemanticAnalysisResult Analyze(Table table)
Returns:
- DetectedDomain - Auto-detected or configured domain
- DomainConfidence - Detection confidence (0.0-1.0)
- HeaderRowNumbers - Detected header row(s)
- KeyColumnNumber - Primary identifier column
- Columns - Column metadata with canonical names and types
- Diagnostics - Analysis decision log
Purpose: Extracts table data to JSON structure
Key Methods:
public ExtractedTableData Extract(
Table table,
TableSemanticAnalysisResult analysis,
TableSemanticAnalysisOptions options)
Returns:
- DetectedDomain - Domain used for extraction
- HeaderRow - Original header texts
- Columns - Column definitions
- Rows - Data rows as dictionaries
- TotalRows - Summary rows grouped by type
Purpose: Automatically selects best matching domain
Key Methods:
public static (string DomainName, IDomainConfiguration Config, double Confidence)
DetectDomain(Table table)
Algorithm:
- Scans first 3 rows for keywords
- Scores each domain (0-100+)
- Returns best match with confidence
Purpose: Converts header text to canonical names
Key Methods:
public static string Canonicalize(string headerText)
public static string ApplySynonym(string canonicalName, IDomainConfiguration domain)
Process:
- Lowercase conversion
- Special character replacement (€→eur, %→percent)
- Non-alphanumeric removal
- Snake_case conversion
- Domain synonym application
Interface:
public interface IDomainConfiguration
{
ISet<string> HeaderKeywords { get; }
ISet<string> TotalRowKeywords { get; }
IDictionary<string, string> HeaderSynonyms { get; }
ISet<string> IdentifierHeaders { get; }
}
Built-in Implementations:
- FinancialDomainConfiguration
- HealthcareDomainConfiguration
- ManufacturingDomainConfiguration
- GenericDomainConfiguration
✅ Recommended:
var analyzer = new TableSemanticAnalyzer(); // Auto-detect
❌ Avoid premature optimization:
// Don't force domain unless necessary
var options = new TableSemanticAnalysisOptions
{
AutoDetectDomain = false,
Domain = new FinancialDomainConfiguration() // Only if required
};
var analysis = analyzer.Analyze(table);
if (analysis.DomainConfidence < 0.6)
{
// Low confidence - consider:
// 1. Adding domain-specific keywords to headers
// 2. Using explicit domain configuration
// 3. Creating custom domain
Console.WriteLine($"⚠️ Low confidence: {analysis.DomainConfidence:P0}");
}
// Always check diagnostics when results are unexpected
foreach (var diagnostic in analysis.Diagnostics)
{
Console.WriteLine(diagnostic);
}
The library handles this automatically, but be aware:
// Merged cells return null for non-anchor cells
// This is expected behavior
var data = extractor.Extract(table, analysis, options);
// Some columns may have null values in merged regions
| Data Type | Domain |
|---|---|
| Financial statements | Financial |
| Patient records | Healthcare |
| Inventory/supply chain | Manufacturing |
| Generic business data | Generic (auto-selected) |
| Custom industry | Create custom domain |
var data = extractor.Extract(table, analysis, options);
foreach (var row in data.Rows)
{
// Validate required fields
if (!row.ContainsKey("patient_id"))
{
Console.WriteLine("⚠️ Missing required field: patient_id");
}
// Validate data types
if (row.TryGetValue("amount", out var amount) && amount is not decimal)
{
Console.WriteLine($"⚠️ Expected decimal, got {amount?.GetType()}");
}
}
// Group related tables
var tablesByDomain = new Dictionary<string, List<ExtractedTableData>>();
foreach (Table table in tx.Tables)
{
var analysis = analyzer.Analyze(table);
var data = extractor.Extract(table, analysis, analyzer.Options);
if (!tablesByDomain.ContainsKey(analysis.DetectedDomain))
{
tablesByDomain[analysis.DetectedDomain] = new List<ExtractedTableData>();
}
tablesByDomain[analysis.DetectedDomain].Add(data);
}
Symptoms:
Domain: Generic (Confidence: 55%)
Expected: Financial
Solutions:
- Check header keywords:
  // Add domain-specific terms to headers
  // Instead of: "Amount", "Value"
  // Use: "Exposure EUR", "Market Value"
- Force explicit domain:
  var options = new TableSemanticAnalysisOptions
  {
      AutoDetectDomain = false,
      Domain = new FinancialDomainConfiguration()
  };
- Review diagnostics:
  foreach (var diagnostic in analysis.Diagnostics)
  {
      Console.WriteLine(diagnostic);
  }
  // Look for: "Auto-detected domain: X (confidence: Y)"
Symptoms:
HeaderRowNumbers: [1] // Should be [2]
Solutions:
- Check for title rows:
  - Row 1 with single merged cell = title (penalized)
  - System should auto-detect row 2 as headers
- Increase inspection depth:
  var options = new TableSemanticAnalysisOptions
  {
      MaxHeaderRowsToInspect = 5 // Check more rows
  };
- Mark headers explicitly:
  // In TX Text Control, mark row as header
  table.Rows[1].IsHeader = true;
Symptoms:
CanonicalName: "column_1" // Expected: "patient_id"
Solutions:
- Add keywords to headers:
  Instead of: "ID"
  Use: "Patient ID" or "MRN"
- Check domain configuration:
  var domain = new HealthcareDomainConfiguration();
  // Verify "patient" and "mrn" are in HeaderKeywords
- Add custom synonyms:
  public class CustomHealthcare : HealthcareDomainConfiguration
  {
      public CustomHealthcare()
      {
          HeaderSynonyms["id"] = "patient_id";
      }
  }
Symptoms:
TXTextControl.TextEditorException: Cannot access .Text
Solutions:
✅ This is handled automatically - the library catches these exceptions.
If you still see errors:
- Ensure you're using the latest version
- Check that SafeGetCellText is being used
- Report the issue with a sample document
Symptoms:
{
"TotalRows": null // Should contain total
}
Solutions:
- Check total row keywords:
  var domain = new FinancialDomainConfiguration();
  // Verify "Total", "Subtotal", etc. are in TotalRowKeywords
- Verify keyword placement:
  ✓ "Total" in ANY column
  ✗ "Total" only in calculation (not in text)
- Add custom keywords:
  domain.TotalRowKeywords.Add("Sum");
  domain.TotalRowKeywords.Add("Grand Total");
Symptoms:
- Slow processing of large documents
Solutions:
- Process tables selectively:
  foreach (Table table in tx.Tables)
  {
      // Skip nested tables
      if (table.NestedLevel > 0) continue;
      var analysis = analyzer.Analyze(table);
  }
- Batch process:
  // Process multiple documents in parallel
  await Task.WhenAll(files.Select(ProcessFileAsync));
- Cache the analyzer:
  // Reuse a single analyzer instance across files
  var analyzer = new TableSemanticAnalyzer();
  foreach (var file in files)
  {
      // Use same analyzer
  }
- TX Text Control .NET (any edition)
- .NET 10 (C# 14)
- Windows (for TX Text Control)
Contributions welcome! Areas for improvement:
-
Additional Domains
- Legal documents
- Education/academic
- Real estate
- Transportation/logistics
-
Enhanced Detection
- ML-based domain detection
- Pattern recognition for complex tables
- Multi-language support
-
More Features
- Table validation rules
- Schema generation
- Data quality scoring
- Export to other formats (CSV, Excel, Database)
- Financial Domain Guide - 150+ keywords
- Healthcare Domain Guide - 80+ keywords
- Manufacturing Domain Guide - 90+ keywords
- Generic Domain Guide - 30+ keywords
- Auto-Detection Guide - How detection works
- TX Text Control Docs - TX Text Control documentation
This library transforms document table extraction from a manual, error-prone process into an intelligent, automated workflow:
- Load any TX Text Control document
- Analyze tables with automatic domain detection
- Extract to structured JSON with semantic understanding
- Process with industry-specific field mappings
Zero configuration required - the system automatically understands financial statements, patient records, inventory sheets, and more.
Perfect for:
- Document processing pipelines
- Data migration projects
- Report automation
- Business intelligence extraction
- Healthcare data aggregation
- Financial analysis tools
Get started in a few lines of code:
using var tx = new ServerTextControl();
tx.Load("document.docx", StreamType.WordprocessingML);
var analyzer = new TableSemanticAnalyzer(new TableSemanticAnalysisOptions { AutoDetectDomain = true });
var extractor = new TableJsonExtractor();
var analysis = analyzer.Analyze(tx.Tables[0]);
var data = extractor.Extract(tx.Tables[0], analysis, analyzer.Options);
Console.WriteLine(data.ToJson());