
TextControl/TXTextControl.SmartTableExtraction


TX Text Control Table Detection & Extraction

Intelligent table analysis and JSON extraction with automatic domain detection

Transform complex document tables into structured, semantic JSON data with zero configuration. This library automatically understands financial statements, healthcare records, manufacturing inventories, and more.




Overview

The Problem

Document tables are everywhere—financial reports, patient records, inventory sheets—but extracting their data programmatically is challenging:

  • Headers are ambiguous: "Patient ID" vs "MRN" vs "Medical Record Number"
  • Tables vary by industry: Financial tables look nothing like manufacturing inventories
  • Structure is inconsistent: Merged cells, multi-row headers, title rows
  • Manual mapping is tedious: Maintaining header-to-field mappings for every variation

The Solution

This library provides semantic table understanding:

// Before: Complex parsing logic for each table type
// After: Automatic understanding
var analyzer = new TableSemanticAnalyzer();
var result = analyzer.Analyze(table);
// Automatically knows: "MRN" → patient_id, "SKU" → product_id, etc.

Key Innovation: The system automatically detects the table's domain (Financial, Healthcare, Manufacturing, Generic) and applies industry-specific knowledge to understand headers, identify key columns, and structure the data correctly.


Key Concepts

1. Domain-Driven Analysis

Tables are analyzed within domain contexts:

| Domain        | Examples                        | Keywords                              |
|---------------|---------------------------------|---------------------------------------|
| Financial     | P&L, Balance Sheet, Credit Risk | counterparty, EBITDA, assets, revenue |
| Healthcare    | Patient Records, Lab Results    | patient, MRN, diagnosis, ICD          |
| Manufacturing | Inventory, BOM, Production      | SKU, quantity, warehouse, batch       |
| Generic       | Any other table                 | name, date, amount, status            |

2. Automatic Canonicalization

Headers are automatically normalized:

"Exposure EUR" → exposure_amount_eur
"Patient ID"   → patient_id  
"Part Number"  → product_id
"Util %"       → utilization_percent
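The normalization above can be sketched in a few lines. This is a simplified illustration, not the library's actual implementation; the real rules live in HeaderCanonicalizer, and domain synonyms (e.g. exposure_eur → exposure_amount_eur) are applied afterwards:

```csharp
using System;
using System.Text.RegularExpressions;

static string Canonicalize(string header)
{
    // Lowercase and expand symbols before stripping punctuation
    string s = header.ToLowerInvariant()
        .Replace("€", " eur ")
        .Replace("$", " usd ")
        .Replace("%", " percent ");

    // Keep only letters, digits, and spaces
    s = Regex.Replace(s, "[^a-z0-9 ]", " ");

    // Collapse whitespace runs into underscores (snake_case)
    return Regex.Replace(s.Trim(), @"\s+", "_");
}

Console.WriteLine(Canonicalize("Exposure EUR")); // exposure_eur
Console.WriteLine(Canonicalize("Util %"));       // util_percent
```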

3. Semantic Type Inference

Columns are automatically typed:

  • identifier: Patient ID, SKU, Counterparty
  • amount: Monetary values
  • percentage: Utilization %, completion rates
  • date: Dates, maturity dates
  • text: Descriptions, names
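A hedged sketch of how such an inference could work — header keywords first, then a sample of the column's data values. The helper name and rules are illustrative, not the library's API:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper: header keywords take priority, data samples break ties.
static string InferType(string header, IReadOnlyList<string> samples)
{
    string h = header.ToLowerInvariant();
    if (h.Contains("%") || h.Contains("percent")) return "percentage";
    if (h.Contains("date") || h.Contains("maturity")) return "date";
    if (h.EndsWith("id") || h.Contains("sku")) return "identifier";

    // Fallback: mostly numeric samples suggest a monetary amount
    int numeric = samples.Count(v => decimal.TryParse(
        v.Trim('€', '$', ' '), NumberStyles.Number,
        CultureInfo.InvariantCulture, out _));
    return numeric > samples.Count / 2 ? "amount" : "text";
}

Console.WriteLine(InferType("Util %", new[] { "45", "80" }));                   // percentage
Console.WriteLine(InferType("Exposure EUR", new[] { "1,250,000", "900,000" })); // amount
```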

4. Total Row Detection

Summary rows are automatically identified and separated:

{
  "Rows": [ /* data rows */ ],
  "TotalRows": {
    "subtotal": { ... },
    "total": { ... }
  }
}
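The detection itself can be sketched as a keyword scan over the row's cells. The keyword list below is illustrative; in the library, each domain configuration supplies its own TotalRowKeywords:

```csharp
using System;
using System.Collections.Generic;

// Sketch: a row is a summary row when any cell starts with a total keyword.
static string? ClassifyTotalRow(IEnumerable<string> cells)
{
    string[] keywords = { "grand total", "subtotal", "total", "average" };
    foreach (string cell in cells)
    {
        string text = cell.Trim().ToLowerInvariant();
        foreach (string kw in keywords)
            if (text.StartsWith(kw)) return kw.Replace(' ', '_');
    }
    return null; // regular data row
}

Console.WriteLine(ClassifyTotalRow(new[] { "Subtotal EMEA", "1,250,000" }));        // subtotal
Console.WriteLine(ClassifyTotalRow(new[] { "Alpha AG", "900,000" }) ?? "data row"); // data row
```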

Architecture

Component Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        TX Text Control                          │
│                         Document/Table                          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                     TableSemanticAnalyzer                       │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  1. Domain Detection (DomainDetector)                    │  │
│  │     • Analyzes headers                                   │  │
│  │     • Scores each domain                                 │  │
│  │     • Selects best match                                 │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  2. Header Row Detection                                 │  │
│  │     • Identifies header rows                             │  │
│  │     • Distinguishes title rows from headers              │  │
│  │     • Handles merged cells                               │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  3. Column Analysis                                      │  │
│  │     • Extracts header text                               │  │
│  │     • Canonicalizes names (HeaderCanonicalizer)          │  │
│  │     • Infers semantic types                              │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  4. Key Column Detection                                 │  │
│  │     • Identifies identifier columns                      │  │
│  │     • Analyzes uniqueness and data types                 │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼ TableSemanticAnalysisResult
                             │   • DetectedDomain
                             │   • HeaderRowNumbers
                             │   • Columns (with canonical names)
                             │   • KeyColumnNumber
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      TableJsonExtractor                         │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  1. Row Processing                                       │  │
│  │     • Iterates data rows                                 │  │
│  │     • Extracts cell values (handles merged cells)        │  │
│  │     • Identifies total rows                              │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  2. Value Conversion                                     │  │
│  │     • Parses amounts (removes currency symbols)          │  │
│  │     • Parses percentages                                 │  │
│  │     • Formats dates                                      │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  3. JSON Structure Building                              │  │
│  │     • Maps to canonical column names                     │  │
│  │     • Groups regular rows vs. total rows                 │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼ ExtractedTableData (JSON)

Key Components

1. DomainDetector

Analyzes table content and selects the appropriate domain configuration.

Input: TX Text Control Table object
Output: Domain name, configuration, and confidence score

2. Domain Configurations

Define industry-specific knowledge:

  • Header keywords
  • Synonym mappings
  • Identifier patterns
  • Total row keywords

3. HeaderCanonicalizer

Converts header text to standardized names:

  • Removes special characters
  • Converts to snake_case
  • Applies domain synonyms

4. TableSemanticAnalyzer

Core analysis engine that:

  • Detects headers
  • Identifies structure
  • Infers semantics
  • Applies domain knowledge

5. TableJsonExtractor

Converts analyzed tables to JSON:

  • Extracts row data
  • Handles merged cells
  • Separates totals
  • Formats values

The Pipeline

Step-by-Step Data Flow

┌─────────────────────┐
│   Load Document     │
│  (TX Text Control)  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  For Each Table     │
└──────────┬──────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 1: DOMAIN DETECTION                                │
│                                                           │
│  Input: Table cells (first 3 rows)                       │
│                                                           │
│  Process:                                                 │
│  1. Extract header cell text                             │
│  2. Score against each domain's keywords                 │
│     • Financial: "Counterparty" → +15 points             │
│     • Healthcare: "Patient" → +15 points                 │
│     • Manufacturing: "SKU" → +15 points                  │
│  3. Select highest scoring domain                        │
│                                                           │
│  Output: DetectedDomain = "Financial" (0.85 confidence)  │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 2: HEADER DETECTION                                │
│                                                           │
│  Process:                                                 │
│  1. Check for explicit header rows (IsHeader property)   │
│  2. If not found, score rows heuristically:              │
│     • Contains domain keywords (+points)                 │
│     • Mostly text, not numbers (+points)                 │
│     • Has formatting (shaded, bordered) (+points)        │
│     • Row 2 preferred over row 1 (+bonus)                │
│     • Single merged cell = title row (-penalty)          │
│  3. Select best scoring row                              │
│                                                           │
│  Output: HeaderRowNumbers = [2]                          │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 3: COLUMN ANALYSIS                                 │
│                                                           │
│  For each column:                                         │
│  1. Extract header text from detected header row         │
│     "Exposure EUR"                                        │
│                                                           │
│  2. Auto-canonicalize:                                   │
│     • Convert to lowercase                               │
│     • Replace special chars (€→eur, %→percent)           │
│     • Convert to snake_case: "exposure_eur"              │
│                                                           │
│  3. Apply domain synonyms:                               │
│     "exposure_eur" → "exposure_amount_eur"               │
│                                                           │
│  4. Infer semantic type:                                 │
│     • Check header keywords (% → percentage)             │
│     • Sample first 10 data rows                          │
│     • Detect patterns (numbers → amount, dates → date)   │
│     Type = "amount"                                      │
│                                                           │
│  Output: Column {                                         │
│    HeaderText: "Exposure EUR",                           │
│    CanonicalName: "exposure_amount_eur",                 │
│    SemanticType: "amount"                                │
│  }                                                        │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 4: KEY COLUMN DETECTION                            │
│                                                           │
│  Process:                                                 │
│  1. For each column, analyze data rows:                  │
│     • Text ratio (non-numeric values)                    │
│     • Uniqueness ratio (distinct values)                 │
│     • Left position bias (column 1 or 2)                 │
│     • Semantic boost (identifier headers)                │
│  2. Select column with highest score                     │
│                                                           │
│  Output: KeyColumnNumber = 1 (Counterparty)             │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 5: DATA EXTRACTION                                 │
│                                                           │
│  For each row (starting after header):                   │
│  1. Extract cell values:                                 │
│     • Direct cell → use value                            │
│     • Merged cell → search for anchor                    │
│     • Empty cell → null                                  │
│                                                           │
│  2. Check if total row:                                  │
│     • Scan all columns for total keywords                │
│     • "Total", "Subtotal", "Average", etc.               │
│                                                           │
│  3. Convert values by semantic type:                     │
│     • amount: Parse decimal, remove €/$                  │
│     • percentage: Parse decimal, remove %                │
│     • date: Format as yyyy-MM-dd                         │
│     • text: Keep as-is                                   │
│                                                           │
│  4. Map to canonical column names:                       │
│     {                                                     │
│       "counterparty": "Alpha AG",                        │
│       "exposure_amount_eur": 1250000,                    │
│       "rating": "A"                                      │
│     }                                                     │
│                                                           │
│  Output: Rows[] or TotalRows{}                           │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ FINAL OUTPUT: JSON                                       │
│                                                           │
│ {                                                         │
│   "DetectedDomain": "Financial",                         │
│   "DomainConfidence": 0.85,                              │
│   "HeaderRow": ["Counterparty", "Exposure EUR", ...],    │
│   "Columns": [ ... ],                                     │
│   "Rows": [ ... ],                                        │
│   "TotalRows": { ... }                                    │
│ }                                                         │
└───────────────────────────────────────────────────────────┘
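The Phase 5 value conversion above can be sketched as follows. This is a simplified, invariant-culture version; the extractor's actual parsing may be locale-aware:

```csharp
using System;
using System.Globalization;

static object? ConvertValue(string raw, string semanticType)
{
    string s = raw.Trim();
    switch (semanticType)
    {
        case "amount":
            // Strip currency symbols and thousands separators, then parse
            s = s.Replace("€", "").Replace("$", "").Replace(",", "").Trim();
            return decimal.TryParse(s, NumberStyles.Number,
                CultureInfo.InvariantCulture, out decimal d) ? (object)d : null;
        case "percentage":
            s = s.Replace("%", "").Trim();
            return decimal.TryParse(s, NumberStyles.Number,
                CultureInfo.InvariantCulture, out decimal p) ? (object)p : null;
        case "date":
            return DateTime.TryParse(s, CultureInfo.InvariantCulture,
                DateTimeStyles.None, out DateTime dt) ? dt.ToString("yyyy-MM-dd") : s;
        default:
            return s; // text: keep as-is
    }
}

Console.WriteLine(ConvertValue("€1,250,000", "amount"));  // 1250000
Console.WriteLine(ConvertValue("62.5 %", "percentage"));  // 62.5
```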

Quick Start

Installation

# Add TX Text Control reference (prerequisite)
# Add this project to your solution

Basic Usage

using System;
using TextControl.TableDetection;
using TextControl.TableDetection.Domain;
using TXTextControl;

// Load document
using var tx = new ServerTextControl();
tx.Create();
tx.Load("financial-report.docx", StreamType.WordprocessingML);

Console.WriteLine($"Tables detected: {tx.Tables.Count}");

if (tx.Tables.Count == 0)
{
    Console.WriteLine("No tables found.");
    return;
}

// Mode 1: Auto-detect domain (recommended)
var options = new TableSemanticAnalysisOptions
{
    SkipNestedTables = true,
    MaxHeaderRowsToInspect = 3,
    AutoDetectDomain = true,
    Domain = null  // Let it auto-detect
};

// Mode 2: Explicit domain (uncomment to use)
// var options = new TableSemanticAnalysisOptions
// {
//     AutoDetectDomain = false,
//     Domain = new FinancialDomainConfiguration()
// };

var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();

foreach (Table table in tx.Tables)
{
    // Analyze table structure and semantics
    var analysis = analyzer.Analyze(table);
    
    // Extract data to JSON
    var data = extractor.Extract(table, analysis, options);
    
    // Output results
    Console.WriteLine(new string('=', 80));
    Console.WriteLine($"Table ID: {data.TableId}");
    Console.WriteLine($"Detected Domain: {analysis.DetectedDomain} (Confidence: {analysis.DomainConfidence:P0})");
    Console.WriteLine();
    Console.WriteLine(data.ToJson());
    Console.WriteLine();
}

Output:

================================================================================
Table ID: 0
Detected Domain: Financial (Confidence: 85%)

{
  "TableId": 0,
  "DetectedDomain": "Financial",
  "TableName": "Credit Exposure Summary",
  "HeaderRow": ["Counterparty", "Exposure EUR", "Rating"],
  "Rows": [
    { "counterparty": "Alpha AG", "exposure_amount_eur": 1250000, "rating": "A" }
  ]
}

Domain Configurations

Financial Domain 🏦

Best for: P&L statements, balance sheets, credit risk tables, financial reports

Keywords (150+):

  • Credit/Risk: counterparty, exposure, rating, limit, utilization
  • P&L: revenue, COGS, EBITDA, operating profit, net income
  • Balance Sheet: assets, liabilities, equity, cash, inventory
  • Periods: YTD, QTD, actual, budget, variance

📖 Complete Financial Guide →

Example:

| Counterparty | Exposure EUR | Rating | Limit     |
|--------------|--------------|--------|-----------|
| Alpha AG     | 1,250,000    | A      | 2,000,000 |

→ Detects as Financial (85% confidence)

Healthcare Domain 🏥

Best for: Patient records, clinical data, diagnoses, medications, lab results

Keywords (80+):

  • Identifiers: patient, MRN, medical record number
  • Clinical: diagnosis, ICD, CPT, procedure, medication
  • Visits: encounter, admission, discharge, LOS
  • Providers: physician, doctor, department

📖 Complete Healthcare Guide →

Example:

| Patient ID | MRN    | Diagnosis           | Physician   |
|------------|--------|---------------------|-------------|
| P001       | 123456 | Type 2 Diabetes     | Dr. Smith   |

→ Detects as Healthcare (92% confidence)

Manufacturing Domain 🏭

Best for: Inventory, supply chain, production, parts management

Keywords (90+):

  • Products: SKU, part number, item
  • Inventory: quantity, stock, warehouse
  • Supply Chain: supplier, vendor, manufacturer
  • Production: batch, lot, production date

📖 Complete Manufacturing Guide →

Example:

| SKU        | Description    | Qty  | Warehouse | Supplier   |
|------------|----------------|------|-----------|------------|
| P-001-2024 | Steel Bolt M8  | 5000 | WH-01     | FastCo     |

→ Detects as Manufacturing (88% confidence)

Generic Domain 📋

Best for: General-purpose tables, mixed data, custom business tables

Keywords (30+):

  • Common: name, id, description, date, amount
  • Status: status, type, category
  • Numeric: quantity, value, count, total

📖 Complete Generic Guide →

Example:

| Employee ID | Name        | Department | Status  |
|-------------|-------------|------------|---------|
| E001        | John Smith  | IT         | Active  |

→ Detects as Generic (65% confidence)


Sample Usage Scenarios

Scenario 1: Financial Report Analysis

Use Case: Extract P&L data from Word document for analysis

using var tx = new ServerTextControl();
tx.Create();
tx.Load("Q4-2024-Financial-Report.docx", StreamType.WordprocessingML);

var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = true
};

var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    
    // Only process financial tables
    if (analysis.DetectedDomain == "Financial" && 
        analysis.DomainConfidence > 0.7)
    {
        var data = extractor.Extract(table, analysis, options);
        
        // Extract specific metrics
        foreach (var row in data.Rows)
        {
            if (row.TryGetValue("revenue", out var revenue))
            {
                Console.WriteLine($"Revenue: {revenue}");
            }
            if (row.TryGetValue("ebitda", out var ebitda))
            {
                Console.WriteLine($"EBITDA: {ebitda}");
            }
        }
        
        // Check totals
        if (data.TotalRows?.TryGetValue("total", out var totals) == true)
        {
            Console.WriteLine("Total Row:");
            // JsonSerializer requires: using System.Text.Json;
            Console.WriteLine(JsonSerializer.Serialize(totals, new JsonSerializerOptions 
            { 
                WriteIndented = true 
            }));
        }
    }
}

Scenario 2: Healthcare Data Migration

Use Case: Extract patient records for system migration

using var tx = new ServerTextControl();
tx.Create();
tx.Load("patient-census.tx", StreamType.InternalUnicodeFormat);

// Force Healthcare domain for consistency
var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = false,
    Domain = new HealthcareDomainConfiguration()
};

var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();

var patients = new List<PatientRecord>();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    var data = extractor.Extract(table, analysis, options);
    
    // Map to domain model
    foreach (var row in data.Rows)
    {
        patients.Add(new PatientRecord
        {
            PatientId = row.GetValueOrDefault("patient_id")?.ToString(),
            MRN = row.GetValueOrDefault("patient_id")?.ToString(), // Synonym mapped
            DateOfBirth = row.GetValueOrDefault("date_of_birth")?.ToString(),
            Physician = row.GetValueOrDefault("physician")?.ToString()
        });
    }
}

// Export to database
await SavePatientsAsync(patients);

Scenario 3: Inventory Management

Use Case: Extract inventory data for warehouse system

using var tx = new ServerTextControl();
tx.Create();
tx.Load("inventory-report.docx", StreamType.WordprocessingML);

var analyzer = new TableSemanticAnalyzer(); // Auto-detect
var extractor = new TableJsonExtractor();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    
    if (analysis.DetectedDomain == "Manufacturing")
    {
        var data = extractor.Extract(table, analysis, analyzer.Options);
        
        // Process inventory items
        foreach (var row in data.Rows)
        {
            var item = new InventoryItem
            {
                SKU = row.GetValueOrDefault("product_id")?.ToString(),
                Quantity = Convert.ToInt32(row.GetValueOrDefault("quantity")),
                Warehouse = row.GetValueOrDefault("warehouse")?.ToString(),
                Supplier = row.GetValueOrDefault("supplier")?.ToString()
            };
            
            // Update inventory system
            await UpdateInventoryAsync(item);
        }
    }
}

Scenario 4: Multi-Domain Document Processing

Use Case: Process document with tables from multiple domains

using var tx = new ServerTextControl();
tx.Create();
tx.Load("annual-report.docx", StreamType.WordprocessingML);

var analyzer = new TableSemanticAnalyzer();
var extractor = new TableJsonExtractor();

var results = new Dictionary<string, List<ExtractedTableData>>();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    var data = extractor.Extract(table, analysis, analyzer.Options);
    
    // Group by detected domain
    if (!results.ContainsKey(analysis.DetectedDomain))
    {
        results[analysis.DetectedDomain] = new List<ExtractedTableData>();
    }
    results[analysis.DetectedDomain].Add(data);
    
    // Log detection
    Console.WriteLine($"Table {table.ID}: {analysis.DetectedDomain} " +
                     $"({analysis.DomainConfidence:P0})");
}

// Process by domain
if (results.ContainsKey("Financial"))
{
    ProcessFinancialTables(results["Financial"]);
}
if (results.ContainsKey("Manufacturing"))
{
    ProcessInventoryTables(results["Manufacturing"]);
}

Scenario 5: Custom Domain Implementation

Use Case: Create domain for retail industry

// Define custom retail domain
public class RetailDomainConfiguration : IDomainConfiguration
{
    public ISet<string> HeaderKeywords { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "product", "sku", "barcode", "upc",
        "price", "cost", "margin", "discount",
        "store", "location", "sales", "units sold",
        "category", "brand", "supplier"
    };

    public ISet<string> TotalRowKeywords { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "total", "subtotal", "grand total"
    };

    public IDictionary<string, string> HeaderSynonyms { get; } = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
    {
        ["upc"] = "barcode",
        ["product code"] = "sku",
        ["retail price"] = "price",
        ["units"] = "quantity_sold"
    };

    public ISet<string> IdentifierHeaders { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "product", "sku", "barcode", "upc"
    };
}

// Use custom domain
var options = new TableSemanticAnalysisOptions
{
    Domain = new RetailDomainConfiguration()
};

var analyzer = new TableSemanticAnalyzer(options);

Scenario 6: Batch Processing with Error Handling

Use Case: Process multiple documents with error handling

public async Task ProcessDocumentsAsync(string[] filePaths)
{
    var analyzer = new TableSemanticAnalyzer();
    var extractor = new TableJsonExtractor();
    
    foreach (var filePath in filePaths)
    {
        try
        {
            using var tx = new ServerTextControl();
            tx.Create();
            tx.Load(filePath, StreamType.WordprocessingML);
            
            Console.WriteLine($"\nProcessing: {Path.GetFileName(filePath)}");
            Console.WriteLine($"Tables found: {tx.Tables.Count}");
            
            foreach (Table table in tx.Tables)
            {
                try
                {
                    var analysis = analyzer.Analyze(table);
                    var data = extractor.Extract(table, analysis, analyzer.Options);
                    
                    // Save to database or file
                    await SaveTableDataAsync(filePath, table.ID, data);
                    
                    Console.WriteLine($"  ✓ Table {table.ID}: {analysis.DetectedDomain} " +
                                    $"({data.Rows.Count} rows)");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"  ✗ Table {table.ID}: {ex.Message}");
                    // Log error but continue processing
                    await LogErrorAsync(filePath, table.ID, ex);
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Failed to process {filePath}: {ex.Message}");
        }
    }
}

Scenario 7: Diagnostics and Debugging

Use Case: Understand detection decisions for troubleshooting

var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = true
};

var analyzer = new TableSemanticAnalyzer(options);
var analysis = analyzer.Analyze(table);

// Review diagnostics
Console.WriteLine("=== Analysis Diagnostics ===");
foreach (var diagnostic in analysis.Diagnostics)
{
    Console.WriteLine($"  {diagnostic}");
}

// Output domain detection
Console.WriteLine($"\nDetected Domain: {analysis.DetectedDomain}");
Console.WriteLine($"Confidence: {analysis.DomainConfidence:P2}");

// Review detected structure
Console.WriteLine($"\nTable Structure:");
Console.WriteLine($"  Name: {analysis.DetectedTableName}");
Console.WriteLine($"  Header Rows: {string.Join(", ", analysis.HeaderRowNumbers)}");
Console.WriteLine($"  Key Column: {analysis.KeyColumnNumber}");
Console.WriteLine($"  Columns: {analysis.Columns.Count}");

// Review each column
Console.WriteLine($"\nColumn Details:");
foreach (var col in analysis.Columns)
{
    Console.WriteLine($"  {col.ColumnNumber}. {col.HeaderText}");
    Console.WriteLine($"      Canonical: {col.CanonicalName}");
    Console.WriteLine($"      Type: {col.SemanticType}");
    Console.WriteLine($"      Confidence: {col.Confidence:P0}");
}

Example Output:

=== Analysis Diagnostics ===
  Auto-detected domain: Financial (confidence: 0.85)
  Detected header row heuristically: 2
  Detected key column: 1
  Detected 5 columns

Detected Domain: Financial
Confidence: 85.00%

Table Structure:
  Name: Credit Exposure Summary
  Header Rows: 2
  Key Column: 1
  Columns: 5

Column Details:
  1. Counterparty
      Canonical: counterparty
      Type: identifier
      Confidence: 85%
  2. Exposure EUR
      Canonical: exposure_amount_eur
      Type: amount
      Confidence: 90%
  ...

Advanced Features

1. Merged Cell Handling

The library safely handles merged cells without exceptions:

// Table with merged title row:
// ┌─────────────────────────────┐
// │  Credit Exposure Summary    │  ← Single merged cell
// ├──────────┬─────────┬────────┤
// │ Name     │ Amount  │ Rating │  ← Actual headers
// └──────────┴─────────┴────────┘

var analysis = analyzer.Analyze(table);
// Correctly identifies row 2 as headers, not row 1
// InternalTitle = "Credit Exposure Summary"
// HeaderRowNumbers = [2]

How it works:

  • Detects merged cells by catching TextEditorException on .Text access
  • Penalizes single-cell rows in header detection
  • Automatically identifies title rows vs. header rows

2. Multi-Row Headers

Handles headers spanning multiple rows:

// Table with multi-row header:
// ┌──────────┬─────────────────┬─────────┐
// │          │   Financial     │         │
// │ Account  ├──────────┬──────┤ Status  │
// │          │ Budget   │Actual│         │
// └──────────┴──────────┴──────┴─────────┘

var options = new TableSemanticAnalysisOptions
{
    MaxHeaderRowsToInspect = 3  // Check first 3 rows
};

// Composite headers: "Financial Budget", "Financial Actual"
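Conceptually, the composite header is the join of each column's non-empty header fragments, top to bottom. A minimal sketch (ComposeHeader is a hypothetical helper, not part of the library's API):

```csharp
using System;
using System.Linq;

// Join the non-empty header fragments of one column, top to bottom,
// to form composites such as "Financial Budget" / "Financial Actual".
static string ComposeHeader(params string?[] rowTexts) =>
    string.Join(" ", rowTexts
        .Select(t => t?.Trim())
        .Where(t => !string.IsNullOrEmpty(t)));

Console.WriteLine(ComposeHeader("Financial", "Budget")); // Financial Budget
Console.WriteLine(ComposeHeader(null, "Status"));        // Status
```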

3. Multiple Total Rows

Supports subtotals, totals, averages:

{
  "TotalRows": {
    "subtotal": {
      "region": "EMEA Subtotal",
      "amount": 1250000
    },
    "total": {
      "region": "Grand Total",
      "amount": 5000000
    },
    "average": {
      "region": "Average",
      "amount": 625000
    }
  }
}

4. Confidence Scoring

Every detection includes confidence:

var analysis = analyzer.Analyze(table);

if (analysis.DomainConfidence < 0.6)
{
    Console.WriteLine("⚠️ Low confidence detection - review results");
}

foreach (var col in analysis.Columns)
{
    if (col.Confidence < 0.5)
    {
        Console.WriteLine($"⚠️ Column '{col.HeaderText}' has low confidence");
    }
}

5. Custom Type Conversion

Override value conversion for specific types:

// Custom extractor with specialized conversion
public class CustomExtractor : TableJsonExtractor
{
    protected override object? ConvertValue(string raw, string semanticType)
    {
        return semanticType switch
        {
            "amount" when raw.Contains("€") => ParseEuroAmount(raw),
            "date" when raw.Contains("/") => ParseCustomDate(raw),
            _ => base.ConvertValue(raw, semanticType)
        };
    }
}

Component Reference

TableSemanticAnalyzer

Purpose: Analyzes table structure and applies semantic understanding

Key Methods:

public TableSemanticAnalysisResult Analyze(Table table)

Returns:

  • DetectedDomain - Auto-detected or configured domain
  • DomainConfidence - Detection confidence (0.0-1.0)
  • HeaderRowNumbers - Detected header row(s)
  • KeyColumnNumber - Primary identifier column
  • Columns - Column metadata with canonical names and types
  • Diagnostics - Analysis decision log

TableJsonExtractor

Purpose: Extracts table data to JSON structure

Key Methods:

public ExtractedTableData Extract(
    Table table, 
    TableSemanticAnalysisResult analysis,
    TableSemanticAnalysisOptions options)

Returns:

  • DetectedDomain - Domain used for extraction
  • HeaderRow - Original header texts
  • Columns - Column definitions
  • Rows - Data rows as dictionaries
  • TotalRows - Summary rows grouped by type

DomainDetector

Purpose: Automatically selects best matching domain

Key Methods:

public static (string DomainName, IDomainConfiguration Config, double Confidence) 
    DetectDomain(Table table)

Algorithm:

  • Scans first 3 rows for keywords
  • Scores each domain (0-100+)
  • Returns best match with confidence
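The scoring idea can be sketched as counting keyword hits per domain and normalizing the winner's share into a rough confidence. The weights, keyword lists, and normalization below are illustrative, not the library's actual formula:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Count header cells that hit each domain's keyword list; best score wins.
static (string Domain, double Confidence) Detect(
    IReadOnlyList<string> headers,
    IDictionary<string, string[]> domainKeywords)
{
    string best = "Generic";
    int bestScore = 0, total = 0;
    foreach (var pair in domainKeywords)
    {
        int score = headers.Count(h => pair.Value.Any(
            k => h.Contains(k, StringComparison.OrdinalIgnoreCase)));
        total += score;
        if (score > bestScore) { bestScore = score; best = pair.Key; }
    }
    return (best, total == 0 ? 0.0 : (double)bestScore / total);
}

var keywords = new Dictionary<string, string[]>
{
    ["Financial"]  = new[] { "counterparty", "exposure", "rating" },
    ["Healthcare"] = new[] { "patient", "mrn", "diagnosis" }
};

var (domain, confidence) = Detect(
    new[] { "Counterparty", "Exposure EUR", "Rating" }, keywords);
Console.WriteLine(domain); // Financial
```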

HeaderCanonicalizer

Purpose: Converts header text to canonical names

Key Methods:

public static string Canonicalize(string headerText)
public static string ApplySynonym(string canonicalName, IDomainConfiguration domain)

Process:

  • Lowercase conversion
  • Special character replacement (€→eur, %→percent)
  • Non-alphanumeric removal
  • Snake_case conversion
  • Domain synonym application
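The steps above applied to a sample header (the intermediate forms and the synonym mapping are illustrative):

```csharp
var canonical = HeaderCanonicalizer.Canonicalize("Exposure EUR %");
// "exposure eur percent"  after lowercasing and €/% replacement
// "exposure_eur_percent"  after snake_case conversion

var mapped = HeaderCanonicalizer.ApplySynonym(canonical, new FinancialDomainConfiguration());
// domain synonyms can then map the result to e.g. "exposure_amount_eur"
```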

Domain Configurations

Interface:

public interface IDomainConfiguration
{
    ISet<string> HeaderKeywords { get; }
    ISet<string> TotalRowKeywords { get; }
    IDictionary<string, string> HeaderSynonyms { get; }
    ISet<string> IdentifierHeaders { get; }
}

Built-in Implementations:

  • FinancialDomainConfiguration
  • HealthcareDomainConfiguration
  • ManufacturingDomainConfiguration
  • GenericDomainConfiguration
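A custom domain can implement the interface directly. A hypothetical logistics configuration as an example (all keywords and synonyms are invented for illustration):

```csharp
public class LogisticsDomainConfiguration : IDomainConfiguration
{
    public ISet<string> HeaderKeywords { get; } =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "shipment", "carrier", "tracking", "origin", "destination" };

    public ISet<string> TotalRowKeywords { get; } =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "total", "subtotal" };

    public IDictionary<string, string> HeaderSynonyms { get; } =
        new Dictionary<string, string>
        { ["tracking_no"] = "tracking_number", ["awb"] = "tracking_number" };

    public ISet<string> IdentifierHeaders { get; } =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "shipment_id", "tracking_number" };
}
```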

Best Practices

1. Start with Auto-Detection

Recommended:

var analyzer = new TableSemanticAnalyzer();  // Auto-detect

Avoid premature optimization:

// Don't force domain unless necessary
var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = false,
    Domain = new FinancialDomainConfiguration()  // Only if required
};

2. Check Confidence Scores

var analysis = analyzer.Analyze(table);

if (analysis.DomainConfidence < 0.6)
{
    // Low confidence - consider:
    // 1. Adding domain-specific keywords to headers
    // 2. Using explicit domain configuration
    // 3. Creating custom domain
    Console.WriteLine($"⚠️ Low confidence: {analysis.DomainConfidence:P0}");
}

3. Review Diagnostics for Issues

// Always check diagnostics when results are unexpected
foreach (var diagnostic in analysis.Diagnostics)
{
    Console.WriteLine(diagnostic);
}

4. Handle Merged Cells Gracefully

The library handles this automatically, but be aware:

// Merged cells return null for non-anchor cells
// This is expected behavior
var data = extractor.Extract(table, analysis, options);
// Some columns may have null values in merged regions

5. Use Appropriate Domains

Data Type Domain
Financial statements Financial
Patient records Healthcare
Inventory/supply chain Manufacturing
Generic business data Generic (auto-selected)
Custom industry Create custom domain

6. Validate Critical Data

var data = extractor.Extract(table, analysis, options);

foreach (var row in data.Rows)
{
    // Validate required fields
    if (!row.ContainsKey("patient_id"))
    {
        Console.WriteLine("⚠️ Missing required field: patient_id");
    }
    
    // Validate data types
    if (row.TryGetValue("amount", out var amount) && amount is not decimal)
    {
        Console.WriteLine($"⚠️ Expected decimal, got {amount?.GetType()}");
    }
}

7. Process Tables in Context

// Group related tables
var tablesByDomain = new Dictionary<string, List<ExtractedTableData>>();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    var data = extractor.Extract(table, analysis, analyzer.Options);
    
    if (!tablesByDomain.TryGetValue(analysis.DetectedDomain, out var list))
    {
        list = new List<ExtractedTableData>();
        tablesByDomain[analysis.DetectedDomain] = list;
    }
    list.Add(data);
}

Troubleshooting

Problem: Wrong Domain Detected

Symptoms:

Domain: Generic (Confidence: 55%)
Expected: Financial

Solutions:

  1. Check header keywords:

    // Add domain-specific terms to headers
    // Instead of: "Amount", "Value"
    // Use: "Exposure EUR", "Market Value"
  2. Force explicit domain:

    var options = new TableSemanticAnalysisOptions
    {
        AutoDetectDomain = false,
        Domain = new FinancialDomainConfiguration()
    };
  3. Review diagnostics:

    foreach (var diagnostic in analysis.Diagnostics)
    {
        Console.WriteLine(diagnostic);
    }
    // Look for: "Auto-detected domain: X (confidence: Y)"

Problem: Wrong Headers Detected

Symptoms:

HeaderRowNumbers: [1]  // Should be [2]

Solutions:

  1. Check for title rows:

    • Row 1 with single merged cell = title (penalized)
    • System should auto-detect row 2 as headers
  2. Increase inspection depth:

    var options = new TableSemanticAnalysisOptions
    {
        MaxHeaderRowsToInspect = 5  // Check more rows
    };
  3. Mark headers explicitly:

    // In TX Text Control, mark row as header
    table.Rows[1].IsHeader = true;

Problem: Columns Have Generic Names

Symptoms:

CanonicalName: "column_1"  // Expected: "patient_id"

Solutions:

  1. Add keywords to headers:

    Instead of: "ID"
    Use: "Patient ID" or "MRN"
    
  2. Check domain configuration:

    var domain = new HealthcareDomainConfiguration();
    // Verify "patient" and "mrn" are in HeaderKeywords
  3. Add custom synonyms:

    public class CustomHealthcare : HealthcareDomainConfiguration
    {
        public CustomHealthcare()
        {
            HeaderSynonyms["id"] = "patient_id";
        }
    }

Problem: Merged Cells Cause Errors

Symptoms:

TXTextControl.TextEditorException: Cannot access .Text

Solutions:

This is handled automatically: the library catches these exceptions when reading merged cells.

If you still see errors:

  1. Ensure you're using the latest version
  2. Check that SafeGetCellText is being used
  3. Report the issue with a sample document
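If you need the same defensive behavior in your own code, a sketch of a safe cell read (the library's actual `SafeGetCellText` implementation may differ):

```csharp
static string? SafeGetCellText(TableCell cell)
{
    try
    {
        // Non-anchor cells in a merged region can throw when .Text is read
        return cell.Text;
    }
    catch (TXTextControl.TextEditorException)
    {
        return null;
    }
}
```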

Problem: Total Rows Not Detected

Symptoms:

{
  "TotalRows": null  // Should contain total
}

Solutions:

  1. Check total row keywords:

    var domain = new FinancialDomainConfiguration();
    // Verify "Total", "Subtotal", etc. are in TotalRowKeywords
  2. Verify keyword placement:

    ✓ "Total" present as cell text in ANY column
    ✗ "Total" only implied by a calculation, never written as text
    
  3. Add custom keywords:

    domain.TotalRowKeywords.Add("Sum");
    domain.TotalRowKeywords.Add("Grand Total");

Problem: Poor Performance

Symptoms:

  • Slow processing of large documents

Solutions:

  1. Process tables selectively:

    foreach (Table table in tx.Tables)
    {
        // Skip nested tables
        if (table.NestedLevel > 0) continue;
        
        // Process only top-level tables
        var analysis = analyzer.Analyze(table);
    }
  2. Batch process:

    // Process multiple documents in parallel
    await Task.WhenAll(files.Select(ProcessFileAsync));
  3. Cache analyzer:

    // Reuse analyzer instance
    var analyzer = new TableSemanticAnalyzer();
    foreach (var file in files)
    {
        // Use same analyzer
    }

Requirements

  • TX Text Control .NET (any edition)
  • .NET 10 (C# 14)
  • Windows (for TX Text Control)

Contributing

Contributions welcome! Areas for improvement:

  1. Additional Domains

    • Legal documents
    • Education/academic
    • Real estate
    • Transportation/logistics
  2. Enhanced Detection

    • ML-based domain detection
    • Pattern recognition for complex tables
    • Multi-language support
  3. More Features

    • Table validation rules
    • Schema generation
    • Data quality scoring
    • Export to other formats (CSV, Excel, Database)

Resources


Summary

This library transforms document table extraction from a manual, error-prone process into an intelligent, automated workflow:

  1. Load any TX Text Control document
  2. Analyze tables with automatic domain detection
  3. Extract to structured JSON with semantic understanding
  4. Process with industry-specific field mappings

Zero configuration required - the system automatically understands financial statements, patient records, inventory sheets, and more.

Perfect for:

  • Document processing pipelines
  • Data migration projects
  • Report automation
  • Business intelligence extraction
  • Healthcare data aggregation
  • Financial analysis tools

Get started in a few lines of code:

using var tx = new ServerTextControl();
tx.Load("document.docx", StreamType.WordprocessingML);
var analyzer = new TableSemanticAnalyzer(new TableSemanticAnalysisOptions { AutoDetectDomain = true });
var extractor = new TableJsonExtractor();
var analysis = analyzer.Analyze(tx.Tables[0]);
var data = extractor.Extract(tx.Tables[0], analysis, analyzer.Options);
Console.WriteLine(data.ToJson());
