
TextControl/TXTextControl.SmartTableExtraction


TX Text Control Table Detection & Extraction

Intelligent table analysis and JSON extraction with automatic domain detection

Transform complex document tables into structured, semantic JSON data with zero configuration. This library automatically understands financial statements, healthcare records, manufacturing inventories, and more.




Overview

The Problem

Document tables are everywhere—financial reports, patient records, inventory sheets—but extracting their data programmatically is challenging:

  • Headers are ambiguous: "Patient ID" vs "MRN" vs "Medical Record Number"
  • Tables vary by industry: Financial tables look nothing like manufacturing inventories
  • Structure is inconsistent: Merged cells, multi-row headers, title rows
  • Manual mapping is tedious: Maintaining header-to-field mappings for every variation

The Solution

This library provides semantic table understanding:

// Before: Complex parsing logic for each table type
// After: Automatic understanding
var analyzer = new TableSemanticAnalyzer();
var result = analyzer.Analyze(table);
// Automatically knows: "MRN" → patient_id, "SKU" → product_id, etc.

Key Innovation: The system automatically detects the table's domain (Financial, Healthcare, Manufacturing, Generic) and applies industry-specific knowledge to understand headers, identify key columns, and structure the data correctly.


Key Concepts

1. Domain-Driven Analysis

Tables are analyzed within domain contexts:

| Domain        | Examples                        | Keywords                              |
|---------------|---------------------------------|---------------------------------------|
| Financial     | P&L, Balance Sheet, Credit Risk | counterparty, EBITDA, assets, revenue |
| Healthcare    | Patient Records, Lab Results    | patient, MRN, diagnosis, ICD          |
| Manufacturing | Inventory, BOM, Production      | SKU, quantity, warehouse, batch       |
| Generic       | Any other table                 | name, date, amount, status            |

2. Automatic Canonicalization

Headers are automatically normalized:

"Exposure EUR" → exposure_amount_eur
"Patient ID"   → patient_id  
"Part Number"  → product_id
"Util %"       → utilization_percent
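The normalization above can be sketched in a few lines. This is a simplified illustration, not the library's actual implementation; the real rules live in HeaderCanonicalizer, and domain synonyms (e.g. exposure_eur → exposure_amount_eur) are applied afterwards:

```csharp
using System;
using System.Text.RegularExpressions;

static string Canonicalize(string header)
{
    // Lowercase and expand symbols before stripping punctuation
    string s = header.ToLowerInvariant()
        .Replace("€", " eur ")
        .Replace("$", " usd ")
        .Replace("%", " percent ");

    // Keep only letters, digits, and spaces
    s = Regex.Replace(s, "[^a-z0-9 ]", " ");

    // Collapse whitespace runs into underscores (snake_case)
    return Regex.Replace(s.Trim(), @"\s+", "_");
}

Console.WriteLine(Canonicalize("Exposure EUR")); // exposure_eur
Console.WriteLine(Canonicalize("Util %"));       // util_percent
```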

3. Semantic Type Inference

Columns are automatically typed:

  • identifier: Patient ID, SKU, Counterparty
  • amount: Monetary values
  • percentage: Utilization %, completion rates
  • date: Dates, maturity dates
  • text: Descriptions, names
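A hedged sketch of how such an inference could work — header keywords first, then a sample of the column's data values. The helper name and rules are illustrative, not the library's API:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper: header keywords take priority, data samples break ties.
static string InferType(string header, IReadOnlyList<string> samples)
{
    string h = header.ToLowerInvariant();
    if (h.Contains("%") || h.Contains("percent")) return "percentage";
    if (h.Contains("date") || h.Contains("maturity")) return "date";
    if (h.EndsWith("id") || h.Contains("sku")) return "identifier";

    // Fallback: mostly numeric samples suggest a monetary amount
    int numeric = samples.Count(v => decimal.TryParse(
        v.Trim('€', '$', ' '), NumberStyles.Number,
        CultureInfo.InvariantCulture, out _));
    return numeric > samples.Count / 2 ? "amount" : "text";
}

Console.WriteLine(InferType("Util %", new[] { "45", "80" }));                   // percentage
Console.WriteLine(InferType("Exposure EUR", new[] { "1,250,000", "900,000" })); // amount
```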

4. Total Row Detection

Summary rows are automatically identified and separated:

{
  "Rows": [ /* data rows */ ],
  "TotalRows": {
    "subtotal": { ... },
    "total": { ... }
  }
}
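The detection itself can be sketched as a keyword scan over the row's cells. The keyword list below is illustrative; in the library, each domain configuration supplies its own TotalRowKeywords:

```csharp
using System;
using System.Collections.Generic;

// Sketch: a row is a summary row when any cell starts with a total keyword.
static string? ClassifyTotalRow(IEnumerable<string> cells)
{
    string[] keywords = { "grand total", "subtotal", "total", "average" };
    foreach (string cell in cells)
    {
        string text = cell.Trim().ToLowerInvariant();
        foreach (string kw in keywords)
            if (text.StartsWith(kw)) return kw.Replace(' ', '_');
    }
    return null; // regular data row
}

Console.WriteLine(ClassifyTotalRow(new[] { "Subtotal EMEA", "1,250,000" }));        // subtotal
Console.WriteLine(ClassifyTotalRow(new[] { "Alpha AG", "900,000" }) ?? "data row"); // data row
```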

Architecture

Component Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        TX Text Control                          │
│                         Document/Table                          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                     TableSemanticAnalyzer                       │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  1. Domain Detection (DomainDetector)                    │  │
│  │     • Analyzes headers                                   │  │
│  │     • Scores each domain                                 │  │
│  │     • Selects best match                                 │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  2. Header Row Detection                                 │  │
│  │     • Identifies header rows                             │  │
│  │     • Distinguishes title rows from headers              │  │
│  │     • Handles merged cells                               │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  3. Column Analysis                                      │  │
│  │     • Extracts header text                               │  │
│  │     • Canonicalizes names (HeaderCanonicalizer)          │  │
│  │     • Infers semantic types                              │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  4. Key Column Detection                                 │  │
│  │     • Identifies identifier columns                      │  │
│  │     • Analyzes uniqueness and data types                 │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼ TableSemanticAnalysisResult
                             │   • DetectedDomain
                             │   • HeaderRowNumbers
                             │   • Columns (with canonical names)
                             │   • KeyColumnNumber
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      TableJsonExtractor                         │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  1. Row Processing                                       │  │
│  │     • Iterates data rows                                 │  │
│  │     • Extracts cell values (handles merged cells)        │  │
│  │     • Identifies total rows                              │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  2. Value Conversion                                     │  │
│  │     • Parses amounts (removes currency symbols)          │  │
│  │     • Parses percentages                                 │  │
│  │     • Formats dates                                      │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  3. JSON Structure Building                              │  │
│  │     • Maps to canonical column names                     │  │
│  │     • Groups regular rows vs. total rows                 │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼ ExtractedTableData (JSON)

Key Components

1. DomainDetector

Analyzes table content and selects the appropriate domain configuration.

Input: TX Text Control Table object
Output: Domain name, configuration, and confidence score

2. Domain Configurations

Define industry-specific knowledge:

  • Header keywords
  • Synonym mappings
  • Identifier patterns
  • Total row keywords

3. HeaderCanonicalizer

Converts header text to standardized names:

  • Removes special characters
  • Converts to snake_case
  • Applies domain synonyms

4. TableSemanticAnalyzer

Core analysis engine that:

  • Detects headers
  • Identifies structure
  • Infers semantics
  • Applies domain knowledge

5. TableJsonExtractor

Converts analyzed tables to JSON:

  • Extracts row data
  • Handles merged cells
  • Separates totals
  • Formats values

The Pipeline

Step-by-Step Data Flow

┌─────────────────────┐
│   Load Document     │
│  (TX Text Control)  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  For Each Table     │
└──────────┬──────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 1: DOMAIN DETECTION                                │
│                                                           │
│  Input: Table cells (first 3 rows)                       │
│                                                           │
│  Process:                                                 │
│  1. Extract header cell text                             │
│  2. Score against each domain's keywords                 │
│     • Financial: "Counterparty" → +15 points             │
│     • Healthcare: "Patient" → +15 points                 │
│     • Manufacturing: "SKU" → +15 points                  │
│  3. Select highest scoring domain                        │
│                                                           │
│  Output: DetectedDomain = "Financial" (0.85 confidence)  │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 2: HEADER DETECTION                                │
│                                                           │
│  Process:                                                 │
│  1. Check for explicit header rows (IsHeader property)   │
│  2. If not found, score rows heuristically:              │
│     • Contains domain keywords (+points)                 │
│     • Mostly text, not numbers (+points)                 │
│     • Has formatting (shaded, bordered) (+points)        │
│     • Row 2 preferred over row 1 (+bonus)                │
│     • Single merged cell = title row (-penalty)          │
│  3. Select best scoring row                              │
│                                                           │
│  Output: HeaderRowNumbers = [2]                          │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 3: COLUMN ANALYSIS                                 │
│                                                           │
│  For each column:                                         │
│  1. Extract header text from detected header row         │
│     "Exposure EUR"                                        │
│                                                           │
│  2. Auto-canonicalize:                                   │
│     • Convert to lowercase                               │
│     • Replace special chars (€→eur, %→percent)           │
│     • Convert to snake_case: "exposure_eur"              │
│                                                           │
│  3. Apply domain synonyms:                               │
│     "exposure_eur" → "exposure_amount_eur"               │
│                                                           │
│  4. Infer semantic type:                                 │
│     • Check header keywords (% → percentage)             │
│     • Sample first 10 data rows                          │
│     • Detect patterns (numbers → amount, dates → date)   │
│     Type = "amount"                                      │
│                                                           │
│  Output: Column {                                         │
│    HeaderText: "Exposure EUR",                           │
│    CanonicalName: "exposure_amount_eur",                 │
│    SemanticType: "amount"                                │
│  }                                                        │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 4: KEY COLUMN DETECTION                            │
│                                                           │
│  Process:                                                 │
│  1. For each column, analyze data rows:                  │
│     • Text ratio (non-numeric values)                    │
│     • Uniqueness ratio (distinct values)                 │
│     • Left position bias (column 1 or 2)                 │
│     • Semantic boost (identifier headers)                │
│  2. Select column with highest score                     │
│                                                           │
│  Output: KeyColumnNumber = 1 (Counterparty)             │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ PHASE 5: DATA EXTRACTION                                 │
│                                                           │
│  For each row (starting after header):                   │
│  1. Extract cell values:                                 │
│     • Direct cell → use value                            │
│     • Merged cell → search for anchor                    │
│     • Empty cell → null                                  │
│                                                           │
│  2. Check if total row:                                  │
│     • Scan all columns for total keywords                │
│     • "Total", "Subtotal", "Average", etc.               │
│                                                           │
│  3. Convert values by semantic type:                     │
│     • amount: Parse decimal, remove €/$                  │
│     • percentage: Parse decimal, remove %                │
│     • date: Format as yyyy-MM-dd                         │
│     • text: Keep as-is                                   │
│                                                           │
│  4. Map to canonical column names:                       │
│     {                                                     │
│       "counterparty": "Alpha AG",                        │
│       "exposure_amount_eur": 1250000,                    │
│       "rating": "A"                                      │
│     }                                                     │
│                                                           │
│  Output: Rows[] or TotalRows{}                           │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│ FINAL OUTPUT: JSON                                       │
│                                                           │
│ {                                                         │
│   "DetectedDomain": "Financial",                         │
│   "DomainConfidence": 0.85,                              │
│   "HeaderRow": ["Counterparty", "Exposure EUR", ...],    │
│   "Columns": [ ... ],                                     │
│   "Rows": [ ... ],                                        │
│   "TotalRows": { ... }                                    │
│ }                                                         │
└───────────────────────────────────────────────────────────┘
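The Phase 5 value conversion above can be sketched as follows. This is a simplified, invariant-culture version; the extractor's actual parsing may be locale-aware:

```csharp
using System;
using System.Globalization;

static object? ConvertValue(string raw, string semanticType)
{
    string s = raw.Trim();
    switch (semanticType)
    {
        case "amount":
            // Strip currency symbols and thousands separators, then parse
            s = s.Replace("€", "").Replace("$", "").Replace(",", "").Trim();
            return decimal.TryParse(s, NumberStyles.Number,
                CultureInfo.InvariantCulture, out decimal d) ? (object)d : null;
        case "percentage":
            s = s.Replace("%", "").Trim();
            return decimal.TryParse(s, NumberStyles.Number,
                CultureInfo.InvariantCulture, out decimal p) ? (object)p : null;
        case "date":
            return DateTime.TryParse(s, CultureInfo.InvariantCulture,
                DateTimeStyles.None, out DateTime dt) ? dt.ToString("yyyy-MM-dd") : s;
        default:
            return s; // text: keep as-is
    }
}

Console.WriteLine(ConvertValue("€1,250,000", "amount"));  // 1250000
Console.WriteLine(ConvertValue("62.5 %", "percentage"));  // 62.5
```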

Quick Start

Installation

# Add TX Text Control reference (prerequisite)
# Add this project to your solution

Basic Usage

using System;
using TextControl.TableDetection;
using TextControl.TableDetection.Domain;
using TXTextControl;

// Load document
using var tx = new ServerTextControl();
tx.Create();
tx.Load("financial-report.docx", StreamType.WordprocessingML);

Console.WriteLine($"Tables detected: {tx.Tables.Count}");

if (tx.Tables.Count == 0)
{
    Console.WriteLine("No tables found.");
    return;
}

// Mode 1: Auto-detect domain (recommended)
var options = new TableSemanticAnalysisOptions
{
    SkipNestedTables = true,
    MaxHeaderRowsToInspect = 3,
    AutoDetectDomain = true,
    Domain = null  // Let it auto-detect
};

// Mode 2: Explicit domain (uncomment to use)
// var options = new TableSemanticAnalysisOptions
// {
//     AutoDetectDomain = false,
//     Domain = new FinancialDomainConfiguration()
// };

var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();

foreach (Table table in tx.Tables)
{
    // Analyze table structure and semantics
    var analysis = analyzer.Analyze(table);
    
    // Extract data to JSON
    var data = extractor.Extract(table, analysis, options);
    
    // Output results
    Console.WriteLine(new string('=', 80));
    Console.WriteLine($"Table ID: {data.TableId}");
    Console.WriteLine($"Detected Domain: {analysis.DetectedDomain} (Confidence: {analysis.DomainConfidence:P0})");
    Console.WriteLine();
    Console.WriteLine(data.ToJson());
    Console.WriteLine();
}

Output:

================================================================================
Table ID: 0
Detected Domain: Financial (Confidence: 85%)

{
  "TableId": 0,
  "DetectedDomain": "Financial",
  "TableName": "Credit Exposure Summary",
  "HeaderRow": ["Counterparty", "Exposure EUR", "Rating"],
  "Rows": [
    { "counterparty": "Alpha AG", "exposure_amount_eur": 1250000, "rating": "A" }
  ]
}

Domain Configurations

Financial Domain 🏦

Best for: P&L statements, balance sheets, credit risk tables, financial reports

Keywords (150+):

  • Credit/Risk: counterparty, exposure, rating, limit, utilization
  • P&L: revenue, COGS, EBITDA, operating profit, net income
  • Balance Sheet: assets, liabilities, equity, cash, inventory
  • Periods: YTD, QTD, actual, budget, variance

📖 Complete Financial Guide →

Example:

| Counterparty | Exposure EUR | Rating | Limit     |
|--------------|--------------|--------|-----------|
| Alpha AG     | 1,250,000    | A      | 2,000,000 |

→ Detects as Financial (85% confidence)

Healthcare Domain 🏥

Best for: Patient records, clinical data, diagnoses, medications, lab results

Keywords (80+):

  • Identifiers: patient, MRN, medical record number
  • Clinical: diagnosis, ICD, CPT, procedure, medication
  • Visits: encounter, admission, discharge, LOS
  • Providers: physician, doctor, department

📖 Complete Healthcare Guide →

Example:

| Patient ID | MRN    | Diagnosis           | Physician   |
|------------|--------|---------------------|-------------|
| P001       | 123456 | Type 2 Diabetes     | Dr. Smith   |

→ Detects as Healthcare (92% confidence)

Manufacturing Domain 🏭

Best for: Inventory, supply chain, production, parts management

Keywords (90+):

  • Products: SKU, part number, item
  • Inventory: quantity, stock, warehouse
  • Supply Chain: supplier, vendor, manufacturer
  • Production: batch, lot, production date

📖 Complete Manufacturing Guide →

Example:

| SKU        | Description    | Qty  | Warehouse | Supplier   |
|------------|----------------|------|-----------|------------|
| P-001-2024 | Steel Bolt M8  | 5000 | WH-01     | FastCo     |

→ Detects as Manufacturing (88% confidence)

Generic Domain 📋

Best for: General-purpose tables, mixed data, custom business tables

Keywords (30+):

  • Common: name, id, description, date, amount
  • Status: status, type, category
  • Numeric: quantity, value, count, total

📖 Complete Generic Guide →

Example:

| Employee ID | Name        | Department | Status  |
|-------------|-------------|------------|---------|
| E001        | John Smith  | IT         | Active  |

→ Detects as Generic (65% confidence)


Sample Usage Scenarios

Scenario 1: Financial Report Analysis

Use Case: Extract P&L data from Word document for analysis

using var tx = new ServerTextControl();
tx.Create();
tx.Load("Q4-2024-Financial-Report.docx", StreamType.WordprocessingML);

var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = true
};

var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    
    // Only process financial tables
    if (analysis.DetectedDomain == "Financial" && 
        analysis.DomainConfidence > 0.7)
    {
        var data = extractor.Extract(table, analysis, options);
        
        // Extract specific metrics
        foreach (var row in data.Rows)
        {
            if (row.TryGetValue("revenue", out var revenue))
            {
                Console.WriteLine($"Revenue: {revenue}");
            }
            if (row.TryGetValue("ebitda", out var ebitda))
            {
                Console.WriteLine($"EBITDA: {ebitda}");
            }
        }
        
        // Check totals
        if (data.TotalRows?.TryGetValue("total", out var totals) == true)
        {
            Console.WriteLine("Total Row:");
            // JsonSerializer requires: using System.Text.Json;
            Console.WriteLine(JsonSerializer.Serialize(totals, new JsonSerializerOptions 
            { 
                WriteIndented = true 
            }));
        }
    }
}

Scenario 2: Healthcare Data Migration

Use Case: Extract patient records for system migration

using var tx = new ServerTextControl();
tx.Create();
tx.Load("patient-census.tx", StreamType.InternalUnicodeFormat);

// Force Healthcare domain for consistency
var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = false,
    Domain = new HealthcareDomainConfiguration()
};

var analyzer = new TableSemanticAnalyzer(options);
var extractor = new TableJsonExtractor();

var patients = new List<PatientRecord>();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    var data = extractor.Extract(table, analysis, options);
    
    // Map to domain model
    foreach (var row in data.Rows)
    {
        patients.Add(new PatientRecord
        {
            PatientId = row.GetValueOrDefault("patient_id")?.ToString(),
            MRN = row.GetValueOrDefault("patient_id")?.ToString(), // Synonym mapped
            DateOfBirth = row.GetValueOrDefault("date_of_birth")?.ToString(),
            Physician = row.GetValueOrDefault("physician")?.ToString()
        });
    }
}

// Export to database
await SavePatientsAsync(patients);

Scenario 3: Inventory Management

Use Case: Extract inventory data for warehouse system

using var tx = new ServerTextControl();
tx.Create();
tx.Load("inventory-report.docx", StreamType.WordprocessingML);

var analyzer = new TableSemanticAnalyzer(); // Auto-detect
var extractor = new TableJsonExtractor();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    
    if (analysis.DetectedDomain == "Manufacturing")
    {
        var data = extractor.Extract(table, analysis, analyzer.Options);
        
        // Process inventory items
        foreach (var row in data.Rows)
        {
            var item = new InventoryItem
            {
                SKU = row.GetValueOrDefault("product_id")?.ToString(),
                Quantity = Convert.ToInt32(row.GetValueOrDefault("quantity")),
                Warehouse = row.GetValueOrDefault("warehouse")?.ToString(),
                Supplier = row.GetValueOrDefault("supplier")?.ToString()
            };
            
            // Update inventory system
            await UpdateInventoryAsync(item);
        }
    }
}

Scenario 4: Multi-Domain Document Processing

Use Case: Process document with tables from multiple domains

using var tx = new ServerTextControl();
tx.Create();
tx.Load("annual-report.docx", StreamType.WordprocessingML);

var analyzer = new TableSemanticAnalyzer();
var extractor = new TableJsonExtractor();

var results = new Dictionary<string, List<ExtractedTableData>>();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    var data = extractor.Extract(table, analysis, analyzer.Options);
    
    // Group by detected domain
    if (!results.ContainsKey(analysis.DetectedDomain))
    {
        results[analysis.DetectedDomain] = new List<ExtractedTableData>();
    }
    results[analysis.DetectedDomain].Add(data);
    
    // Log detection
    Console.WriteLine($"Table {table.ID}: {analysis.DetectedDomain} " +
                     $"({analysis.DomainConfidence:P0})");
}

// Process by domain
if (results.ContainsKey("Financial"))
{
    ProcessFinancialTables(results["Financial"]);
}
if (results.ContainsKey("Manufacturing"))
{
    ProcessInventoryTables(results["Manufacturing"]);
}

Scenario 5: Custom Domain Implementation

Use Case: Create domain for retail industry

// Define custom retail domain
public class RetailDomainConfiguration : IDomainConfiguration
{
    public ISet<string> HeaderKeywords { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "product", "sku", "barcode", "upc",
        "price", "cost", "margin", "discount",
        "store", "location", "sales", "units sold",
        "category", "brand", "supplier"
    };

    public ISet<string> TotalRowKeywords { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "total", "subtotal", "grand total"
    };

    public IDictionary<string, string> HeaderSynonyms { get; } = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
    {
        ["upc"] = "barcode",
        ["product code"] = "sku",
        ["retail price"] = "price",
        ["units"] = "quantity_sold"
    };

    public ISet<string> IdentifierHeaders { get; } = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "product", "sku", "barcode", "upc"
    };
}

// Use custom domain
var options = new TableSemanticAnalysisOptions
{
    Domain = new RetailDomainConfiguration()
};

var analyzer = new TableSemanticAnalyzer(options);

Scenario 6: Batch Processing with Error Handling

Use Case: Process multiple documents with error handling

public async Task ProcessDocumentsAsync(string[] filePaths)
{
    var analyzer = new TableSemanticAnalyzer();
    var extractor = new TableJsonExtractor();
    
    foreach (var filePath in filePaths)
    {
        try
        {
            using var tx = new ServerTextControl();
            tx.Create();
            tx.Load(filePath, StreamType.WordprocessingML);
            
            Console.WriteLine($"\nProcessing: {Path.GetFileName(filePath)}");
            Console.WriteLine($"Tables found: {tx.Tables.Count}");
            
            foreach (Table table in tx.Tables)
            {
                try
                {
                    var analysis = analyzer.Analyze(table);
                    var data = extractor.Extract(table, analysis, analyzer.Options);
                    
                    // Save to database or file
                    await SaveTableDataAsync(filePath, table.ID, data);
                    
                    Console.WriteLine($"  ✓ Table {table.ID}: {analysis.DetectedDomain} " +
                                    $"({data.Rows.Count} rows)");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"  ✗ Table {table.ID}: {ex.Message}");
                    // Log error but continue processing
                    await LogErrorAsync(filePath, table.ID, ex);
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Failed to process {filePath}: {ex.Message}");
        }
    }
}

Scenario 7: Diagnostics and Debugging

Use Case: Understand detection decisions for troubleshooting

var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = true
};

var analyzer = new TableSemanticAnalyzer(options);
var analysis = analyzer.Analyze(table);

// Review diagnostics
Console.WriteLine("=== Analysis Diagnostics ===");
foreach (var diagnostic in analysis.Diagnostics)
{
    Console.WriteLine($"  {diagnostic}");
}

// Output domain detection
Console.WriteLine($"\nDetected Domain: {analysis.DetectedDomain}");
Console.WriteLine($"Confidence: {analysis.DomainConfidence:P2}");

// Review detected structure
Console.WriteLine($"\nTable Structure:");
Console.WriteLine($"  Name: {analysis.DetectedTableName}");
Console.WriteLine($"  Header Rows: {string.Join(", ", analysis.HeaderRowNumbers)}");
Console.WriteLine($"  Key Column: {analysis.KeyColumnNumber}");
Console.WriteLine($"  Columns: {analysis.Columns.Count}");

// Review each column
Console.WriteLine($"\nColumn Details:");
foreach (var col in analysis.Columns)
{
    Console.WriteLine($"  {col.ColumnNumber}. {col.HeaderText}");
    Console.WriteLine($"      Canonical: {col.CanonicalName}");
    Console.WriteLine($"      Type: {col.SemanticType}");
    Console.WriteLine($"      Confidence: {col.Confidence:P0}");
}

Example Output:

=== Analysis Diagnostics ===
  Auto-detected domain: Financial (confidence: 0.85)
  Detected header row heuristically: 2
  Detected key column: 1
  Detected 5 columns

Detected Domain: Financial
Confidence: 85.00%

Table Structure:
  Name: Credit Exposure Summary
  Header Rows: 2
  Key Column: 1
  Columns: 5

Column Details:
  1. Counterparty
      Canonical: counterparty
      Type: identifier
      Confidence: 85%
  2. Exposure EUR
      Canonical: exposure_amount_eur
      Type: amount
      Confidence: 90%
  ...

Advanced Features

1. Merged Cell Handling

The library safely handles merged cells without exceptions:

// Table with merged title row:
// ┌─────────────────────────────┐
// │  Credit Exposure Summary    │  ← Single merged cell
// ├──────────┬─────────┬────────┤
// │ Name     │ Amount  │ Rating │  ← Actual headers
// └──────────┴─────────┴────────┘

var analysis = analyzer.Analyze(table);
// Correctly identifies row 2 as headers, not row 1
// InternalTitle = "Credit Exposure Summary"
// HeaderRowNumbers = [2]

How it works:

  • Detects merged cells by catching TextEditorException on .Text access
  • Penalizes single-cell rows in header detection
  • Automatically identifies title rows vs. header rows

2. Multi-Row Headers

Handles headers spanning multiple rows:

// Table with multi-row header:
// ┌──────────┬─────────────────┬─────────┐
// │          │   Financial     │         │
// │ Account  ├──────────┬──────┤ Status  │
// │          │ Budget   │Actual│         │
// └──────────┴──────────┴──────┴─────────┘

var options = new TableSemanticAnalysisOptions
{
    MaxHeaderRowsToInspect = 3  // Check first 3 rows
};

// Composite headers: "Financial Budget", "Financial Actual"
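Conceptually, the composite header is the join of each column's non-empty header fragments, top to bottom. A minimal sketch (ComposeHeader is a hypothetical helper, not part of the library's API):

```csharp
using System;
using System.Linq;

// Join the non-empty header fragments of one column, top to bottom,
// to form composites such as "Financial Budget" / "Financial Actual".
static string ComposeHeader(params string?[] rowTexts) =>
    string.Join(" ", rowTexts
        .Select(t => t?.Trim())
        .Where(t => !string.IsNullOrEmpty(t)));

Console.WriteLine(ComposeHeader("Financial", "Budget")); // Financial Budget
Console.WriteLine(ComposeHeader(null, "Status"));        // Status
```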

3. Multiple Total Rows

Supports subtotals, totals, averages:

{
  "TotalRows": {
    "subtotal": {
      "region": "EMEA Subtotal",
      "amount": 1250000
    },
    "total": {
      "region": "Grand Total",
      "amount": 5000000
    },
    "average": {
      "region": "Average",
      "amount": 625000
    }
  }
}

4. Confidence Scoring

Every detection includes confidence:

var analysis = analyzer.Analyze(table);

if (analysis.DomainConfidence < 0.6)
{
    Console.WriteLine("⚠️ Low confidence detection - review results");
}

foreach (var col in analysis.Columns)
{
    if (col.Confidence < 0.5)
    {
        Console.WriteLine($"⚠️ Column '{col.HeaderText}' has low confidence");
    }
}

5. Custom Type Conversion

Override value conversion for specific types:

// Custom extractor with specialized conversion
public class CustomExtractor : TableJsonExtractor
{
    protected override object? ConvertValue(string raw, string semanticType)
    {
        return semanticType switch
        {
            "amount" when raw.Contains("€") => ParseEuroAmount(raw),
            "date" when raw.Contains("/") => ParseCustomDate(raw),
            _ => base.ConvertValue(raw, semanticType)
        };
    }
}

Component Reference

TableSemanticAnalyzer

Purpose: Analyzes table structure and applies semantic understanding

Key Methods:

public TableSemanticAnalysisResult Analyze(Table table)

Returns:

  • DetectedDomain - Auto-detected or configured domain
  • DomainConfidence - Detection confidence (0.0-1.0)
  • HeaderRowNumbers - Detected header row(s)
  • KeyColumnNumber - Primary identifier column
  • Columns - Column metadata with canonical names and types
  • Diagnostics - Analysis decision log

TableJsonExtractor

Purpose: Extracts table data to JSON structure

Key Methods:

public ExtractedTableData Extract(
    Table table, 
    TableSemanticAnalysisResult analysis,
    TableSemanticAnalysisOptions options)

Returns:

  • DetectedDomain - Domain used for extraction
  • HeaderRow - Original header texts
  • Columns - Column definitions
  • Rows - Data rows as dictionaries
  • TotalRows - Summary rows grouped by type

DomainDetector

Purpose: Automatically selects best matching domain

Key Methods:

public static (string DomainName, IDomainConfiguration Config, double Confidence) 
    DetectDomain(Table table)

Algorithm:

  • Scans first 3 rows for keywords
  • Scores each domain (0-100+)
  • Returns best match with confidence
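The scoring idea can be sketched as counting keyword hits per domain and normalizing the winner's share into a rough confidence. The weights, keyword lists, and normalization below are illustrative, not the library's actual formula:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Count header cells that hit each domain's keyword list; best score wins.
static (string Domain, double Confidence) Detect(
    IReadOnlyList<string> headers,
    IDictionary<string, string[]> domainKeywords)
{
    string best = "Generic";
    int bestScore = 0, total = 0;
    foreach (var pair in domainKeywords)
    {
        int score = headers.Count(h => pair.Value.Any(
            k => h.Contains(k, StringComparison.OrdinalIgnoreCase)));
        total += score;
        if (score > bestScore) { bestScore = score; best = pair.Key; }
    }
    return (best, total == 0 ? 0.0 : (double)bestScore / total);
}

var keywords = new Dictionary<string, string[]>
{
    ["Financial"]  = new[] { "counterparty", "exposure", "rating" },
    ["Healthcare"] = new[] { "patient", "mrn", "diagnosis" }
};

var (domain, confidence) = Detect(
    new[] { "Counterparty", "Exposure EUR", "Rating" }, keywords);
Console.WriteLine(domain); // Financial
```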

HeaderCanonicalizer

Purpose: Converts header text to canonical names

Key Methods:

public static string Canonicalize(string headerText)
public static string ApplySynonym(string canonicalName, IDomainConfiguration domain)

Process:

  • Lowercase conversion
  • Special character replacement (€→eur, %→percent)
  • Non-alphanumeric removal
  • Snake_case conversion
  • Domain synonym application
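The steps above applied to a sample header (the intermediate forms and the synonym mapping are illustrative):

```csharp
var canonical = HeaderCanonicalizer.Canonicalize("Exposure EUR %");
// "exposure eur percent"  after lowercasing and €/% replacement
// "exposure_eur_percent"  after snake_case conversion

var mapped = HeaderCanonicalizer.ApplySynonym(canonical, new FinancialDomainConfiguration());
// domain synonyms can then map the result to e.g. "exposure_amount_eur"
```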

Domain Configurations

Interface:

public interface IDomainConfiguration
{
    ISet<string> HeaderKeywords { get; }
    ISet<string> TotalRowKeywords { get; }
    IDictionary<string, string> HeaderSynonyms { get; }
    ISet<string> IdentifierHeaders { get; }
}

Built-in Implementations:

  • FinancialDomainConfiguration
  • HealthcareDomainConfiguration
  • ManufacturingDomainConfiguration
  • GenericDomainConfiguration
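A custom domain can implement the interface directly. A hypothetical logistics configuration as an example (all keywords and synonyms are invented for illustration):

```csharp
public class LogisticsDomainConfiguration : IDomainConfiguration
{
    public ISet<string> HeaderKeywords { get; } =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "shipment", "carrier", "tracking", "origin", "destination" };

    public ISet<string> TotalRowKeywords { get; } =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "total", "subtotal" };

    public IDictionary<string, string> HeaderSynonyms { get; } =
        new Dictionary<string, string>
        { ["tracking_no"] = "tracking_number", ["awb"] = "tracking_number" };

    public ISet<string> IdentifierHeaders { get; } =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "shipment_id", "tracking_number" };
}
```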

Best Practices

1. Start with Auto-Detection

Recommended:

var analyzer = new TableSemanticAnalyzer();  // Auto-detect

Avoid premature optimization:

// Don't force domain unless necessary
var options = new TableSemanticAnalysisOptions
{
    AutoDetectDomain = false,
    Domain = new FinancialDomainConfiguration()  // Only if required
};

2. Check Confidence Scores

var analysis = analyzer.Analyze(table);

if (analysis.DomainConfidence < 0.6)
{
    // Low confidence - consider:
    // 1. Adding domain-specific keywords to headers
    // 2. Using explicit domain configuration
    // 3. Creating custom domain
    Console.WriteLine($"⚠️ Low confidence: {analysis.DomainConfidence:P0}");
}

3. Review Diagnostics for Issues

// Always check diagnostics when results are unexpected
foreach (var diagnostic in analysis.Diagnostics)
{
    Console.WriteLine(diagnostic);
}

4. Handle Merged Cells Gracefully

The library handles this automatically, but be aware:

// Merged cells return null for non-anchor cells
// This is expected behavior
var data = extractor.Extract(table, analysis, options);
// Some columns may have null values in merged regions

5. Use Appropriate Domains

Data Type Domain
Financial statements Financial
Patient records Healthcare
Inventory/supply chain Manufacturing
Generic business data Generic (auto-selected)
Custom industry Create custom domain

6. Validate Critical Data

var data = extractor.Extract(table, analysis, options);

foreach (var row in data.Rows)
{
    // Validate required fields
    if (!row.ContainsKey("patient_id"))
    {
        Console.WriteLine("⚠️ Missing required field: patient_id");
    }
    
    // Validate data types
    if (row.TryGetValue("amount", out var amount) && amount is not decimal)
    {
        Console.WriteLine($"⚠️ Expected decimal, got {amount?.GetType()}");
    }
}

7. Process Tables in Context

// Group related tables
var tablesByDomain = new Dictionary<string, List<ExtractedTableData>>();

foreach (Table table in tx.Tables)
{
    var analysis = analyzer.Analyze(table);
    var data = extractor.Extract(table, analysis, analyzer.Options);
    
    if (!tablesByDomain.TryGetValue(analysis.DetectedDomain, out var list))
    {
        list = new List<ExtractedTableData>();
        tablesByDomain[analysis.DetectedDomain] = list;
    }
    list.Add(data);
}

Troubleshooting

Problem: Wrong Domain Detected

Symptoms:

Domain: Generic (Confidence: 55%)
Expected: Financial

Solutions:

  1. Check header keywords:

    // Add domain-specific terms to headers
    // Instead of: "Amount", "Value"
    // Use: "Exposure EUR", "Market Value"
  2. Force explicit domain:

    var options = new TableSemanticAnalysisOptions
    {
        AutoDetectDomain = false,
        Domain = new FinancialDomainConfiguration()
    };
  3. Review diagnostics:

    foreach (var diagnostic in analysis.Diagnostics)
    {
        Console.WriteLine(diagnostic);
    }
    // Look for: "Auto-detected domain: X (confidence: Y)"

Problem: Wrong Headers Detected

Symptoms:

HeaderRowNumbers: [1]  // Should be [2]

Solutions:

  1. Check for title rows:

    • Row 1 with single merged cell = title (penalized)
    • System should auto-detect row 2 as headers
  2. Increase inspection depth:

    var options = new TableSemanticAnalysisOptions
    {
        MaxHeaderRowsToInspect = 5  // Check more rows
    };
  3. Mark headers explicitly:

    // In TX Text Control, mark row as header
    table.Rows[1].IsHeader = true;

Problem: Columns Have Generic Names

Symptoms:

CanonicalName: "column_1"  // Expected: "patient_id"

Solutions:

  1. Add keywords to headers:

    Instead of: "ID"
    Use: "Patient ID" or "MRN"
    
  2. Check domain configuration:

    var domain = new HealthcareDomainConfiguration();
    // Verify "patient" and "mrn" are in HeaderKeywords
  3. Add custom synonyms:

    public class CustomHealthcare : HealthcareDomainConfiguration
    {
        public CustomHealthcare()
        {
            HeaderSynonyms["id"] = "patient_id";
        }
    }

Problem: Merged Cells Cause Errors

Symptoms:

TXTextControl.TextEditorException: Cannot access .Text

Solutions:

This is handled automatically: the library catches these exceptions when reading merged cells.

If you still see errors:

  1. Ensure you're using the latest version
  2. Check that SafeGetCellText is being used
  3. Report the issue with a sample document
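If you need the same defensive behavior in your own code, a sketch of a safe cell read (the library's actual `SafeGetCellText` implementation may differ):

```csharp
static string? SafeGetCellText(TableCell cell)
{
    try
    {
        // Non-anchor cells in a merged region can throw when .Text is read
        return cell.Text;
    }
    catch (TXTextControl.TextEditorException)
    {
        return null;
    }
}
```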

Problem: Total Rows Not Detected

Symptoms:

{
  "TotalRows": null  // Should contain total
}

Solutions:

  1. Check total row keywords:

    var domain = new FinancialDomainConfiguration();
    // Verify "Total", "Subtotal", etc. are in TotalRowKeywords
  2. Verify keyword placement:

    ✓ "Total" present as cell text in ANY column
    ✗ "Total" only implied by a calculation, never written as text
    
  3. Add custom keywords:

    domain.TotalRowKeywords.Add("Sum");
    domain.TotalRowKeywords.Add("Grand Total");

Problem: Poor Performance

Symptoms:

  • Slow processing of large documents

Solutions:

  1. Process tables selectively:

    foreach (Table table in tx.Tables)
    {
        // Skip nested tables
        if (table.NestedLevel > 0) continue;
        
        // Process only top-level tables
        var analysis = analyzer.Analyze(table);
    }
  2. Batch process:

    // Process multiple documents in parallel
    await Task.WhenAll(files.Select(ProcessFileAsync));
  3. Cache analyzer:

    // Reuse analyzer instance
    var analyzer = new TableSemanticAnalyzer();
    foreach (var file in files)
    {
        // Use same analyzer
    }

Requirements

  • TX Text Control .NET (any edition)
  • .NET 10 (C# 14)
  • Windows (for TX Text Control)

Contributing

Contributions welcome! Areas for improvement:

  1. Additional Domains

    • Legal documents
    • Education/academic
    • Real estate
    • Transportation/logistics
  2. Enhanced Detection

    • ML-based domain detection
    • Pattern recognition for complex tables
    • Multi-language support
  3. More Features

    • Table validation rules
    • Schema generation
    • Data quality scoring
    • Export to other formats (CSV, Excel, Database)

Resources


Summary

This library transforms document table extraction from a manual, error-prone process into an intelligent, automated workflow:

  1. Load any TX Text Control document
  2. Analyze tables with automatic domain detection
  3. Extract to structured JSON with semantic understanding
  4. Process with industry-specific field mappings

Zero configuration required - the system automatically understands financial statements, patient records, inventory sheets, and more.

Perfect for:

  • Document processing pipelines
  • Data migration projects
  • Report automation
  • Business intelligence extraction
  • Healthcare data aggregation
  • Financial analysis tools

Get started in a few lines of code:

using var tx = new ServerTextControl();
tx.Load("document.docx", StreamType.WordprocessingML);
var analyzer = new TableSemanticAnalyzer(new TableSemanticAnalysisOptions { AutoDetectDomain = true });
var extractor = new TableJsonExtractor();
var analysis = analyzer.Analyze(tx.Tables[0]);
var data = extractor.Extract(tx.Tables[0], analysis, analyzer.Options);
Console.WriteLine(data.ToJson());
