Proposal: Refactor GeoSDK into composable modules to support multiple country schemas #1

@francbartoli

Context

We're building ANNCSU Viewer, an open-source address viewer for Italy's national address archive (~14M addresses). We adopted several architectural patterns from geocoding-sdk (H3 tiling, Jaccard similarity, LRU caching, smart query detection) but had to reimplement them because the SDK is tightly coupled to the Saudi Arabia dataset schema.

We'd like to contribute upstream to make the SDK reusable across different countries and schemas.

Problem

The GeoSDK class in geocoder-h3.ts (~1685 lines) is a monolith that mixes:

  • DuckDB WASM lifecycle and tile loading (generic)
  • H3 tile index management (generic)
  • LRU caching (generic)
  • FTS/BM25 and Jaccard search (generic algorithm, schema-specific field names)
  • Arabic/English language detection (Saudi-specific)
  • Address field references like full_address_ar, full_address_en (Saudi-specific)
  • Admin hierarchy with Saudi region names (Saudi-specific)

Because generic and Saudi-specific code live in the same class, using the SDK with a different dataset currently means forking the entire class.

Proposal

Refactor into composable modules:

1. @tabaqat/core — Generic infrastructure

  • DuckDB WASM initialization and lifecycle
  • H3 tile index loading and spatial filtering
  • HTTP tile fetching with parallel downloads
  • LRU cache (search cache + admin cache + grid-based cache)
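To make the caching piece of the core module concrete, here is a minimal sketch of an LRU cache with per-entry expiry. It is not the existing GeoSDK implementation: the class name `LruCache`, its constructor parameters, and the injectable `now` clock are assumptions chosen for illustration. The same class could plausibly back the search, admin, and grid caches with different sizes and TTLs.

```typescript
// Illustrative sketch only: LruCache and its parameters are hypothetical names,
// not the existing GeoSDK API.
interface CacheEntry<V> {
  value: V
  expiresAt: number
}

class LruCache<K, V> {
  // A Map iterates in insertion order, so the first key is the least recently used.
  private entries = new Map<K, CacheEntry<V>>()
  private maxEntries: number
  private ttlMs: number
  private now: () => number

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries
    this.ttlMs = ttlMs
    this.now = now
  }

  get(key: K): V | undefined {
    const hit = this.entries.get(key)
    if (hit === undefined) return undefined
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key) // expired: drop the entry and report a miss
      return undefined
    }
    // Re-insert so this key becomes the most recently used (expiry is not extended).
    this.entries.delete(key)
    this.entries.set(key, hit)
    return hit.value
  }

  set(key: K, value: V): void {
    this.entries.delete(key)
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (the first key in insertion order).
      const oldest = this.entries.keys().next()
      if (!oldest.done) this.entries.delete(oldest.value)
    }
  }

  get size(): number {
    return this.entries.size
  }
}
```

Passing the clock as a parameter keeps expiry testable without timers; in production the default `Date.now()` clock would be used.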

2. @tabaqat/search — Pluggable search engine

  • FTS/BM25 index creation and querying
  • Jaccard similarity fallback
  • Multi-term CONTAINS filtering
  • Configurable via a SearchConfig interface (field names, stemmer, etc.)
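As a rough illustration of how the Jaccard fallback could become schema-agnostic, the sketch below scores rows against whatever fields a `SearchConfig` names. The shape of `SearchConfig` shown here (`fields`, `stemmer`, `minJaccard`) is an assumption, since the proposal only says it carries field names, a stemmer, and so on. Token-level Jaccard is also an assumption; the current SDK may compare character n-grams or normalize text differently.

```typescript
// Hypothetical shape: the proposal only says SearchConfig carries "field names, stemmer, etc."
interface SearchConfig {
  fields: string[]   // columns to compare against the query
  stemmer?: string   // FTS stemmer name; unused by the Jaccard fallback below
  minJaccard: number // assumed similarity threshold in [0, 1]
}

type Row = Record<string, unknown>

// Lowercase, split on whitespace and common punctuation, and drop empty tokens.
function tokenize(text: string): Set<string> {
  const tokens = text.toLowerCase().split(/[\s,.;:()'"-]+/).filter((t) => t.length > 0)
  return new Set(tokens)
}

// Jaccard similarity |A ∩ B| / |A ∪ B| over the two token sets.
function jaccard(a: string, b: string): number {
  const ta = tokenize(a)
  const tb = tokenize(b)
  if (ta.size === 0 || tb.size === 0) return 0
  let shared = 0
  ta.forEach((t) => {
    if (tb.has(t)) shared += 1
  })
  return shared / (ta.size + tb.size - shared)
}

// Score each row by its best-matching configured field, keep rows at or above
// the threshold, and sort best first.
function rankByJaccard(query: string, rows: Row[], config: SearchConfig): { row: Row; score: number }[] {
  return rows
    .map((row) => {
      const scores = config.fields.map((f) => jaccard(query, String(row[f] ?? '')))
      return { row, score: Math.max(0, ...scores) }
    })
    .filter((r) => r.score >= config.minJaccard)
    .sort((x, y) => y.score - x.score)
}
```

With this shape, the Saudi configuration would list `full_address_ar` and `full_address_en` in `fields`, while an ANNCSU configuration would list only `ODONIMO`; the scoring code itself never mentions a column name.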

3. @tabaqat/schema — Schema adapter interface

interface SchemaAdapter {
  // Field mappings
  addressFields: string[]          // Fields to search
  displayAddress: (row: Record<string, unknown>) => string
  
  // Optional features
  language?: {
    detect: (query: string) => string
    fields: Record<string, string>   // language → field name
    stemmers: Record<string, string> // language → stemmer
  }
  
  // Municipality / admin hierarchy
  municipality?: {
    nameField: string
    codeField: string
  }
  
  // Postcode
  postcode?: {
    pattern: RegExp
    field: string
  }
}

Example adapters:

// Saudi Arabia (current behavior, no breaking changes)
const saudiAdapter: SchemaAdapter = {
  addressFields: ['full_address_ar', 'full_address_en'],
  displayAddress: (row) => row.full_address_en as string,
  language: {
    detect: (q) => /[\u0600-\u06FF]/.test(q) ? 'ar' : 'en',
    fields: { ar: 'full_address_ar', en: 'full_address_en' },
    stemmers: { ar: 'arabic', en: 'porter' },
  },
  municipality: { nameField: 'district_name_en', codeField: 'district_id' },
  postcode: { pattern: /^\d{5}$/, field: 'postcode' },
}

// Italy (ANNCSU)
const anncsuAdapter: SchemaAdapter = {
  addressFields: ['ODONIMO'],
  displayAddress: (row) => 
    `${row.ODONIMO} ${row.CIVICO}${row.ESPONENTE ? ' ' + row.ESPONENTE : ''}`,
  municipality: { nameField: 'NOME_COMUNE', codeField: 'CODICE_ISTAT' },
  postcode: { pattern: /^\d{5}$/, field: 'CAP' },
}
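To show how generic code might consume an adapter without knowing any column names, here is a sketch of query routing driven only by the `SchemaAdapter` interface. `SearchPlan` and `resolveSearchPlan` are hypothetical names, and the order of checks (postcode first, then language, then all address fields) is an assumption, not the SDK's current query detection. The interface and both adapters are repeated from the proposal so the snippet stands alone.

```typescript
// SchemaAdapter as proposed above, repeated so the sketch is self-contained.
interface SchemaAdapter {
  addressFields: string[]
  displayAddress: (row: Record<string, unknown>) => string
  language?: {
    detect: (query: string) => string
    fields: Record<string, string>
    stemmers: Record<string, string>
  }
  municipality?: { nameField: string; codeField: string }
  postcode?: { pattern: RegExp; field: string }
}

// Hypothetical result type: what the search layer should query for a given input.
type SearchPlan =
  | { kind: 'postcode'; field: string; value: string }
  | { kind: 'text'; fields: string[]; stemmer: string | undefined }

function resolveSearchPlan(query: string, adapter: SchemaAdapter): SearchPlan {
  const q = query.trim()
  // 1. Postcode lookup, if the adapter declares a pattern that matches the query.
  if (adapter.postcode && adapter.postcode.pattern.test(q)) {
    return { kind: 'postcode', field: adapter.postcode.field, value: q }
  }
  // 2. Language-specific field and stemmer, if the adapter supports language detection.
  if (adapter.language) {
    const lang = adapter.language.detect(q)
    const field = adapter.language.fields[lang]
    if (field !== undefined) {
      return { kind: 'text', fields: [field], stemmer: adapter.language.stemmers[lang] }
    }
  }
  // 3. Fallback: search every declared address field with no stemmer.
  return { kind: 'text', fields: adapter.addressFields, stemmer: undefined }
}

const saudiAdapter: SchemaAdapter = {
  addressFields: ['full_address_ar', 'full_address_en'],
  displayAddress: (row) => row.full_address_en as string,
  language: {
    detect: (q) => (/[\u0600-\u06FF]/.test(q) ? 'ar' : 'en'),
    fields: { ar: 'full_address_ar', en: 'full_address_en' },
    stemmers: { ar: 'arabic', en: 'porter' },
  },
  municipality: { nameField: 'district_name_en', codeField: 'district_id' },
  postcode: { pattern: /^\d{5}$/, field: 'postcode' },
}

const anncsuAdapter: SchemaAdapter = {
  addressFields: ['ODONIMO'],
  displayAddress: (row) =>
    `${row.ODONIMO} ${row.CIVICO}${row.ESPONENTE ? ' ' + row.ESPONENTE : ''}`,
  municipality: { nameField: 'NOME_COMUNE', codeField: 'CODICE_ISTAT' },
  postcode: { pattern: /^\d{5}$/, field: 'CAP' },
}

// The same function serves both schemas; only the adapter differs.
console.log(resolveSearchPlan('00184', anncsuAdapter))
console.log(resolveSearchPlan('King Fahd Road', saudiAdapter))
```

Because `language` is optional, the ANNCSU adapter skips step 2 entirely, which is the behavior a single-language dataset would want.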

4. @tabaqat/geocoder — High-level API

const geocoder = createGeocoder({
  dataUrl: 'https://data.example.com/tiles',
  schema: anncsuAdapter,
  // Optional overrides
  h3Resolution: 5,
  maxTiles: 50,
  cacheSize: 100,
  cacheTtlMs: 5 * 60 * 1000,
})

const results = await geocoder.geocode('Roma, Via Appia 1')

What we'd contribute

  • Schema adapter interface and refactoring of field references out of the core
  • Italian ANNCSU adapter as a second real-world schema
  • Tests for the adapter pattern (ensuring Saudi behavior doesn't break)
  • Documentation for creating new country adapters
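The adapter tests could start from a shared contract check that every adapter, Saudi or otherwise, must pass against a sample row. The sketch below is one possible shape: `checkAdapter` is a hypothetical helper, the rule that every language-specific field must also appear in `addressFields` is an assumption drawn from the Saudi example, and the sample row is made up for illustration.

```typescript
// SchemaAdapter as proposed above, repeated so the sketch is self-contained.
interface SchemaAdapter {
  addressFields: string[]
  displayAddress: (row: Record<string, unknown>) => string
  language?: {
    detect: (query: string) => string
    fields: Record<string, string>
    stemmers: Record<string, string>
  }
  municipality?: { nameField: string; codeField: string }
  postcode?: { pattern: RegExp; field: string }
}

// Returns a list of human-readable problems; an empty list means the adapter passed.
function checkAdapter(adapter: SchemaAdapter, sampleRow: Record<string, unknown>): string[] {
  const problems: string[] = []
  if (adapter.addressFields.length === 0) problems.push('addressFields is empty')
  adapter.addressFields.forEach((f) => {
    if (!(f in sampleRow)) problems.push(`sample row has no "${f}" column`)
  })
  const language = adapter.language
  if (language) {
    Object.keys(language.fields).forEach((lang) => {
      // Assumption: every language-specific field is also a searchable address field.
      const field = language.fields[lang]
      if (field === undefined || adapter.addressFields.indexOf(field) === -1) {
        problems.push(`language "${lang}" maps to a field missing from addressFields`)
      }
      if (!(lang in language.stemmers)) problems.push(`language "${lang}" has no stemmer`)
    })
  }
  // Optional sections must also point at columns that exist in the data.
  const optionalFields = [
    adapter.municipality?.nameField,
    adapter.municipality?.codeField,
    adapter.postcode?.field,
  ]
  optionalFields.forEach((f) => {
    if (f !== undefined && !(f in sampleRow)) problems.push(`sample row has no "${f}" column`)
  })
  const shown = adapter.displayAddress(sampleRow)
  if (typeof shown !== 'string' || shown.trim() === '') {
    problems.push('displayAddress returned an empty value')
  }
  return problems
}

const anncsuAdapter: SchemaAdapter = {
  addressFields: ['ODONIMO'],
  displayAddress: (row) =>
    `${row.ODONIMO} ${row.CIVICO}${row.ESPONENTE ? ' ' + row.ESPONENTE : ''}`,
  municipality: { nameField: 'NOME_COMUNE', codeField: 'CODICE_ISTAT' },
  postcode: { pattern: /^\d{5}$/, field: 'CAP' },
}

// Made-up sample row, for illustration only.
const sampleRow = {
  ODONIMO: 'VIA APPIA NUOVA',
  CIVICO: '1',
  ESPONENTE: '',
  NOME_COMUNE: 'ROMA',
  CODICE_ISTAT: '058091',
  CAP: '00184',
}

console.log(checkAdapter(anncsuAdapter, sampleRow))
```

Running the same check against the Saudi adapter and a row from the current dataset would give an early signal that the refactor has not changed which columns the SDK depends on.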
