
feat(graphile-search): @searchConfig, chunk querying, validation, integration tests & codegen docs (Phases D-E, H-J) #851

Merged
pyramation merged 3 commits into main from
devin/1773910840-search-config-smart-tags
Mar 19, 2026



@pyramation pyramation commented Mar 19, 2026

Summary

Adds several features to graphile-search and enhances codegen docs, spanning Phases D, E, H, I, and J:

Phase D — Per-table @searchConfig smart tag: The composite searchScore field now reads per-table configuration from codec.extensions.tags.searchConfig (written by DataSearch/DataFullTextSearch/DataBm25 in constructive-db PR #622). Per-table config overrides the global searchScoreWeights. Supports configurable normalization strategy (linear vs sigmoid) and a recency boost (exponential decay). The previously duplicated normalization logic was extracted into reusable normalizeScore() and applyRecencyBoost() helpers, and the dual unweighted/weighted code paths were unified into a single weighted-average path (defaulting weight=1).
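The scoring path described for Phase D can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the `AdapterScore` shape, the function signatures, and the way the recency decay is applied (multiplying by `exp(-decay * ageInDays)`) are assumptions. Only the helper names, the `1 / (1 + Math.abs(score))` sigmoid form, and the default weight of 1 come from this PR. Per-adapter score polarity is deliberately left out.

```typescript
// Hypothetical signatures; the real helpers in graphile-search may differ.
type NormalizationStrategy = 'linear' | 'sigmoid';

interface AdapterScore {
  value: number;            // raw score from one search adapter
  weight?: number;          // per-table or global weight, defaults to 1
  range?: [number, number]; // known [min, max] for bounded adapters
}

// Map a raw score into [0, 1].
function normalizeScore(
  value: number,
  strategy: NormalizationStrategy,
  range?: [number, number],
): number {
  if (strategy === 'linear' && range && range[1] > range[0]) {
    const [min, max] = range;
    return Math.min(1, Math.max(0, (value - min) / (max - min)));
  }
  // 'sigmoid' strategy, or the fallback for adapters without a known range.
  return 1 / (1 + Math.abs(value));
}

// Assumed form of the boost: scale by exp(-decay * age in days).
function applyRecencyBoost(
  score: number,
  recency: Date | null,
  decay: number,
  now: Date = new Date(),
): number {
  if (recency === null) return score;
  const ageDays = Math.max(0, now.getTime() - recency.getTime()) / 86400000;
  return score * Math.exp(-decay * ageDays);
}

// Single weighted-average path; a missing weight counts as 1.
function combineScores(
  scores: AdapterScore[],
  strategy: NormalizationStrategy,
): number {
  let weighted = 0;
  let total = 0;
  for (const s of scores) {
    const w = s.weight ?? 1;
    weighted += w * normalizeScore(s.value, strategy, s.range);
    total += w;
  }
  return total === 0 ? 0 : weighted / total;
}
```

Passing the recency value in directly (rather than a row and a field name) keeps `applyRecencyBoost` independent of how the value was selected.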

Phase E — Transparent chunk querying: The pgvector adapter now reads @hasChunks smart tags from codec extensions. When present, buildFilterApply generates a LEAST(parent_distance, closest_chunk_distance) subquery that transparently searches both parent and chunks tables. A new includeChunks boolean field is added to VectorNearbyInput (defaults to true when chunks exist; set false to skip chunk search).
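To make the shape of the chunk-aware distance expression concrete, here is a simplified sketch that assembles it as a plain SQL string. The real adapter presumably builds SQL fragments through the `sql` helper from the Graphile build object rather than by string concatenation; the function name `buildDistanceSql`, the `$1` placeholder, the alias names, and the `parentFkField`/`embeddingField` property names are assumptions made for illustration.

```typescript
// Hypothetical shape; only chunksSchema and parentPkField are named in this PR.
interface ChunksInfo {
  chunksTable: string;
  chunksSchema?: string;
  parentFkField: string;
  parentPkField: string;
  embeddingField: string;
}

// Quote a PostgreSQL identifier, doubling any embedded double quotes.
const quoteIdent = (name: string): string => `"${name.replace(/"/g, '""')}"`;

// Build the distance expression for one parent row.
// parentAlias is assumed to be a safe, already-valid SQL alias.
function buildDistanceSql(
  parentAlias: string,
  parentEmbedding: string,
  chunks: ChunksInfo | null,
  includeChunks: boolean,
): string {
  const parentDistance = `${parentAlias}.${quoteIdent(parentEmbedding)} <=> $1`;
  if (!chunks || !includeChunks) return parentDistance;

  const table = chunks.chunksSchema
    ? `${quoteIdent(chunks.chunksSchema)}.${quoteIdent(chunks.chunksTable)}`
    : quoteIdent(chunks.chunksTable);

  // Correlated subquery: distance to the closest chunk of this parent row.
  const closestChunk =
    `(SELECT MIN(c.${quoteIdent(chunks.embeddingField)} <=> $1) ` +
    `FROM ${table} AS c ` +
    `WHERE c.${quoteIdent(chunks.parentFkField)} = ` +
    `${parentAlias}.${quoteIdent(chunks.parentPkField)})`;

  return `LEAST(${parentDistance}, ${closestChunk})`;
}
```

Because PostgreSQL's `LEAST` ignores NULL arguments, a parent row with no chunks (where `MIN` yields NULL) falls back to its own distance.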

Phase I — Schema-time validation of recency field: If @searchConfig specifies boost_recency_field but that column doesn't exist on the table's codec attributes, recency boost is now disabled gracefully with a console.warn instead of crashing at query time with a column-not-found error.
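The schema-time check can be illustrated with a small sketch. The `CodecLike` and `SearchConfigTag` types and the `validateRecency` name are hypothetical stand-ins; the only details taken from this PR are the `boost_recency_field` key, the lookup against the codec's attributes, and the `console.warn` fallback.

```typescript
// Hypothetical minimal shapes for illustration only.
interface SearchConfigTag {
  boost_recency_field?: string;
}

interface CodecLike {
  name: string;
  attributes: Record<string, unknown>;
}

interface RecencyResolution {
  boostRecent: boolean;
  recencyField: string | null;
}

// Decide at schema build time whether recency boost can be enabled.
function validateRecency(
  codec: CodecLike,
  config: SearchConfigTag,
  warn: (message: string) => void = (m) => console.warn(m),
): RecencyResolution {
  const field = config.boost_recency_field;
  if (!field) return { boostRecent: false, recencyField: null };

  if (!Object.prototype.hasOwnProperty.call(codec.attributes, field)) {
    // Disable gracefully instead of failing later with a column-not-found error.
    warn(
      `@searchConfig on "${codec.name}": boost_recency_field "${field}" ` +
        'does not exist; recency boost disabled.',
    );
    return { boostRecent: false, recencyField: null };
  }
  return { boostRecent: true, recencyField: field };
}
```

Injecting `warn` as a parameter is only a convenience for testing the sketch; the PR describes a direct `console.warn`.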

Phase J — Integration tests for @searchConfig and @hasChunks: New search-config-integration.test.ts with tests against a real PostgreSQL database. A custom makeTestSmartTagsPlugin injects JSON smart tags programmatically during schema build (since @searchConfig and @hasChunks are JSON objects that can't be set via SQL COMMENT ON). Tests cover: per-table weights, recency boost, combined search, sigmoid normalization, chunk-aware querying (LEAST of parent + chunks), includeChunks: false toggle, and validation of missing recency fields.
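The tag-injection approach might look roughly like the sketch below. It uses local stand-in types instead of importing the real Graphile packages, so the plugin object shape and the `build.input.pgRegistry.pgCodecs` path are approximations that should be checked against the actual test file; only the plugin name, the `init` hook, and the idea of mutating codec extensions come from this PR.

```typescript
// Local stand-ins for the Graphile types (assumptions, not the real API surface).
type Tags = Record<string, unknown>;

interface CodecLike {
  name: string;
  extensions?: { tags?: Tags };
}

interface BuildLike {
  input: { pgRegistry: { pgCodecs: Record<string, CodecLike> } };
}

// Returns a plugin-shaped object whose init hook merges JSON tags into codecs.
function makeTestSmartTagsPlugin(tagsByCodec: Record<string, Tags>) {
  return {
    name: 'TestSmartTagsPlugin',
    version: '0.0.0',
    schema: {
      hooks: {
        init<T>(value: T, build: BuildLike): T {
          const codecs = build.input.pgRegistry.pgCodecs;
          for (const key of Object.keys(codecs)) {
            const codec = codecs[key];
            if (!codec) continue;
            const tags = tagsByCodec[codec.name];
            if (!tags) continue;
            // Mutate in place so later hooks read the injected tags.
            codec.extensions = codec.extensions ?? {};
            codec.extensions.tags = { ...(codec.extensions.tags ?? {}), ...tags };
          }
          return value;
        },
      },
    },
  };
}
```

Existing tags on a codec are preserved; injected keys win on conflict.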

Phase H — Codegen docs for chunk-aware tables: categorizeSpecialFields() in docs-utils.ts now checks the TypeRegistry for a VectorNearbyInput type with an includeChunks field. When present, generated docs for tables with embedding fields mention chunk-aware search capability and the includeChunks option.
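A rough sketch of the capability check follows. The registry is modelled here as a `Map` of simplified type records, which is an assumption; the actual TypeRegistry structure in the codegen package may differ, and `describeEmbeddingGroup` is a hypothetical helper added only to show where the result would be used.

```typescript
// Simplified stand-ins for the codegen TypeRegistry (assumed shape).
interface InputFieldLike {
  name: string;
}

interface TypeLike {
  kind: string;
  inputFields?: InputFieldLike[];
}

type TypeRegistryLike = Map<string, TypeLike>;

// True when the schema exposes VectorNearbyInput.includeChunks.
function hasIncludeChunksCapability(registry: TypeRegistryLike): boolean {
  const input = registry.get('VectorNearbyInput');
  if (!input || !input.inputFields) return false;
  return input.inputFields.some((field) => field.name === 'includeChunks');
}

// Hypothetical helper: extend the embedding group description when supported.
function describeEmbeddingGroup(base: string, registry: TypeRegistryLike): string {
  if (!hasIncludeChunksCapability(registry)) return base;
  return (
    `${base} Supports chunk-aware search; ` +
    'set includeChunks: false to search the parent embedding only.'
  );
}
```

Because the check looks at a schema-wide input type, it returns the same answer for every table with embedding fields.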

Updates since last revision (auto-review feedback)

Addressed all auto-review feedback from initial Phases D-E:

  1. Recency boost fixed: applyRecencyBoost now accepts the recency value directly instead of trying to read from row[fieldName]. The recency field is injected into the SQL SELECT via $select.selectAndReturnIndex() and accessed by numeric index at runtime (matching how score values are retrieved).
  2. Parent PK field configurable: Added parentPkField to ChunksInfo (read from @hasChunks tag's parentPk, defaults to 'id'). No longer hardcoded.
  3. Schema-qualified chunks table: Added chunksSchema to ChunksInfo. Resolves via: explicit chunksSchema in tag → parent codec's pg.schemaName → unqualified. Chunk query builds schema.table reference when schema is present.
  4. Normalization clarified: sigmoid strategy forces sigmoid for ALL adapters (bounded and unbounded). linear (default) uses linear for known-range adapters and sigmoid as fallback for unbounded. Documented in docstring.
  5. Build fix: Added sql to destructuring from build in GraphQLObjectType_fields hook (was causing TS2304 compilation error).
  6. Tests updated: All 15 unit tests updated for new signatures; added tests for schema inheritance and schema-qualified chunk queries.
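Items 2 and 3 above, together with the documented tag defaults, can be summarized in a small resolver sketch. The `resolveChunksInfo` name, the `CodecLike` shape, the `codec.extensions.pg.schemaName` path, and the `parentFkField`/`embeddingField` property names are assumptions; the defaults (`parent_id`, `id`, `embedding`) and the schema resolution order are taken from this PR.

```typescript
// @hasChunks tag as described in this PR's notes.
interface HasChunksTag {
  chunksTable: string;
  chunksSchema?: string;
  parentFk?: string;
  parentPk?: string;
  embeddingField?: string;
}

// Resolved form; property names other than chunksSchema/parentPkField are assumed.
interface ChunksInfo {
  chunksTable: string;
  chunksSchema: string | null;
  parentFkField: string;
  parentPkField: string;
  embeddingField: string;
}

// Hypothetical minimal codec shape.
interface CodecLike {
  extensions?: {
    tags?: { hasChunks?: HasChunksTag };
    pg?: { schemaName?: string };
  };
}

function resolveChunksInfo(codec: CodecLike): ChunksInfo | null {
  const tag = codec.extensions?.tags?.hasChunks;
  if (!tag || !tag.chunksTable) return null;
  return {
    chunksTable: tag.chunksTable,
    // Explicit tag value, then the parent codec's schema, then unqualified.
    chunksSchema: tag.chunksSchema ?? codec.extensions?.pg?.schemaName ?? null,
    parentFkField: tag.parentFk ?? 'parent_id',
    parentPkField: tag.parentPk ?? 'id',
    embeddingField: tag.embeddingField ?? 'embedding',
  };
}

// Render "schema"."table" when a schema is known, otherwise just "table".
function chunksTableRef(info: ChunksInfo): string {
  const quote = (s: string): string => `"${s.replace(/"/g, '""')}"`;
  return info.chunksSchema
    ? `${quote(info.chunksSchema)}.${quote(info.chunksTable)}`
    : quote(info.chunksTable);
}
```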

Updates since last revision (Phases H, I, J)

  1. Phase I — Recency field validation: plugin.ts now checks codec.attributes[boostRecencyField] at schema build time. If the field doesn't exist, emits console.warn and sets boostRecent = false.
  2. Phase J — Integration tests: New search-config-integration.test.ts (~630 lines) with 6 test suites running against real PostgreSQL. Uses a custom Graphile plugin to inject JSON smart tags on test codecs during the init hook.
  3. Phase H — Codegen docs: Added hasIncludeChunksCapability() helper that inspects TypeRegistry for VectorNearbyInput.includeChunks. When detected, the embedding SpecialFieldGroup description includes chunk-aware search documentation in generated README/AGENTS/skills docs.

Review & Testing Checklist for Human

  • Chunk-aware docs are global, not per-table. Phase H checks whether VectorNearbyInput has includeChunks in the TypeRegistry, but this is a schema-wide type — ALL tables with embedding fields will show chunk-aware documentation, including tables that don't actually have a @hasChunks tag. Verify whether this is acceptable or if per-table detection is needed.
  • Chunk correlated subquery performance on large tables. The SELECT MIN(chunk.embedding <=> vector) FROM chunks WHERE chunks.parent_fk = parent.pk subquery runs per parent row. On tables with many chunks per parent, this relies entirely on proper indexing (HNSW/IVFFlat on the chunks table). Confirm that embedding_chunks auto-creation in PR #619 creates the needed index.
  • Sigmoid normalization comment may be misleading. The code comment says "BM25: negative scores, more negative = better" but the 1 / (1 + Math.abs(score)) formula maps larger absolute values to lower normalized scores. If BM25 actually returns positive distance-like scores (lower = better), the math is correct but the comment is wrong. Verify the actual score polarity from the BM25 adapter.
  • Integration test smart tag injection approach. The makeTestSmartTagsPlugin mutates codec extensions in the init hook. This works but is a non-standard way to apply smart tags — in production, tags come from SQL COMMENT ON or metaschema_public.table.tags. Verify the injection happens early enough that all downstream plugins see the tags consistently.
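For the sigmoid normalization item in the checklist above, a few worked values show the polarity issue. This sketch only evaluates the formula quoted from the code comment; it makes no claim about what the BM25 adapter actually returns.

```typescript
// The formula quoted in the review checklist.
const sigmoidNormalize = (score: number): number => 1 / (1 + Math.abs(score));

// If "more negative = better" were true, -10 should rank above -0.5.
const weakMatch = sigmoidNormalize(-0.5);  // 1 / 1.5, about 0.667
const strongMatch = sigmoidNormalize(-10); // 1 / 11, about 0.091

// The formula ranks the larger-magnitude score lower, regardless of sign.
const ranksLargerMagnitudeLower = strongMatch < weakMatch;
const signIsIgnored = sigmoidNormalize(-2) === sigmoidNormalize(2);
```

So the formula behaves correctly only if smaller magnitude means a better match, which is the case the checklist asks to verify.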

Suggested test plan:

  1. Run pnpm test in graphile/graphile-search to execute both unit and integration tests.
  2. Deploy a schema with DataSearch + DataEmbedding + embedding_chunks from PRs #619/#622, then run codegen and inspect the generated README for embedding fields; confirm the chunk-aware description appears when includeChunks is present.
  3. Test recency validation: create a @searchConfig with boost_recency_field: "nonexistent" — verify console.warn fires and no query-time crash.

Notes

  • This PR is paired with constructive-db PR #622 which implements the SQL-side DataFullTextSearch, DataBm25, and DataSearch node types that write the @searchConfig smart tag.
  • The @hasChunks smart tag structure expected: { chunksTable, chunksSchema?, parentFk?, parentPk?, embeddingField? } with defaults for parentFk (parent_id), parentPk (id), and embeddingField (embedding). Schema defaults to parent codec's pg.schemaName if not explicit.
  • Phase F (DataPostGIS node type) is in a separate constructive-db PR #623.

Link to Devin session: https://app.devin.ai/sessions/e57604f3fc7c4e3d87e78c75a00cca23
Requested by: @pyramation

…t chunk querying

Phase D: Per-table search score customization via @searchConfig smart tag
- Read @searchConfig from codec.extensions.tags (written by DataSearch/DataFullTextSearch/DataBm25)
- Per-table weights override global searchScoreWeights
- Configurable normalization strategy (linear vs sigmoid)
- Recency boost with configurable field and decay rate
- Extracted normalizeScore() and applyRecencyBoost() as reusable helpers

Phase E: Transparent chunk querying in pgvector adapter
- Detect @hasChunks smart tag on codecs with chunk table metadata
- Generate chunk-aware SQL using LEAST(parent_distance, closest_chunk_distance)
- Added includeChunks field to VectorNearbyInput (default true for chunk tables)
- enableChunkQuerying option on PgvectorAdapterOptions (default true)

Tests: 13 new unit tests covering chunk detection, filter generation, and config parsing

…docs (Phases I+J+H)

Phase I: Validate boost_recency_field exists on codec before injecting into SQL.
  Gracefully disables recency boost with console.warn if field is missing.

Phase J: Integration tests for @searchConfig and @hasChunks smart tags.
  Custom makeTestSmartTagsPlugin injects tags programmatically during schema build.
  Tests cover per-table weights, recency boost, chunk-aware queries, and validation.

Phase H: Enhance codegen docs for chunk-aware embedding tables.
  Detects VectorNearbyInput.includeChunks in TypeRegistry and adds chunk-aware
  search documentation to embedding field groups in generated docs.
@devin-ai-integration devin-ai-integration bot changed the title feat(graphile-search): per-table @searchConfig smart tag & transparent chunk querying (Phases D-E) feat(graphile-search): @searchConfig, chunk querying, validation, integration tests & codegen docs (Phases D-E, H-J) Mar 19, 2026
@pyramation pyramation merged commit 16b988e into main Mar 19, 2026
44 checks passed
@pyramation pyramation deleted the devin/1773910840-search-config-smart-tags branch March 19, 2026 23:37