
feat: case-insensitive regex for lowercasing tokenizers#6277

Open
congx4 wants to merge 2 commits into quickwit-oss:main from congx4:cong.xie/case-insensitive-regex

Conversation

Contributor

@congx4 congx4 commented Apr 7, 2026

Summary

  • Regex queries now automatically prepend (?i) when the field's tokenizer lowercases indexed terms (e.g. default, raw_lowercase, lowercase), so patterns like .*ECONNREFUSED.* match correctly against lowercase-indexed data
  • to_field_and_regex returns a ResolvedRegex struct (with tokenizer name), and both build_tantivy_ast_impl and warmup visit_regex use TokenizerManager::tokenizer_does_lowercasing to decide whether to add the flag
  • Skips the prefix if the regex already contains (?i)
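The decision described in the summary can be sketched as follows. This is a minimal illustration, not Quickwit's actual code: the real check lives in `TokenizerManager::tokenizer_does_lowercasing`, and the tokenizer-name list below is an assumption based on the names mentioned in this PR.

```rust
// Illustrative stand-in for TokenizerManager::tokenizer_does_lowercasing;
// the set of lowercasing tokenizer names here is assumed, not exhaustive.
fn tokenizer_does_lowercasing(tokenizer_name: &str) -> bool {
    matches!(tokenizer_name, "default" | "raw_lowercase" | "lowercase")
}

/// Prepend `(?i)` when the field's tokenizer lowercases indexed terms,
/// unless the pattern already carries the flag.
fn effective_regex(regex: &str, tokenizer_name: &str) -> String {
    if tokenizer_does_lowercasing(tokenizer_name) && !regex.contains("(?i)") {
        format!("(?i){regex}")
    } else {
        regex.to_string()
    }
}

fn main() {
    // Lowercasing tokenizer: flag gets prepended.
    assert_eq!(effective_regex(".*ECONNREFUSED.*", "default"), "(?i).*ECONNREFUSED.*");
    // Already case-insensitive: not doubled.
    assert_eq!(effective_regex("(?i)foo", "lowercase"), "(?i)foo");
    // Non-lowercasing tokenizer: pattern left untouched.
    assert_eq!(effective_regex("BAR", "raw"), "BAR");
    println!("ok");
}
```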

Motivation

When Trino's Elasticsearch connector translates a SQL LIKE predicate on a text field, it sends a wildcard query to the ES-compatible API. For example:

-- Trino SQL
SELECT * FROM logs WHERE extra_fts LIKE '%ECONNREFUSED%';

In our deployment, the extra_fts field uses the datadog tokenizer, which lowercases all indexed terms. This means the inverted index only contains lowercase tokens (e.g. econnrefused), even if the original log message contained ECONNREFUSED.

The ES connector translates this to:

{"query": {"wildcard": {"extra_fts": {"value": "*ECONNREFUSED*"}}}}

The WildcardQuery in Quickwit already handles case-insensitivity correctly — it normalizes literal text through the field's tokenizer, so ECONNREFUSED becomes econnrefused before matching.

However, regex queries do not have this normalization. A case_insensitive: true ES term query gets converted to a RegexQuery:

{"query": {"term": {"extra_fts": {"value": "ECONNREFUSED", "case_insensitive": true}}}}

This becomes RegexQuery { regex: "(?i)ECONNREFUSED" } — which works because of the explicit (?i) flag.

But when a bare regex query is used without case_insensitive:

{"query": {"regexp": {"extra_fts": ".*ECONNREFUSED.*"}}}

The regex .*ECONNREFUSED.* is matched against the inverted index which only contains lowercase tokens (because the datadog tokenizer — or any other lowercasing tokenizer like default, raw_lowercase, lowercase — lowercases during indexing). The uppercase pattern never matches.

Unlike WildcardQuery, RegexQuery cannot normalize literal parts through the tokenizer because regex metacharacters are interleaved with literal text — you can't reliably separate them for normalization.

The fix: when the field's tokenizer lowercases its output, automatically prepend (?i) to make the regex match case-insensitively. This aligns regex behavior with how wildcard queries already work (just via a different mechanism).

Why the warmup must also change

Quickwit pre-warms search data by walking the query AST and loading relevant term dictionary entries. The warmup builds a regex automaton to scan the term dictionary. If the warmup uses .*ECONNREFUSED.* (case-sensitive) but the actual query uses (?i).*ECONNREFUSED.* (case-insensitive), the warmup misses the matching terms and the query returns no results. Both must apply the same (?i) logic.
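The invariant can be made concrete with a small sketch. The function names below are hypothetical stand-ins for `build_tantivy_ast_impl` and warmup's `visit_regex`; the point is only that both call sites must derive the pattern through the same logic so the warmup automaton and the query agree on which terms matter.

```rust
// Shared decision: both the query-build path and the warmup path must
// transform the pattern identically, or the warmup scans the wrong terms.
fn effective_regex(regex: &str, tokenizer_lowercases: bool) -> String {
    if tokenizer_lowercases && !regex.contains("(?i)") {
        format!("(?i){regex}")
    } else {
        regex.to_string()
    }
}

// Hypothetical stand-in for the query-build path (build_tantivy_ast_impl).
fn query_pattern(regex: &str) -> String {
    effective_regex(regex, true)
}

// Hypothetical stand-in for the warmup path (visit_regex).
fn warmup_pattern(regex: &str) -> String {
    effective_regex(regex, true)
}

fn main() {
    let q = query_pattern(".*ECONNREFUSED.*");
    let w = warmup_pattern(".*ECONNREFUSED.*");
    // Identical patterns mean the warmup automaton visits exactly the
    // term-dictionary entries the query will later need.
    assert_eq!(q, w);
    println!("ok");
}
```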

Test plan

  • Unit test: lowercasing tokenizer causes (?i) prefix (test_regex_case_insensitive_with_lowercasing_tokenizer)
  • Unit test: already-(?i) regex is not doubled (test_regex_already_case_insensitive_not_doubled)
  • Unit test: raw tokenizer does NOT get (?i) (test_regex_no_case_insensitive_with_raw_tokenizer)
  • Unit test: to_field_and_regex returns correct tokenizer name for text and JSON fields
  • Unit test: tokenizer_does_lowercasing correctly identifies lowercasing vs non-lowercasing tokenizers

🤖 Generated with Claude Code

When a field's tokenizer lowercases indexed terms (e.g. "default",
"raw_lowercase", "lowercase"), regex queries now automatically prepend
(?i) to match case-insensitively. Without this, patterns like
`.*ECONNREFUSED.*` would never match because the inverted index only
contains lowercase tokens.

Changes:
- `to_field_and_regex` now returns the tokenizer name as a 4th element
- `build_tantivy_ast_impl` and warmup `visit_regex` prepend (?i) when
  the tokenizer does lowercasing and the regex doesn't already have it
- `TokenizerManager::tokenizer_does_lowercasing` public helper added
- Unit tests for case-insensitive behavior, tokenizer detection, and
  edge cases (already-(?i), raw tokenizer, JSON fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@guilload guilload requested a review from trinity-1686a April 7, 2026 18:58
Replace the 4-element tuple return from `to_field_and_regex` with a
named `ResolvedRegex` struct to satisfy clippy::type_complexity which
is promoted to a hard error via -D warnings in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>