
feat: case-insensitive regex for lowercasing tokenizers#6277

Open
congx4 wants to merge 2 commits into quickwit-oss:main from congx4:cong.xie/case-insensitive-regex

Conversation

Contributor

@congx4 congx4 commented Apr 7, 2026

Summary

  • Regex queries now automatically prepend (?i) when the field's tokenizer lowercases indexed terms (e.g. default, raw_lowercase, lowercase), so patterns like .*ECONNREFUSED.* match correctly against lowercase-indexed data
  • to_field_and_regex returns a ResolvedRegex struct (with tokenizer name), and both build_tantivy_ast_impl and warmup visit_regex use TokenizerManager::tokenizer_does_lowercasing to decide whether to add the flag
  • Skips the prefix if the regex already contains (?i)
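The decision described in the summary can be sketched as follows. This is a minimal illustration, not Quickwit's actual code: the real check lives in `TokenizerManager::tokenizer_does_lowercasing`, and the tokenizer-name list below is an assumption based on the names mentioned in this PR.

```rust
// Illustrative stand-in for TokenizerManager::tokenizer_does_lowercasing;
// the set of lowercasing tokenizer names here is assumed, not exhaustive.
fn tokenizer_does_lowercasing(tokenizer_name: &str) -> bool {
    matches!(tokenizer_name, "default" | "raw_lowercase" | "lowercase")
}

/// Prepend `(?i)` when the field's tokenizer lowercases indexed terms,
/// unless the pattern already carries the flag.
fn effective_regex(regex: &str, tokenizer_name: &str) -> String {
    if tokenizer_does_lowercasing(tokenizer_name) && !regex.contains("(?i)") {
        format!("(?i){regex}")
    } else {
        regex.to_string()
    }
}

fn main() {
    // Lowercasing tokenizer: flag gets prepended.
    assert_eq!(effective_regex(".*ECONNREFUSED.*", "default"), "(?i).*ECONNREFUSED.*");
    // Already case-insensitive: not doubled.
    assert_eq!(effective_regex("(?i)foo", "lowercase"), "(?i)foo");
    // Non-lowercasing tokenizer: pattern left untouched.
    assert_eq!(effective_regex("BAR", "raw"), "BAR");
    println!("ok");
}
```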

Motivation

When Trino's Elasticsearch connector translates a SQL LIKE predicate on a text field, it sends a wildcard query to the ES-compatible API. For example:

-- Trino SQL
SELECT * FROM logs WHERE extra_fts LIKE '%ECONNREFUSED%';

In our deployment, the extra_fts field uses the datadog tokenizer, which lowercases all indexed terms. This means the inverted index only contains lowercase tokens (e.g. econnrefused), even if the original log message contained ECONNREFUSED.

The ES connector translates this to:

{"query": {"wildcard": {"extra_fts": {"value": "*ECONNREFUSED*"}}}}

The WildcardQuery in Quickwit already handles case-insensitivity correctly — it normalizes literal text through the field's tokenizer, so ECONNREFUSED becomes econnrefused before matching.

However, regex queries do not have this normalization. A case_insensitive: true ES term query gets converted to a RegexQuery:

{"query": {"term": {"extra_fts": {"value": "ECONNREFUSED", "case_insensitive": true}}}}

This becomes RegexQuery { regex: "(?i)ECONNREFUSED" } — which works because of the explicit (?i) flag.

But when a bare regex query is used without case_insensitive:

{"query": {"regexp": {"extra_fts": ".*ECONNREFUSED.*"}}}

The regex .*ECONNREFUSED.* is matched against the inverted index which only contains lowercase tokens (because the datadog tokenizer — or any other lowercasing tokenizer like default, raw_lowercase, lowercase — lowercases during indexing). The uppercase pattern never matches.

Unlike WildcardQuery, RegexQuery cannot normalize literal parts through the tokenizer because regex metacharacters are interleaved with literal text — you can't reliably separate them for normalization.

The fix: when the field's tokenizer lowercases its output, automatically prepend (?i) to make the regex match case-insensitively. This aligns regex behavior with how wildcard queries already work (just via a different mechanism).

Why the warmup must also change

Quickwit pre-warms search data by walking the query AST and loading relevant term dictionary entries. The warmup builds a regex automaton to scan the term dictionary. If the warmup uses .*ECONNREFUSED.* (case-sensitive) but the actual query uses (?i).*ECONNREFUSED.* (case-insensitive), the warmup misses the matching terms and the query returns no results. Both must apply the same (?i) logic.
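The invariant can be made concrete with a small sketch. The function names below are hypothetical stand-ins for `build_tantivy_ast_impl` and warmup's `visit_regex`; the point is only that both call sites must derive the pattern through the same logic so the warmup automaton and the query agree on which terms matter.

```rust
// Shared decision: both the query-build path and the warmup path must
// transform the pattern identically, or the warmup scans the wrong terms.
fn effective_regex(regex: &str, tokenizer_lowercases: bool) -> String {
    if tokenizer_lowercases && !regex.contains("(?i)") {
        format!("(?i){regex}")
    } else {
        regex.to_string()
    }
}

// Hypothetical stand-in for the query-build path (build_tantivy_ast_impl).
fn query_pattern(regex: &str) -> String {
    effective_regex(regex, true)
}

// Hypothetical stand-in for the warmup path (visit_regex).
fn warmup_pattern(regex: &str) -> String {
    effective_regex(regex, true)
}

fn main() {
    let q = query_pattern(".*ECONNREFUSED.*");
    let w = warmup_pattern(".*ECONNREFUSED.*");
    // Identical patterns mean the warmup automaton visits exactly the
    // term-dictionary entries the query will later need.
    assert_eq!(q, w);
    println!("ok");
}
```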

Test plan

  • Unit test: lowercasing tokenizer causes (?i) prefix (test_regex_case_insensitive_with_lowercasing_tokenizer)
  • Unit test: already-(?i) regex is not doubled (test_regex_already_case_insensitive_not_doubled)
  • Unit test: raw tokenizer does NOT get (?i) (test_regex_no_case_insensitive_with_raw_tokenizer)
  • Unit test: to_field_and_regex returns correct tokenizer name for text and JSON fields
  • Unit test: tokenizer_does_lowercasing correctly identifies lowercasing vs non-lowercasing tokenizers

🤖 Generated with Claude Code

When a field's tokenizer lowercases indexed terms (e.g. "default",
"raw_lowercase", "lowercase"), regex queries now automatically prepend
(?i) to match case-insensitively. Without this, patterns like
`.*ECONNREFUSED.*` would never match because the inverted index only
contains lowercase tokens.

Changes:
- `to_field_and_regex` now returns the tokenizer name as a 4th element
- `build_tantivy_ast_impl` and warmup `visit_regex` prepend (?i) when
  the tokenizer does lowercasing and the regex doesn't already have it
- `TokenizerManager::tokenizer_does_lowercasing` public helper added
- Unit tests for case-insensitive behavior, tokenizer detection, and
  edge cases (already-(?i), raw tokenizer, JSON fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@guilload guilload requested a review from trinity-1686a April 7, 2026 18:58
Replace the 4-element tuple return from `to_field_and_regex` with a
named `ResolvedRegex` struct to satisfy clippy::type_complexity which
is promoted to a hard error via -D warnings in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>