feat: case-insensitive regex for lowercasing tokenizers #6277
Open
congx4 wants to merge 2 commits into quickwit-oss:main
When a field's tokenizer lowercases indexed terms (e.g. "default", "raw_lowercase", "lowercase"), regex queries now automatically prepend (?i) to match case-insensitively. Without this, patterns like `.*ECONNREFUSED.*` would never match because the inverted index only contains lowercase tokens.

Changes:
- `to_field_and_regex` now returns the tokenizer name as a 4th element
- `build_tantivy_ast_impl` and warmup `visit_regex` prepend (?i) when the tokenizer does lowercasing and the regex doesn't already have it
- `TokenizerManager::tokenizer_does_lowercasing` public helper added
- Unit tests for case-insensitive behavior, tokenizer detection, and edge cases (already-(?i), raw tokenizer, JSON fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the 4-element tuple return from `to_field_and_regex` with a named `ResolvedRegex` struct to satisfy clippy::type_complexity which is promoted to a hard error via -D warnings in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
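
For readers unfamiliar with the tuple-to-struct refactor this commit describes, here is a minimal sketch. The actual `ResolvedRegex` definition is not shown on this page, so the field names, the field types, and the `resolve_regex` helper below are all assumptions made for illustration only.

```rust
/// Hypothetical shape of the named return type; field names and types are
/// assumptions for illustration, not the definitions from this PR.
#[derive(Debug, Clone, PartialEq)]
struct ResolvedRegex {
    field_name: String,
    json_path: Option<String>,
    regex: String,
    tokenizer_name: Option<String>,
}

/// Hypothetical stand-in for `to_field_and_regex`, with schema lookup omitted.
fn resolve_regex(field_name: &str, regex: &str, tokenizer_name: Option<&str>) -> ResolvedRegex {
    ResolvedRegex {
        field_name: field_name.to_string(),
        json_path: None,
        regex: regex.to_string(),
        tokenizer_name: tokenizer_name.map(str::to_string),
    }
}

fn main() {
    let resolved = resolve_regex("extra_fts", ".*ECONNREFUSED.*", Some("default"));
    // Callers read fields by name instead of by tuple position.
    assert_eq!(resolved.field_name, "extra_fts");
    assert_eq!(resolved.json_path, None);
    assert_eq!(resolved.regex, ".*ECONNREFUSED.*");
    assert_eq!(resolved.tokenizer_name.as_deref(), Some("default"));
    println!("{:?}", resolved);
}
```

Named fields keep call sites readable when a new element such as the tokenizer name is added, and they avoid a long tuple type in the function signature.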
Summary

- Regex queries now automatically prepend `(?i)` when the field's tokenizer lowercases indexed terms (e.g. `default`, `raw_lowercase`, `lowercase`), so patterns like `.*ECONNREFUSED.*` match correctly against lowercase-indexed data
- `to_field_and_regex` returns a `ResolvedRegex` struct (with tokenizer name), and both `build_tantivy_ast_impl` and warmup `visit_regex` use `TokenizerManager::tokenizer_does_lowercasing` to decide whether to add the `(?i)` flag

Motivation

When Trino's Elasticsearch connector translates a SQL `LIKE` predicate on a text field, it sends a wildcard query to the ES-compatible API.

In our deployment, the `extra_fts` field uses the `datadog` tokenizer, which lowercases all indexed terms. This means the inverted index only contains lowercase tokens (e.g. `econnrefused`), even if the original log message contained `ECONNREFUSED`.

The ES connector translates the `LIKE` predicate to:

    {"query": {"wildcard": {"extra_fts": {"value": "*ECONNREFUSED*"}}}}

The `WildcardQuery` in Quickwit already handles case-insensitivity correctly: it normalizes literal text through the field's tokenizer, so `ECONNREFUSED` becomes `econnrefused` before matching.

However, regex queries do not have this normalization. A `case_insensitive: true` ES term query gets converted to a `RegexQuery`:

    {"query": {"term": {"extra_fts": {"value": "ECONNREFUSED", "case_insensitive": true}}}}

This becomes `RegexQuery { regex: "(?i)ECONNREFUSED" }`, which works because of the explicit `(?i)` flag.

But when a bare regex query is used without `case_insensitive`:

    {"query": {"regexp": {"extra_fts": ".*ECONNREFUSED.*"}}}

The regex `.*ECONNREFUSED.*` is matched against the inverted index, which only contains lowercase tokens (because the `datadog` tokenizer, like any other lowercasing tokenizer such as `default`, `raw_lowercase`, or `lowercase`, lowercases during indexing). The uppercase pattern never matches.

Unlike `WildcardQuery`, `RegexQuery` cannot normalize literal parts through the tokenizer because regex metacharacters are interleaved with literal text, so the literals can't be reliably separated for normalization.

The fix: when the field's tokenizer lowercases its output, automatically prepend `(?i)` to make the regex match case-insensitively. This aligns regex behavior with how wildcard queries already work (just via a different mechanism).

Why the warmup must also change

Quickwit pre-warms search data by walking the query AST and loading relevant term dictionary entries. The warmup builds a regex automaton to scan the term dictionary. If the warmup uses `.*ECONNREFUSED.*` (case-sensitive) but the actual query uses `(?i).*ECONNREFUSED.*` (case-insensitive), the warmup misses the matching terms and the query returns no results. Both must apply the same `(?i)` logic.

Test plan

- Regex on a field with a lowercasing tokenizer gets the `(?i)` prefix (`test_regex_case_insensitive_with_lowercasing_tokenizer`)
- Already-`(?i)` regex is not doubled (`test_regex_already_case_insensitive_not_doubled`)
- `raw` tokenizer does NOT get `(?i)` (`test_regex_no_case_insensitive_with_raw_tokenizer`)
- `to_field_and_regex` returns correct tokenizer name for text and JSON fields
- `tokenizer_does_lowercasing` correctly identifies lowercasing vs non-lowercasing tokenizers

🤖 Generated with Claude Code
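
The first three cases in the test plan, plus the requirement that the query path and the warmup agree, can be sketched with a small standalone helper. This is a simplified illustration, not the code from the PR: the free function `tokenizer_does_lowercasing` stands in for the `TokenizerManager` method, its list of names is limited to the built-in tokenizers mentioned above, `effective_regex` is a hypothetical name, and detecting an existing flag with `starts_with("(?i)")` is an assumption about how "already has it" is checked.

```rust
/// Hypothetical stand-in for `TokenizerManager::tokenizer_does_lowercasing`.
/// The name list is illustrative, taken from the tokenizers named in this PR.
fn tokenizer_does_lowercasing(tokenizer_name: &str) -> bool {
    matches!(tokenizer_name, "default" | "raw_lowercase" | "lowercase")
}

/// Hypothetical helper: prepends `(?i)` when the tokenizer lowercases and the
/// pattern does not already start with the flag (assumed detection rule).
fn effective_regex(regex: &str, tokenizer_name: &str) -> String {
    if tokenizer_does_lowercasing(tokenizer_name) && !regex.starts_with("(?i)") {
        format!("(?i){}", regex)
    } else {
        regex.to_string()
    }
}

fn main() {
    // Lowercasing tokenizer: the flag is added.
    assert_eq!(
        effective_regex(".*ECONNREFUSED.*", "default"),
        "(?i).*ECONNREFUSED.*"
    );
    // Already case-insensitive: the flag is not doubled.
    assert_eq!(effective_regex("(?i)ECONNREFUSED", "default"), "(?i)ECONNREFUSED");
    // `raw` does not lowercase, so the pattern is left untouched.
    assert_eq!(effective_regex(".*ECONNREFUSED.*", "raw"), ".*ECONNREFUSED.*");

    // Query building and warmup derive their pattern from the same helper,
    // so the automaton used for warmup matches the one used for the query.
    let query_pattern = effective_regex(".*ECONNREFUSED.*", "lowercase");
    let warmup_pattern = effective_regex(".*ECONNREFUSED.*", "lowercase");
    assert_eq!(query_pattern, warmup_pattern);
}
```

Routing both paths through one function is the simplest way to keep them in sync. If the warmup computed its pattern separately, a later change to the flag logic could silently reintroduce the empty-result bug described above.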