## Problem
Contextive currently relies on delimiter-based tokenisation (spaces, parentheses, etc.) and regex-based splitting (camelCase, snake_case) to identify candidate terms. This works well for languages that use spaces between words, but languages that don't use spaces to delimit words — such as Japanese, Chinese, Korean, Thai, Lao, and Khmer — cannot match glossary terms at all.
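For illustration, a minimal sketch of delimiter-based tokenisation (the delimiter set and function name here are hypothetical stand-ins, not Contextive's actual code):

```python
import re

# Hypothetical delimiter set for illustration; Contextive's actual
# tokeniser has its own delimiter handling.
DELIMITERS = re.compile(r"[\s(){}\[\],.;:]+")

def tokenise(text: str) -> list[str]:
    """Split text into tokens on spaces and punctuation."""
    return [t for t in DELIMITERS.split(text) if t]

print(tokenise("process(order)"))  # ['process', 'order'] -- delimiters separate words
print(tokenise("注文が届く"))       # ['注文が届く'] -- no delimiters, one undivided token
```

With space-delimited text the glossary term surfaces as its own token; with CJK text the whole phrase survives as a single token and can never equal the shorter glossary term.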
## Reproduction

- Create a glossary file:

  ```yaml
  contexts:
    - terms:
        - name: 注文
          definition: An order placed by a customer
  ```

- Open a file containing the text `注文が届く`
- Hover over `注文`: no hover result is shown
## Why this happens

The Tokeniser extracts the entire `注文が届く` as a single token (it contains no delimiters). The `CandidateTerms` regex then attempts to split it on camelCase/snake_case patterns, but because CJK characters fall outside the `\p{Lu}` / `\p{Ll}` categories the regex uses (they are `\p{Lo}`, Letter-other), no splitting occurs. The full string `注文が届く` is then looked up as-is in the index, which contains only `注文`, so no match is found.
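The failure can be reproduced in miniature. This sketch uses an illustrative camelCase splitter and a hypothetical in-memory index; neither is Contextive's actual implementation:

```python
import re

# Illustrative camelCase/snake_case splitter standing in for the
# CandidateTerms regex: it only recognises Latin-letter word parts.
WORD_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def candidate_parts(token: str) -> list[str]:
    return WORD_PART.findall(token)

index = {"注文"}  # hypothetical glossary index with one term

token = "注文が届く"
parts = candidate_parts(token) or [token]  # no split, so the whole token
print(parts)                               # ['注文が届く']
print([p for p in parts if p in index])    # [] -- lookup finds nothing
```

Compare `candidate_parts("shoppingCart")`, which yields `['shopping', 'Cart']`: Latin-script identifiers split into matchable parts, CJK text does not.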
## Affected languages

Any language that does not use spaces between words:
| Language | Script | Example |
|---|---|---|
| Japanese | Kanji / Kana | 注文が届く |
| Chinese | Hanzi | 购物车管理 |
| Korean | Hangul | 주문을처리 |
| Thai | Thai script | สวัสดีครับ |
| Lao | Lao script | ພາສາລາວ |
| Khmer | Khmer script | ភាសាខ្មែរ |
| Myanmar | Myanmar script | မြန်မာဘာသာ |