Terms in languages without word delimiters (CJK, Thai, etc.) are not detected #118

@aromarious

Description

Problem

Contextive currently relies on delimiter-based tokenisation (spaces, parentheses, etc.) and regex-based splitting (camelCase, snake_case) to identify candidate terms. This works well for languages that use spaces between words, but languages that don't use spaces to delimit words — such as Japanese, Chinese, Korean, Thai, Lao, and Khmer — cannot match glossary terms at all.

Reproduction

1. Create a glossary file:

   ```yaml
   contexts:
     - terms:
       - name: 注文
         definition: An order placed by a customer
   ```

2. Open a file containing the text 注文が届く
3. Hover over 注文 — no hover result is shown

Why this happens

The Tokeniser extracts the entire 注文が届く as a single token (no delimiters within it). The CandidateTerms regex then attempts to split it by camelCase/snake_case patterns, but since CJK characters are not in the \p{Lu} / \p{Ll} ranges used by the regex, no splitting occurs. The full string 注文が届く is looked up as-is in the index, which only contains 注文, so no match is found.
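The failure mode described above can be sketched in Python. This is a simplified model of the Tokeniser → CandidateTerms → index-lookup pipeline, not Contextive's actual implementation; stdlib `re` has no Unicode property classes, so ASCII `[A-Z]`/`[a-z]` stand in for `\p{Lu}`/`\p{Ll}` — CJK characters match neither class in either notation, so the outcome is the same:

```python
import re

# Hypothetical glossary index containing only the defined term
glossary = {"注文": "An order placed by a customer"}

def tokenise(text):
    # Delimiter-based tokenisation: split on spaces, parentheses,
    # and similar punctuation. 注文が届く contains none of these,
    # so it survives as a single token.
    return [t for t in re.split(r"[\s().,;:]+", text) if t]

def candidate_terms(token):
    # camelCase splitting. CJK characters match neither the
    # uppercase nor the lowercase class, so no split occurs and
    # the token falls through unchanged.
    parts = re.findall(r"[A-Z][a-z]*|[a-z]+", token)
    return parts or [token]

tokens = tokenise("注文が届く")
candidates = [c for t in tokens for c in candidate_terms(t)]
matches = [c for c in candidates if c in glossary]
# tokens == ["注文が届く"]; candidates == ["注文が届く"]; matches == []
```

For comparison, a camelCase identifier like `orderShipped` is split into `["order", "Shipped"]` by the same regex, which is why the existing approach works for space-delimited languages.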

Affected languages

Any language that does not use spaces between words:

| Language | Script | Example |
| --- | --- | --- |
| Japanese | Kanji / Kana | 注文が届く |
| Chinese | Hanzi | 购物车管理 |
| Korean | Hangul | 주문을처리 |
| Thai | Thai script | สวัสดีครับ |
| Lao | Lao script | ພາສາລາວ |
| Khmer | Khmer script | ភាសាខ្មែរ |
| Myanmar | Myanmar script | မြန်မာဘာသာ |
