refactor(core): removed old non-deterministic core engine#4
Conversation
The current engine has diverted too much from Gumtree and is in a non-deterministic state will be writing the core engine again group up
Added a new rules.lua file that basically tells the engine how to handle different node types per language. Some nodes like semicolons and commas are just noise so we ignore them, some nodes like string literals are better treated as a single unit so we flatten them, and some node types are just different names for the same thing so we alias them. Did this for c, cpp, go, javascript, typescript, python and lua.
Hooked the new rules module into preprocess_tree. Now when we walk the tree we skip ignored nodes entirely, flatten nodes that should be treated as leaves, and use canonical type names before hashing. This makes hashes a lot more stable across code that is structurally the same but written slightly differently.
Added a pre-matching step that runs before the main top-down phase. The idea is simple, use vim.diff to find lines that did not change, then match AST nodes that sit on those unchanged lines by type and column. This gives the later phases a better starting point and reduces the chance of nodes getting matched to the wrong counterpart. Based on: Matsumoto et al., Beyond GumTree: A Hybrid Approach to Generate Edit Scripts, MSR 2019.
Rewrote the top-down phase from scratch. Instead of the old parent bucket scan we now use a height queue that processes nodes tallest first. When multiple nodes share the same hash we defer them and resolve largest first to avoid bad early matches. Also added an optional function pre-matching pass that uses name similarity and child dice score to match functions before the main loop runs. Based on: Falleri and Martinez, Fine-grained Accurate and Scalable Source Code Differencing, ICSE 2024. https://doi.org/10.1145/3597503.3639148
Rewrote the bottom-up phase. Dropped the roles dependency and replaced the old name lookup candidate search with dice similarity over mapped descendants. When similarity scores tie we use child position distance as a tiebreaker. Also pulled simple_recovery out as a shared function and exposed it as _simple_recovery along with _prune_mappings so the recovery phase can reuse them. Based on: Falleri and Martinez, ICSE 2024. https://doi.org/10.1145/3597503.3639148
Replaced the old LCS worklist loop with something much simpler. We now just run simple_recovery and prune_mappings in a loop until the mapping set stops changing. The max_size threshold is gone entirely which was always a pain to tune anyway. Based on: Falleri and Martinez, ICSE 2024. https://doi.org/10.1145/3597503.3639148
The old roles module was doing a lot. Stripped it down to just what we actually need which is looking up a name for a given node kind. build_index now only records outer and name capture pairs and get_kind_name_text just walks up the ancestors to find a match. Much easier to reason about.
Rewrote generate_actions to work directly from mapping metadata instead of going through the roles and semantic machinery. Update detection now checks parent consistency and leaf labels directly. Move detection uses LIS to find reordered siblings and parent change detection for structural moves. Removed semantic.lua entirely since nothing uses it anymore.
Rewrote the hunk computation in analysis.lua. The old version had a lot of moving parts with semantic leaf changes and multiple merge and refine passes. New version just calls vim.diff with result_type indices and then does a word level diff for single line changes. Full token types like identifiers and literals get one hunk covering the whole node which is usually what you want anyway.
Added M.diff() to core.lua which wires together all the phases in order: pre_match, top_down_match, bottom_up_match, recovery_match and generate_actions. Also wrapped each phase in its own function with a cleaner signature so callers can run individual phases if they want. Accepts either positional args or an options table and returns either a flat tuple or a structured table.
Added hunks.lua with normalize_hunk and normalize_list. Basically a safety layer so rendering code does not have to defensively check every field on every hunk. Fills in missing rows and cols, infers kind from whichever side is present, and clamps zero width ranges to at least one character wide.
Plugged normalize_list into both renderer.lua and filler.lua before any hunk rendering happens. Added a guard that skips insert and delete hunks inside update actions when they overlap an explicit insert or delete action to avoid double highlighting. Switched init.lua from the old manual four step pipeline to the new core.diff() call.
There was a problem hiding this comment.
Pull request overview
This PR refactors Diffmantic’s core diff pipeline toward a new deterministic engine by introducing rule-driven Treesitter preprocessing, new prematching, and a redesigned action/hunk generation + rendering flow.
Changes:
- Replaces the previous multi-stage core API (
top_down_match/bottom_up_match/recovery_match/generate_actions) with a consolidatedcore.diff()pipeline and new prematch + recovery behavior. - Introduces language rules (
core/rules.lua) and hunk normalization (ui/hunks.lua) to stabilize hashing/hunk ranges and rendering. - Reworks action generation (
core/actions.lua) and analysis (core/analysis.lua) to rely onvim.diff-based hunks and updated rendering logic.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| lua/diffmantic/ui/renderer.lua | Normalizes hunks before rendering and avoids double-rendering update-hunks that overlap explicit insert/delete actions. |
| lua/diffmantic/ui/hunks.lua | New utility to normalize/compat ranges and hunk shape for renderer/filler consumers. |
| lua/diffmantic/ui/filler.lua | Normalizes update hunks before computing uncovered ranges and filler layout. |
| lua/diffmantic/treesitter.lua | Adds rule-driven node ignore/flatten/canonicalization during preprocessing and updates hash computation. |
| lua/diffmantic/init.lua | Switches to core.diff() returning a result table with actions/mappings/info. |
| lua/diffmantic/core/top_down.lua | Major rewrite of top-down mapping logic with optional function prematch and ambiguous handling. |
| lua/diffmantic/core/semantic.lua | Removes the old semantic leaf-diff/rename heuristics module. |
| lua/diffmantic/core/rules.lua | New per-language rules for ignored/flattened/aliased node types and label handling. |
| lua/diffmantic/core/roles.lua | Simplifies role indexing to map diff.<kind>.outer/name captures into a lightweight lookup. |
| lua/diffmantic/core/recovery.lua | Replaces prior recovery implementation with repeated passes delegating to bottom-up helpers + pruning. |
| lua/diffmantic/core/line_prematch.lua | New unchanged-line prematch to seed initial mappings. |
| lua/diffmantic/core/bottom_up.lua | Major rewrite of bottom-up mapping + recovery helpers + pruning exported for recovery pass. |
| lua/diffmantic/core/analysis.lua | Replaces semantic/token diff logic with vim.diff hunk generation and range construction helpers. |
| lua/diffmantic/core/actions.lua | Replaces action generation (insert/delete/update/move) and adds reorder-move detection. |
| lua/diffmantic/core.lua | Introduces the new public diff() orchestration and wraps matching stages + analysis enrichment. |
Comments suppressed due to low confidence (1)
lua/diffmantic/treesitter.lua:85
- This
if parent_id then ... pi.parent_id = pi.parent_id endblock is a no-op (it assigns the field to itself) and can be removed. Keeping it makes the flattening branch harder to follow and suggests there is missing logic.
if parent_id then
local pi = info[parent_id]
if pi then pi.parent_id = pi.parent_id end -- already set
end
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for src_line, dst_line in pairs(line_map) do | ||
| local src_nodes = src_by_line[src_line] or {} | ||
| local dst_nodes = dst_by_line[dst_line] or {} |
There was a problem hiding this comment.
prematch_unchanged iterates line_map using pairs(), which is not deterministic in Lua. Because this loop builds mappings (and consumes candidates via table.remove(candidates, 1)), the resulting pre-mappings can vary between runs even for identical inputs. Iterate line_map keys in a stable order (e.g., collect src_line keys into an array, sort numerically, then loop with ipairs).
lua/diffmantic/treesitter.lua
Outdated
|
|
||
| local lang = opts.lang | ||
| if not lang then | ||
| lang = vim.bo[bufnr] and vim.bo[bufnr].filetype or "" |
There was a problem hiding this comment.
Rules selection uses vim.bo[bufnr].filetype as the language key. For many Neovim setups the Treesitter language differs from filetype (e.g. typescriptreact/tsx, javascriptreact/tsx, etc.), so rules_mod.get(lang) may silently fall back to EMPTY and disable the intended canonicalization/ignores. Consider resolving the Treesitter language via vim.treesitter.language.get_lang(filetype) (as done elsewhere) or requiring callers to pass opts.lang consistently.
| lang = vim.bo[bufnr] and vim.bo[bufnr].filetype or "" | |
| local ft = vim.bo[bufnr] and vim.bo[bufnr].filetype or "" | |
| -- Prefer Treesitter language mapping when available, fall back to filetype | |
| lang = vim.treesitter.language.get_lang(ft) or ft |
| local function build_height_queue(info) | ||
| local by_height = {} | ||
| for id, v in pairs(info) do | ||
| if v.height >= MIN_HEIGHT then | ||
| by_height[v.height] = by_height[v.height] or {} | ||
| table.insert(by_height[v.height], id) | ||
| end |
There was a problem hiding this comment.
build_height_queue populates by_height by iterating pairs(info), which is unordered. Since the resulting per-height ID lists are later used to drive matching decisions, this can reintroduce non-determinism when there are multiple same-hash candidates. To keep matching deterministic, sort the ID lists per height (e.g. using node_pos_lt) before returning/consuming them.
lua/diffmantic/core/top_down.lua
Outdated
| for _, ids in pairs(src_by_h) do | ||
| for _, id in ipairs(ids) do | ||
| local h = src_info[id].height | ||
| src_open[h] = src_open[h] or {} | ||
| table.insert(src_open[h], id) | ||
| end |
There was a problem hiding this comment.
src_open is seeded by iterating pairs(src_by_h), so insertion order of IDs into src_open[h] depends on Lua hash iteration order. Because later logic preserves this order when building hash buckets, the final mappings can vary between runs. Consider iterating heights in sorted order and appending IDs in a stable order (or sorting src_open[h]/dst_open[h]).
I was checking unstable matches and noticed row/col was not enough in a few cases. This stores start_byte and end_byte from node:range(true) so later steps can sort and compare nodes more consistently.
I was seeing occasional order drift in mappings when tie cases happened. This adds deterministic tie-break sorting in prematch and top-down, and normalizes mappings in core.diff so repeated runs stay identical.
I cleaned up a few noisy action cases so updates are represented more consistently. I also removed filler/alignment wiring from the active UI path for the alpha cut, keeping rendering focused on core highlights and signs.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local lang = opts.lang | ||
| if not lang then | ||
| lang = vim.bo[bufnr] and vim.bo[bufnr].filetype or "" | ||
| end | ||
| local rules = rules_mod.get(lang) |
There was a problem hiding this comment.
preprocess_tree derives lang from vim.bo[bufnr].filetype, but Treesitter languages often differ from filetype (e.g. typescriptreact -> tsx). That will cause rules_mod.get(lang) to miss rules and change hashing/matching behavior depending on filetype. Resolve lang via vim.treesitter.language.get_lang(filetype) (with a fallback) so rule lookup matches the parser language.
| clamp_non_empty_range = function(range, lines) | ||
| if not range then | ||
| return nil | ||
| end | ||
| local first = line_value:find("%S") | ||
| if not first then | ||
| return nil | ||
| end | ||
| local rev_last = line_value:reverse():find("%S") | ||
| local last = #line_value - rev_last + 1 | ||
| return make_range(row, base_col + first - 1, base_col + last) | ||
| end | ||
|
|
||
| local function hunk_change(src_range, dst_range) | ||
| return { kind = "change", src = src_range, dst = dst_range } | ||
| end | ||
|
|
||
| local function hunk_insert(dst_range) | ||
| return { kind = "insert", src = nil, dst = dst_range } | ||
| end | ||
|
|
||
| local function hunk_delete(src_range) | ||
| return { kind = "delete", src = src_range, dst = nil } | ||
| end | ||
|
|
||
| local function range_contains(outer, inner) | ||
| if not outer or not inner then | ||
| return false | ||
| end | ||
| if outer.start_row == nil or outer.end_row == nil or outer.start_col == nil or outer.end_col == nil then | ||
| return false | ||
| end | ||
| if inner.start_row == nil or inner.end_row == nil or inner.start_col == nil or inner.end_col == nil then | ||
| return false | ||
| range.start_row = range.start_row or 0 | ||
| range.end_row = range.end_row or range.start_row | ||
| range.start_col = range.start_col or 0 | ||
| if range.end_col == nil then | ||
| local rel_end = (range.end_row - range.start_row) + 1 | ||
| range.end_col = line_len(lines or {}, rel_end) | ||
| end | ||
| if inner.start_row < outer.start_row or inner.end_row > outer.end_row then | ||
| return false | ||
| if range.start_row == range.end_row and range.end_col <= range.start_col then | ||
| range.end_col = range.start_col + 1 | ||
| end |
There was a problem hiding this comment.
clamp_non_empty_range forces end_col = start_col + 1 when end_col <= start_col. For empty lines this can push end_col past the actual line length (0), which can make extmark highlights fail or behave inconsistently. Consider representing empty-line spans using end_row = start_row + 1, end_col = 0 (or setting hl_eol = true at render time) instead of forcing a 1-column range.
lua/diffmantic/core/top_down.lua
Outdated
| local function build_maps(mappings) | ||
| local s2d, d2s = {}, {} | ||
| for _, m in ipairs(mappings) do | ||
| s2d[m.src] = m.dst | ||
| d2s[m.dst] = m.src | ||
| end | ||
| return s2d, d2s | ||
| end |
There was a problem hiding this comment.
build_maps is defined but not referenced anywhere in this module. Removing it (or using it where appropriate) would avoid dead code and keep the matching implementation easier to follow.
lua/diffmantic/treesitter.lua
Outdated
| if parent_id then | ||
| local pi = info[parent_id] | ||
| if pi then pi.parent_id = pi.parent_id end -- already set | ||
| end |
There was a problem hiding this comment.
This block is a no-op (pi.parent_id = pi.parent_id) and doesn’t affect behavior. It can be removed to reduce noise and avoid suggesting there’s some important parent-linking logic here.
| local src_role_index = roles.build_index(root1, buf1) | ||
| local dst_role_index = roles.build_index(root2, buf2) | ||
|
|
||
| local mappings, src_info, dst_info = core.top_down_match(root1, root2, buf1, buf2, { | ||
| adaptive_mode = true, | ||
| }) | ||
| -- print("Top-down mappings: " .. #mappings) | ||
|
|
||
| -- local before_bottom_up = #mappings | ||
| mappings = core.bottom_up_match(mappings, src_info, dst_info, root1, root2, buf1, buf2, { | ||
| src_role_index = src_role_index, | ||
| dst_role_index = dst_role_index, | ||
| adaptive_mode = true, | ||
| }) | ||
| -- print("Mappings after Bottom-up: " .. #mappings .. " (+" .. (#mappings - before_bottom_up) .. " new)") | ||
|
|
||
| -- local before_recovery = #mappings | ||
| local src_count = 0 | ||
| local dst_count = 0 | ||
| for _ in pairs(src_info) do | ||
| src_count = src_count + 1 | ||
| end | ||
| for _ in pairs(dst_info) do | ||
| dst_count = dst_count + 1 | ||
| end | ||
| local max_nodes = math.max(src_count, dst_count) | ||
| mappings = core.recovery_match(root1, root2, mappings, src_info, dst_info, buf1, buf2, { | ||
| recovery_lcs_cell_limit = max_nodes >= 25000 and 1500 or 6000, | ||
| adaptive_mode = true, | ||
| }) | ||
| -- debug_utils.print_recovery_mappings(mappings, before_recovery, src_info, dst_info, buf1, buf2) | ||
|
|
||
| local actions = core.generate_actions(root1, root2, mappings, src_info, dst_info, { | ||
| local result = core.diff(root1, root2, { | ||
| src_buf = buf1, | ||
| dst_buf = buf2, | ||
| src_role_index = src_role_index, | ||
| dst_role_index = dst_role_index, | ||
| adaptive_mode = true, | ||
| }) |
There was a problem hiding this comment.
src_role_index/dst_role_index are computed and passed into core.diff, but the new core pipeline doesn’t read these options anymore (no references under lua/diffmantic/core.lua or matching modules). This adds unnecessary Treesitter query work on every :Diffmantic invocation; consider removing these computations/payload until the indices are actually consumed.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 22 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
lua/diffmantic/core/actions.lua
Outdated
| metadata = { | ||
| node_type = emit_info.type, | ||
| new_name = node_label(emit_info, dst_buf), | ||
| from_line = range and (range.start_row + 1) or nil, |
| for sid, si in pairs(src_info or {}) do | ||
| if not is_source_property(si) or s2d[sid] then | ||
| goto continue_src | ||
| end | ||
| local mapped_parent = si.parent_id and s2d[si.parent_id] or nil | ||
| if not mapped_parent then | ||
| goto continue_src | ||
| end | ||
|
|
||
| local src_key = property_key_text(si, src_buf) | ||
| if src_key == "" then | ||
| goto continue_src | ||
| end | ||
|
|
||
| local best_di = nil | ||
| local best_dist = nil | ||
| for did, di in pairs(dst_info or {}) do | ||
| if used_dst[did] or d2s[did] then | ||
| goto continue_dst | ||
| end | ||
| elseif t == "update" then | ||
| local metadata = action.metadata or {} | ||
| table.insert(summary.updates, { | ||
| node_type = metadata.node_type, | ||
| from_line = metadata.from_line, | ||
| to_line = metadata.to_line, | ||
| src_range = action.src, | ||
| dst_range = action.dst, | ||
| }) | ||
| elseif t == "insert" then | ||
| local metadata = action.metadata or {} | ||
| table.insert(summary.inserts, { | ||
| node_type = metadata.node_type, | ||
| line = metadata.to_line, | ||
| dst_range = action.dst, | ||
| }) | ||
| elseif t == "delete" then | ||
| local metadata = action.metadata or {} | ||
| table.insert(summary.deletes, { | ||
| node_type = metadata.node_type, | ||
| line = metadata.from_line, | ||
| src_range = action.src, | ||
| }) | ||
| if not di or di.type ~= "pair" or di.parent_id ~= mapped_parent then | ||
| goto continue_dst | ||
| end | ||
| local dst_key = property_key_text(di, dst_buf) | ||
| if dst_key ~= src_key then | ||
| goto continue_dst | ||
| end | ||
| local dist = math.abs((si.start_row or 0) - (di.start_row or 0)) | ||
| if not best_di or dist < best_dist then | ||
| best_di = did | ||
| best_dist = dist | ||
| end | ||
| ::continue_dst:: | ||
| end |
The current engine has diverted too much from Gumtree and is in a non-deterministic state will be writing the core engine again ground up