Skip to content

refactor(core): removed old non-deterministic core engine#4

Merged
HarshK97 merged 22 commits intomainfrom
rewrite/core
Mar 13, 2026
Merged

refactor(core): removed old non-deterministic core engine#4
HarshK97 merged 22 commits intomainfrom
rewrite/core

Conversation

@HarshK97
Copy link
Owner

@HarshK97 HarshK97 commented Feb 28, 2026

The current engine has diverted too much from Gumtree and is in a non-deterministic state will be writing the core engine again ground up

The current engine has diverted too much from Gumtree and is in a
non-deterministic state will be writing the core engine again group up
Added a new rules.lua file that basically tells the engine how to
handle different node types per language. Some nodes like semicolons
and commas are just noise so we ignore them, some nodes like string
literals are better treated as a single unit so we flatten them, and
some node types are just different names for the same thing so we
alias them. Did this for c, cpp, go, javascript, typescript, python
and lua.
Hooked the new rules module into preprocess_tree. Now when we walk
the tree we skip ignored nodes entirely, flatten nodes that should
be treated as leaves, and use canonical type names before hashing.
This makes hashes a lot more stable across code that is structurally
the same but written slightly differently.
Added a pre-matching step that runs before the main top-down phase.
The idea is simple, use vim.diff to find lines that did not change,
then match AST nodes that sit on those unchanged lines by type and
column. This gives the later phases a better starting point and
reduces the chance of nodes getting matched to the wrong counterpart.

Based on: Matsumoto et al., Beyond GumTree: A Hybrid Approach to
Generate Edit Scripts, MSR 2019.
Rewrote the top-down phase from scratch. Instead of the old parent
bucket scan we now use a height queue that processes nodes tallest
first. When multiple nodes share the same hash we defer them and
resolve largest first to avoid bad early matches. Also added an
optional function pre-matching pass that uses name similarity and
child dice score to match functions before the main loop runs.

Based on: Falleri and Martinez, Fine-grained Accurate and Scalable
Source Code Differencing, ICSE 2024.
https://doi.org/10.1145/3597503.3639148
Rewrote the bottom-up phase. Dropped the roles dependency and
replaced the old name lookup candidate search with dice similarity
over mapped descendants. When similarity scores tie we use child
position distance as a tiebreaker. Also pulled simple_recovery out
as a shared function and exposed it as _simple_recovery along with
_prune_mappings so the recovery phase can reuse them.

Based on: Falleri and Martinez, ICSE 2024.
https://doi.org/10.1145/3597503.3639148
Replaced the old LCS worklist loop with something much simpler. We
now just run simple_recovery and prune_mappings in a loop until the
mapping set stops changing. The max_size threshold is gone entirely
which was always a pain to tune anyway.

Based on: Falleri and Martinez, ICSE 2024.
https://doi.org/10.1145/3597503.3639148
The old roles module was doing a lot. Stripped it down to just what
we actually need which is looking up a name for a given node kind.
build_index now only records outer and name capture pairs and
get_kind_name_text just walks up the ancestors to find a match.
Much easier to reason about.
Rewrote generate_actions to work directly from mapping metadata
instead of going through the roles and semantic machinery. Update
detection now checks parent consistency and leaf labels directly.
Move detection uses LIS to find reordered siblings and parent change
detection for structural moves. Removed semantic.lua entirely since
nothing uses it anymore.
Rewrote the hunk computation in analysis.lua. The old version had a
lot of moving parts with semantic leaf changes and multiple merge and
refine passes. New version just calls vim.diff with result_type
indices and then does a word level diff for single line changes.
Full token types like identifiers and literals get one hunk covering
the whole node which is usually what you want anyway.
Added M.diff() to core.lua which wires together all the phases in
order: pre_match, top_down_match, bottom_up_match, recovery_match
and generate_actions. Also wrapped each phase in its own function
with a cleaner signature so callers can run individual phases if
they want. Accepts either positional args or an options table and
returns either a flat tuple or a structured table.
Added hunks.lua with normalize_hunk and normalize_list. Basically
a safety layer so rendering code does not have to defensively check
every field on every hunk. Fills in missing rows and cols, infers
kind from whichever side is present, and clamps zero width ranges
to at least one character wide.
Plugged normalize_list into both renderer.lua and filler.lua before
any hunk rendering happens. Added a guard that skips insert and
delete hunks inside update actions when they overlap an explicit
insert or delete action to avoid double highlighting. Switched
init.lua from the old manual four step pipeline to the new
core.diff() call.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Diffmantic’s core diff pipeline toward a new deterministic engine by introducing rule-driven Treesitter preprocessing, new prematching, and a redesigned action/hunk generation + rendering flow.

Changes:

  • Replaces the previous multi-stage core API (top_down_match/bottom_up_match/recovery_match/generate_actions) with a consolidated core.diff() pipeline and new prematch + recovery behavior.
  • Introduces language rules (core/rules.lua) and hunk normalization (ui/hunks.lua) to stabilize hashing/hunk ranges and rendering.
  • Reworks action generation (core/actions.lua) and analysis (core/analysis.lua) to rely on vim.diff-based hunks and updated rendering logic.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
lua/diffmantic/ui/renderer.lua Normalizes hunks before rendering and avoids double-rendering update-hunks that overlap explicit insert/delete actions.
lua/diffmantic/ui/hunks.lua New utility to normalize/compat ranges and hunk shape for renderer/filler consumers.
lua/diffmantic/ui/filler.lua Normalizes update hunks before computing uncovered ranges and filler layout.
lua/diffmantic/treesitter.lua Adds rule-driven node ignore/flatten/canonicalization during preprocessing and updates hash computation.
lua/diffmantic/init.lua Switches to core.diff() returning a result table with actions/mappings/info.
lua/diffmantic/core/top_down.lua Major rewrite of top-down mapping logic with optional function prematch and ambiguous handling.
lua/diffmantic/core/semantic.lua Removes the old semantic leaf-diff/rename heuristics module.
lua/diffmantic/core/rules.lua New per-language rules for ignored/flattened/aliased node types and label handling.
lua/diffmantic/core/roles.lua Simplifies role indexing to map diff.<kind>.outer/name captures into a lightweight lookup.
lua/diffmantic/core/recovery.lua Replaces prior recovery implementation with repeated passes delegating to bottom-up helpers + pruning.
lua/diffmantic/core/line_prematch.lua New unchanged-line prematch to seed initial mappings.
lua/diffmantic/core/bottom_up.lua Major rewrite of bottom-up mapping + recovery helpers + pruning exported for recovery pass.
lua/diffmantic/core/analysis.lua Replaces semantic/token diff logic with vim.diff hunk generation and range construction helpers.
lua/diffmantic/core/actions.lua Replaces action generation (insert/delete/update/move) and adds reorder-move detection.
lua/diffmantic/core.lua Introduces the new public diff() orchestration and wraps matching stages + analysis enrichment.
Comments suppressed due to low confidence (1)

lua/diffmantic/treesitter.lua:85

  • This if parent_id then ... pi.parent_id = pi.parent_id end block is a no-op (it assigns the field to itself) and can be removed. Keeping it makes the flattening branch harder to follow and suggests there is missing logic.
			if parent_id then
				local pi = info[parent_id]
				if pi then pi.parent_id = pi.parent_id end -- already set
			end

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +84 to +86
for src_line, dst_line in pairs(line_map) do
local src_nodes = src_by_line[src_line] or {}
local dst_nodes = dst_by_line[dst_line] or {}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prematch_unchanged iterates line_map using pairs(), which is not deterministic in Lua. Because this loop builds mappings (and consumes candidates via table.remove(candidates, 1)), the resulting pre-mappings can vary between runs even for identical inputs. Iterate line_map keys in a stable order (e.g., collect src_line keys into an array, sort numerically, then loop with ipairs).

Copilot uses AI. Check for mistakes.

local lang = opts.lang
if not lang then
lang = vim.bo[bufnr] and vim.bo[bufnr].filetype or ""
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rules selection uses vim.bo[bufnr].filetype as the language key. For many Neovim setups the Treesitter language differs from filetype (e.g. typescriptreact/tsx, javascriptreact/tsx, etc.), so rules_mod.get(lang) may silently fall back to EMPTY and disable the intended canonicalization/ignores. Consider resolving the Treesitter language via vim.treesitter.language.get_lang(filetype) (as done elsewhere) or requiring callers to pass opts.lang consistently.

Suggested change
lang = vim.bo[bufnr] and vim.bo[bufnr].filetype or ""
local ft = vim.bo[bufnr] and vim.bo[bufnr].filetype or ""
-- Prefer Treesitter language mapping when available, fall back to filetype
lang = vim.treesitter.language.get_lang(ft) or ft

Copilot uses AI. Check for mistakes.
Comment on lines +325 to 331
local function build_height_queue(info)
local by_height = {}
for id, v in pairs(info) do
if v.height >= MIN_HEIGHT then
by_height[v.height] = by_height[v.height] or {}
table.insert(by_height[v.height], id)
end
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_height_queue populates by_height by iterating pairs(info), which is unordered. Since the resulting per-height ID lists are later used to drive matching decisions, this can reintroduce non-determinism when there are multiple same-hash candidates. To keep matching deterministic, sort the ID lists per height (e.g. using node_pos_lt) before returning/consuming them.

Copilot uses AI. Check for mistakes.
Comment on lines 384 to 389
for _, ids in pairs(src_by_h) do
for _, id in ipairs(ids) do
local h = src_info[id].height
src_open[h] = src_open[h] or {}
table.insert(src_open[h], id)
end
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src_open is seeded by iterating pairs(src_by_h), so insertion order of IDs into src_open[h] depends on Lua hash iteration order. Because later logic preserves this order when building hash buckets, the final mappings can vary between runs. Consider iterating heights in sorted order and appending IDs in a stable order (or sorting src_open[h]/dst_open[h]).

Copilot uses AI. Check for mistakes.
HarshK97 added 3 commits March 6, 2026 10:15
I was checking unstable matches and noticed row/col was not enough in a few cases.
This stores start_byte and end_byte from node:range(true) so later steps can sort and compare nodes more consistently.
I was seeing occasional order drift in mappings when tie cases happened.
This adds deterministic tie-break sorting in prematch and top-down, and normalizes mappings in core.diff so repeated runs stay identical.
I cleaned up a few noisy action cases so updates are represented more consistently.
I also removed filler/alignment wiring from the active UI path for the alpha cut, keeping rendering focused on core highlights and signs.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +35 to +39
local lang = opts.lang
if not lang then
lang = vim.bo[bufnr] and vim.bo[bufnr].filetype or ""
end
local rules = rules_mod.get(lang)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preprocess_tree derives lang from vim.bo[bufnr].filetype, but Treesitter languages often differ from filetype (e.g. typescriptreact -> tsx). That will cause rules_mod.get(lang) to miss rules and change hashing/matching behavior depending on filetype. Resolve lang via vim.treesitter.language.get_lang(filetype) (with a fallback) so rule lookup matches the parser language.

Copilot uses AI. Check for mistakes.
Comment on lines +49 to 62
clamp_non_empty_range = function(range, lines)
if not range then
return nil
end
local first = line_value:find("%S")
if not first then
return nil
end
local rev_last = line_value:reverse():find("%S")
local last = #line_value - rev_last + 1
return make_range(row, base_col + first - 1, base_col + last)
end

local function hunk_change(src_range, dst_range)
return { kind = "change", src = src_range, dst = dst_range }
end

local function hunk_insert(dst_range)
return { kind = "insert", src = nil, dst = dst_range }
end

local function hunk_delete(src_range)
return { kind = "delete", src = src_range, dst = nil }
end

local function range_contains(outer, inner)
if not outer or not inner then
return false
end
if outer.start_row == nil or outer.end_row == nil or outer.start_col == nil or outer.end_col == nil then
return false
end
if inner.start_row == nil or inner.end_row == nil or inner.start_col == nil or inner.end_col == nil then
return false
range.start_row = range.start_row or 0
range.end_row = range.end_row or range.start_row
range.start_col = range.start_col or 0
if range.end_col == nil then
local rel_end = (range.end_row - range.start_row) + 1
range.end_col = line_len(lines or {}, rel_end)
end
if inner.start_row < outer.start_row or inner.end_row > outer.end_row then
return false
if range.start_row == range.end_row and range.end_col <= range.start_col then
range.end_col = range.start_col + 1
end
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clamp_non_empty_range forces end_col = start_col + 1 when end_col <= start_col. For empty lines this can push end_col past the actual line length (0), which can make extmark highlights fail or behave inconsistently. Consider representing empty-line spans using end_row = start_row + 1, end_col = 0 (or setting hl_eol = true at render time) instead of forcing a 1-column range.

Copilot uses AI. Check for mistakes.
Comment on lines 7 to 14
local function build_maps(mappings)
local s2d, d2s = {}, {}
for _, m in ipairs(mappings) do
s2d[m.src] = m.dst
d2s[m.dst] = m.src
end
return s2d, d2s
end
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_maps is defined but not referenced anywhere in this module. Removing it (or using it where appropriate) would avoid dead code and keep the matching implementation easier to follow.

Copilot uses AI. Check for mistakes.
Comment on lines +85 to +88
if parent_id then
local pi = info[parent_id]
if pi then pi.parent_id = pi.parent_id end -- already set
end
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is a no-op (pi.parent_id = pi.parent_id) and doesn’t affect behavior. It can be removed to reduce noise and avoid suggesting there’s some important parent-linking logic here.

Copilot uses AI. Check for mistakes.
Comment on lines 122 to 130
local src_role_index = roles.build_index(root1, buf1)
local dst_role_index = roles.build_index(root2, buf2)

local mappings, src_info, dst_info = core.top_down_match(root1, root2, buf1, buf2, {
adaptive_mode = true,
})
-- print("Top-down mappings: " .. #mappings)

-- local before_bottom_up = #mappings
mappings = core.bottom_up_match(mappings, src_info, dst_info, root1, root2, buf1, buf2, {
src_role_index = src_role_index,
dst_role_index = dst_role_index,
adaptive_mode = true,
})
-- print("Mappings after Bottom-up: " .. #mappings .. " (+" .. (#mappings - before_bottom_up) .. " new)")

-- local before_recovery = #mappings
local src_count = 0
local dst_count = 0
for _ in pairs(src_info) do
src_count = src_count + 1
end
for _ in pairs(dst_info) do
dst_count = dst_count + 1
end
local max_nodes = math.max(src_count, dst_count)
mappings = core.recovery_match(root1, root2, mappings, src_info, dst_info, buf1, buf2, {
recovery_lcs_cell_limit = max_nodes >= 25000 and 1500 or 6000,
adaptive_mode = true,
})
-- debug_utils.print_recovery_mappings(mappings, before_recovery, src_info, dst_info, buf1, buf2)

local actions = core.generate_actions(root1, root2, mappings, src_info, dst_info, {
local result = core.diff(root1, root2, {
src_buf = buf1,
dst_buf = buf2,
src_role_index = src_role_index,
dst_role_index = dst_role_index,
adaptive_mode = true,
})
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src_role_index/dst_role_index are computed and passed into core.diff, but the new core pipeline doesn’t read these options anymore (no references under lua/diffmantic/core.lua or matching modules). This adds unnecessary Treesitter query work on every :Diffmantic invocation; consider removing these computations/payload until the indices are actually consumed.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 22 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

metadata = {
node_type = emit_info.type,
new_name = node_label(emit_info, dst_buf),
from_line = range and (range.start_row + 1) or nil,
Comment on lines +632 to 665
for sid, si in pairs(src_info or {}) do
if not is_source_property(si) or s2d[sid] then
goto continue_src
end
local mapped_parent = si.parent_id and s2d[si.parent_id] or nil
if not mapped_parent then
goto continue_src
end

local src_key = property_key_text(si, src_buf)
if src_key == "" then
goto continue_src
end

local best_di = nil
local best_dist = nil
for did, di in pairs(dst_info or {}) do
if used_dst[did] or d2s[did] then
goto continue_dst
end
elseif t == "update" then
local metadata = action.metadata or {}
table.insert(summary.updates, {
node_type = metadata.node_type,
from_line = metadata.from_line,
to_line = metadata.to_line,
src_range = action.src,
dst_range = action.dst,
})
elseif t == "insert" then
local metadata = action.metadata or {}
table.insert(summary.inserts, {
node_type = metadata.node_type,
line = metadata.to_line,
dst_range = action.dst,
})
elseif t == "delete" then
local metadata = action.metadata or {}
table.insert(summary.deletes, {
node_type = metadata.node_type,
line = metadata.from_line,
src_range = action.src,
})
if not di or di.type ~= "pair" or di.parent_id ~= mapped_parent then
goto continue_dst
end
local dst_key = property_key_text(di, dst_buf)
if dst_key ~= src_key then
goto continue_dst
end
local dist = math.abs((si.start_row or 0) - (di.start_row or 0))
if not best_di or dist < best_dist then
best_di = did
best_dist = dist
end
::continue_dst::
end
@HarshK97 HarshK97 marked this pull request as ready for review March 13, 2026 11:52
@HarshK97 HarshK97 merged commit d65dd9f into main Mar 13, 2026
@HarshK97 HarshK97 deleted the rewrite/core branch March 13, 2026 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants