
improve count accumulator performance by avoiding allocating new hash every time #1070

Draft
markyang-toast wants to merge 3 commits into block:main from markyang-toast:improve-count-accumulator-performance

Conversation

markyang-toast commented Mar 10, 2026

.merge allocates a new Hash on every call, while the simple ||= assignment mutates mapping in place. This reduces both CPU time (the operation is quicker) and memory pressure (no new Hash is allocated per call).
Here is a performance comparison:

•	small hash (~5 keys): ||= ~ 63 ns vs merge ~ 392 ns → ~6.2× slower
•	big hash (~100 keys): ||= ~ 56 ns vs merge ~ 1101 ns → ~19.5× slower
# frozen_string_literal: true
require "objspace"

def bench(label, n)
  GC.start
  before = GC.stat(:total_allocated_objects)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  after = GC.stat(:total_allocated_objects)

  elapsed = t1 - t0
  puts "%-35s %8.1f ns/iter  %5.2f alloc/iter" % [
    label,
    elapsed * 1e9 / n,
    (after - before).to_f / n
  ]
end

small = { "a"=>1, "b"=>2, "c"=>3, "d"=>4, "e"=>5 }.merge("type"=>"existing")
big   = (1..100).each_with_object({}) { |i,h| h["k#{i}"]=i }.merge("type"=>"existing")

n_small = 5_000_000
n_big   = 1_000_000

bench("small ||= (type present)", n_small) { h=small; i=0; while i<n_small; h["type"] ||= "nested"; i+=1; end }
bench("small merge (type present)", n_small) { h=small; x=nil; i=0; while i<n_small; x=h.merge("type"=>"nested"); i+=1; end; x }

bench("big ||= (type present)", n_big) { h=big; i=0; while i<n_big; h["type"] ||= "nested"; i+=1; end }
bench("big merge (type present)", n_big) { h=big; x=nil; i=0; while i<n_big; x=h.merge("type"=>"nested"); i+=1; end; x }


CLAassistant commented Mar 10, 2026

CLA assistant check
All committers have signed the CLA.

markyang-toast force-pushed the improve-count-accumulator-performance branch from 8b4777b to f5dbeb5 on March 10, 2026 15:59
markyang-toast marked this pull request as draft March 10, 2026 16:09
myronmarston (Collaborator) left a comment


Left some inline comments about this. Regarding the benchmark, I recommend using benchmark-ips for future benchmarks. Here's what that looks like for this case:

#!/usr/bin/env ruby
# frozen_string_literal: true

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "benchmark-ips"
end

require "benchmark/ips"

# Compares `hash.merge("type" => "nested")` (allocates new Hash each time)
# vs `hash["type"] ||= "nested"` (mutates in place, no allocation).
#
# This is the optimization applied in CountAccumulator#merge_list_counts_into.

small = {"a" => 1, "b" => 2, "c" => 3, "d" => 4, "e" => 5, "type" => "existing"}
big = (1..100).each_with_object({}) { |i, h| h["k#{i}"] = i }.merge("type" => "existing")

puts "=== Small hash (#{small.size} keys) ==="
puts

Benchmark.ips do |x|
  x.config(time: 5, warmup: 2)

  x.report("||= assign") do
    small["type"] ||= "nested"
  end

  x.report("merge") do
    small.merge("type" => "nested")
  end

  x.compare!
end

puts
puts "=== Big hash (#{big.size} keys) ==="
puts

Benchmark.ips do |x|
  x.config(time: 5, warmup: 2)

  x.report("||= assign") do
    big["type"] ||= "nested"
  end

  x.report("merge") do
    big.merge("type" => "nested")
  end

  x.compare!
end

And here are the results I got:

=== Small hash (6 keys) ===

ruby 4.0.0 (2025-12-25 revision 553f1675f3) +PRISM [arm64-darwin25]
Warming up --------------------------------------
          ||= assign     3.423M i/100ms
               merge   658.384k i/100ms
Calculating -------------------------------------
          ||= assign     33.737M (± 4.9%) i/s   (29.64 ns/i) -    171.166M in   5.090397s
               merge      6.327M (± 4.4%) i/s  (158.05 ns/i) -     31.602M in   5.004810s

Comparison:
          ||= assign: 33736984.1 i/s
               merge:  6327001.0 i/s - 5.33x  slower


=== Big hash (101 keys) ===

ruby 4.0.0 (2025-12-25 revision 553f1675f3) +PRISM [arm64-darwin25]
Warming up --------------------------------------
          ||= assign     3.015M i/100ms
               merge   265.689k i/100ms
Calculating -------------------------------------
          ||= assign     29.874M (± 2.2%) i/s   (33.47 ns/i) -    150.743M in   5.048738s
               merge      2.567M (±10.4%) i/s  (389.54 ns/i) -     12.753M in   5.031967s

Comparison:
          ||= assign: 29873750.0 i/s
               merge:  2567100.2 i/s - 11.64x  slower

Also, feel free to commit benchmarks; that's what the benchmarks directory is for!

Given that this method is called once per ingested event (and not called in any sort of tight loop), I'm not terribly concerned with the slower performance of merge.

# @implements CountAccumulator
def self.merge_list_counts_into(params, mapping:, list_counts_field_paths_for_source:)
# Assign nested type to the property if not yet assigned
mapping["type"] ||= "nested"
Collaborator

I'm concerned that this mutates the mapping argument that's passed in from the caller. I generally favor a functional "avoid mutation if possible" style of code, but am willing to relax that when the mutation is localized and easy to reason about (e.g. if mapping originated within this method, the mutation wouldn't be a problem). But I tend to draw the line when mutation crosses a caller boundary like this, particularly when it's not the point of the method. I've seen too many bugs in my career that have stemmed from this kind of thing.

If you can provide a benchmark showing a noticeable/significant difference in real-world benchmarks, I'll consider this, but a microbenchmark like you've provided is IMO not sufficient to outweigh the downsides of changing this method to have a hidden side effect. I've seen plenty of microbenchmarks that don't pan out into any real gains in practice.
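The caller-boundary concern above can be sketched in a few lines (the method names here are hypothetical, purely for illustration): the mutating version silently changes a hash the caller still holds, while the merge version leaves it untouched.

```ruby
# Mutating version: modifies the caller's hash in place.
def tag_type!(mapping)
  mapping["type"] ||= "nested"
end

# Non-mutating version: returns a new hash; the caller's copy is unchanged.
def tag_type(mapping)
  mapping.merge("type" => "nested")
end

original = { "properties" => {} }
tag_type!(original)
puts original.key?("type")  # the caller's hash was silently modified

fresh = { "properties" => {} }
tagged = tag_type(fresh)
puts fresh.key?("type")     # caller's hash unchanged
puts tagged.key?("type")    # new hash carries the change
```

This is the essence of the trade-off being discussed: the in-place form is faster precisely because it shares and mutates the caller's state.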

Author

I also have doubts here, so I'm still testing it out.
One important thing for us is actually memory allocation, so I will also benchmark RAM usage.
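For the allocation measurement, one stdlib-only approach (a sketch; exact counts vary by Ruby version and VM optimizations) is to diff the monotonic GC.stat(:total_allocated_objects) counter around a loop:

```ruby
# Count average object allocations per iteration of the given block.
# total_allocated_objects is monotonic, so GC runs during the loop
# don't skew the count.
def allocations_per_iter(n)
  GC.start
  before = GC.stat(:total_allocated_objects)
  n.times { yield }
  (GC.stat(:total_allocated_objects) - before).to_f / n
end

hash = { "a" => 1, "type" => "existing" }

merge_allocs  = allocations_per_iter(100_000) { hash.merge("type" => "nested") }
assign_allocs = allocations_per_iter(100_000) { hash["type"] ||= "nested" }

puts "merge: ~%.2f allocs/iter, ||=: ~%.2f allocs/iter" % [merge_allocs, assign_allocs]
```

Expect merge to allocate at least one new Hash per call (plus the literal argument hash), while ||= with the key already present short-circuits and allocates essentially nothing. The benchmark-memory and memory_profiler gems give more detailed per-site reports if that granularity is needed.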


3 participants