Skip to content

Migrate StarEvent from MongoDB to PostgreSQL with periodic GitHub API fetch#96

Draft
tricknotes wants to merge 45 commits intomainfrom
claude/xenodochial-engelbart
Draft

Migrate StarEvent from MongoDB to PostgreSQL with periodic GitHub API fetch#96
tricknotes wants to merge 45 commits intomainfrom
claude/xenodochial-engelbart

Conversation

@tricknotes
Copy link
Copy Markdown
Owner

@tricknotes tricknotes commented Mar 19, 2026

背景

GitHub API の変更により WatchEvent データが取得できなくなったため、
/users/{login}/starred エンドポイントから star 情報を取得する方式に変更する。
データはキャッシュ・アーカイブ目的で PostgreSQL に保存する。

主な変更

データ層

  • StarEvent / Repository を Mongoid → ActiveRecord (PostgreSQL) に移行
  • star_events テーブル: actor_login, repo_name, repo_owner, starred_at
  • repositories テーブル: stargazers_count など随時更新されるメタデータを分離
  • repo_owner カラムを追加し StarEvent.owner の LIKE クエリを廃止

フェッチ戦略

  • コントローラーでのオンデマンドフェッチを廃止
  • rake star_events:fetch タスクで全ユーザーの star を定期取得
  • コントローラーは DB に存在するデータのみを表示

削除

  • mongoid gem、config/mongoid.yml
  • lib/tasks/fetch_repositories.rake

🤖 Generated with Claude Code

@tricknotes tricknotes marked this pull request as draft March 19, 2026 18:23
tricknotes and others added 7 commits March 21, 2026 17:56
…orage

GitHub API changes made WatchEvent data unavailable. Instead of relying on
stored events, StarEvent now fetches starred repos from GitHub API
(/users/{login}/starred) on demand and persists them in PostgreSQL for
caching and archival purposes.

- Rewrite StarEvent and Repository from Mongoid to ActiveRecord
- Add fetch_and_upsert class method to StarEvent for GitHub API integration
- Store repository metadata in a separate repositories table
- Update controllers to fetch on demand before reading from DB
- Remove mongoid gem, config/mongoid.yml, and MongoDB from CI/Docker
- Remove fetch_repositories.rake (no longer needed)
- Update views, helpers, rake tasks, and test support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
alias_method doesn't work with ActiveRecord attribute methods;
use a regular method definition instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use ::OpenStruct to avoid NameError in Repository model
- Handle Time objects from Octokit in fetch_starred_since
- Add webmock and stub StarEvent.fetch_and_upsert in tests to
  prevent unintended GitHub API calls during test suite

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ruby 3.4+ requires explicit ostruct gem as it was removed from
the default gems. Revert ::OpenStruct to OpenStruct now that the
gem is properly declared.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tricknotes tricknotes force-pushed the claude/xenodochial-engelbart branch from 9d5685a to b63b05b Compare March 21, 2026 08:58
tricknotes and others added 2 commits March 23, 2026 12:27
- Update avatar_image_tag/image_link_to_github_url to handle OpenStruct
  (repo.owner) using respond_to?(:login) duck typing
- Update notify.text.erb to use event.actor_login instead of event['actor']['login']

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Restore Hash fallback in avatar_image_tag (lost during rebase merge)
- Update notify.text.erb to use event.actor_login instead of event['actor']['login']

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Claude
Copy link
Copy Markdown

Claude AI commented Mar 24, 2026

@tricknotes I've opened a new pull request, #97, to work on those changes. Once the pull request is ready, I'll request review from you.

@tricknotes tricknotes force-pushed the claude/xenodochial-engelbart branch from e90d9b3 to 0d916ef Compare March 24, 2026 15:11
@tricknotes tricknotes changed the title Replace MongoDB-based StarEvent with GitHub API fetch + PostgreSQL storage Migrate StarEvent from MongoDB to PostgreSQL with periodic GitHub API fetch Mar 24, 2026
tricknotes and others added 11 commits March 25, 2026 01:10
- Remove StarEvent.fetch_and_upsert calls from controllers
  (activities, dashboard, stars) - controllers now read from DB only
- Remove User#fetch_star_events (no longer needed)
- Add lib/tasks/fetch_star_events.rake for periodic background fetch
- Remove fetch_and_upsert stub from rails_helper (no longer needed)
- Skip Settings.url_options in test env to avoid BASE_URL port
  leaking into action_mailer.default_url_options

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The repo_* keys in the intermediate hash were not DB columns but
appeared alongside DB columns, causing confusion. Refactor so that
fetch_starred_since returns two distinct collections: star_events
(only the star_events table fields) and repos (repository fields).
upsert_repositories now receives repo data directly without needing
the repo_* prefixed keys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
all_by and by were identical — both delegate to where(actor_login:)
which accepts a single value or an array. Remove the by class method,
rename all_by scope to by, and update the one call site in User.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace LIKE-based owner scope with an equality check on the new
repo_owner column (indexed). Populate repo_owner from repo.owner.login
during fetch, and derive it from repo_name in stub_star_event! for tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Data.define provides an immutable value object with an explicit
interface, no method_missing overhead, and errors on unknown
attributes. Also removes the ostruct gem dependency since Data
is built into Ruby 3.2+.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- by_name: was used in the old Mongoid-based repository! lookup on
  StarEvent, which was removed during the PostgreSQL migration
- watchers_count: compatibility alias left over from when repositories
  were fetched directly from GitHub API responses; views access
  stargazers_count directly on the ActiveRecord model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Include repo_owner column and its index directly in the initial
create_star_events migration, removing the separate
add_repo_owner_to_star_events migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Controllers no longer call StarEvent.fetch_and_upsert directly;
GitHub API access is only triggered via the rake task, which is
never executed during the test suite. There are no stub_request
usages, so webmock provides no value.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously fetch_starred_since accumulated all pages in arrays before
upserting, causing large memory spikes when users had many starred repos.

- Replace fetch_starred_since with fetch_each_page (block/yield style)
  so each page's data is upserted immediately and GC'd
- Change User.all.to_a to User.find_each in rake task to avoid loading
  all user records into memory at once

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tricknotes tricknotes force-pushed the claude/xenodochial-engelbart branch from 5c4a542 to b1cb406 Compare March 31, 2026 03:41
tricknotes and others added 2 commits March 31, 2026 13:44
Per-page upsert (fetch_each_page) caps each thread's memory to one
page (~30 items), so FETCH_CONCURRENCY=5 keeps peak memory at ~150
API response objects — well within Heroku eco dyno limits.

- fetch_and_upsert: parallelize fetch_each_page across logins using
  Concurrent::FixedThreadPool; futures are awaited and rejections logged
- rake task: parallelize user.followings calls using the same pool size;
  logins are merged under a Mutex; remove each_slice (no longer needed)
- Thread pool is always shut down via ensure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tricknotes and others added 2 commits March 31, 2026 14:01
Previously all logins were mapped to futures at once, keeping all
Future objects and closures in memory until the entire batch completed.

- fetch_and_upsert: process logins in slices of FETCH_CONCURRENCY;
  each slice's futures are awaited before the next is started, allowing
  completed futures and their API response data to be GC'd promptly;
  GC.compact is called between slices to compact the heap
- rake task: call GC.compact after the followings phase completes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use Concurrent::AtomicFixnum for thread-safe incrementing so each
thread can log its position without data races.

  [fetch_and_upsert] (3/150) fetching @login
  [fetch_and_upsert] (3/150) @login done (1.23s)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tricknotes and others added 2 commits April 1, 2026 12:08
Introduce `StarEvent.fetch_and_upsert_graphql` which issues a single HTTP
request per GRAPHQL_BATCH_SIZE (default 20) users using GitHub's GraphQL API
with field aliases.  This reduces the number of HTTP round-trips from
N_logins to N_logins / GRAPHQL_BATCH_SIZE, making wall-clock time roughly
proportional to the batch size ratio.

Key details:
- GRAPHQL_BATCH_SIZE (env) controls users per GraphQL call (default 20)
- GRAPHQL_PAGE_SIZE = 30 matches the REST per_page default, keeping GraphQL
  point cost low while covering the common case of few new stars per window
- Users with hasNextPage: true and items still within the since window are
  queued for a REST fallback (upsert_all idempotency makes re-fetch safe)
- Uses Net::HTTP directly – no new gems required
- A companion rake task `star_events:fetch_graphql` allows side-by-side
  timing comparison with the existing `star_events:fetch` (REST) task

Also fix a pre-existing test failure: Rails merges routes.default_url_options
(which includes Puma's default port 3000) into mailer URL generation.
Adding `port: nil` to `action_mailer.default_url_options` in the test env
explicitly suppresses the port from generated URLs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement option B: each app-user's own GitHub token is used to fetch their
followings' starred events, multiplying the effective rate limit by the number
of app-users.  A new rake task `star_events:fetch_per_user` drives this path.

Design:
- One thread per app-user (bounded by FETCH_CONCURRENCY) in the outer pool
- Within each user thread, logins are processed sequentially via the new
  `StarEvent.fetch_and_upsert_per_user` method, avoiding nested thread pools
- upsert_all idempotency handles the overlap where multiple users follow the
  same person (each user fetches them with their own token)

Private repository filtering (applies to all fetch paths):
- REST path (fetch_each_page): checks repo[:private] and skips via `next`
- GraphQL path (parse_starred_edges): adds `isPrivate` to the query and
  skips private nodes before building star_events / repos hashes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fetch_and_upsert_per_user now delegates to fetch_and_upsert_graphql,
reducing HTTP round-trips from N_logins to N_logins / GRAPHQL_BATCH_SIZE
while still using each app-user's own token for their independent rate-limit
budget.

fetch_and_upsert_graphql gains an optional fallback_client: parameter so
the REST pagination fallback uses the same token context as the GraphQL
phase instead of falling back to the shared Settings.github_client.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three targeted improvements to address R15 (memory quota exceeded):

1. Reuse a single persistent HTTPS connection in fetch_and_upsert_graphql
   Previously execute_graphql created a new Net::HTTP object per batch,
   causing repeated TLS handshakes and OpenSSL context allocations that
   accumulated across N_logins/GRAPHQL_BATCH_SIZE batches before GC could
   reclaim them.  Net::HTTP.start now opens one connection for all batches,
   reducing connection objects from O(N_batches) to O(1).

2. Slice users in fetch_per_user rake task with GC.compact between slices
   find_in_batches(batch_size: FETCH_CONCURRENCY) replaces find_each with
   a growing futures array.  Each slice of FETCH_CONCURRENCY users is fully
   processed and GC.compact is called before the next slice loads, preventing
   all users' logins arrays and Sawyer response objects from coexisting in
   memory simultaneously.

3. Periodic GC.compact in fetch_and_upsert_per_user (REST path)
   Compact the heap every FETCH_CONCURRENCY iterations to release Sawyer /
   Faraday response objects that accumulate in a tight sequential loop before
   the GC gets a chance to collect them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tricknotes
Copy link
Copy Markdown
Owner Author

It's production ready 🎉

heroku run "rails star_events:fetch_per_user[24] FETCH_CONCURRENCY=20" -a   2.98s user 1.20s system 0% cpu 1:12:56.84 total

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants