Skip to content

Add PUF + source impute modules, fix AGI ceiling (issue #530)#537

Open
baogorek wants to merge 9 commits intomainfrom
puf-impute-fix-530
Open

Add PUF + source impute modules, fix AGI ceiling (issue #530)#537
baogorek wants to merge 9 commits intomainfrom
puf-impute-fix-530

Conversation

@baogorek
Copy link
Collaborator

@baogorek baogorek commented Feb 16, 2026

Summary

Fixes #530 — extracts PUF cloning and source imputation into standalone modules, replacing inline logic in extended_cps.py.

  • puf_impute.py — PUF clone + batched QRF with state_fips as predictor. Uses stratified subsample (20K target) that force-includes top 0.5% by AGI, preserving the high-income tail. Includes weeks_unemployed imputation for PUF half.
  • source_impute.py — ACS re-imputation with state predictor. SIPP/SCF imputation without state (surveys lack state identifiers).
  • extended_cps.py refactor — 443 → 75 lines, delegates to puf_clone_dataset().
  • unified_calibration.py — New --puf-dataset, --skip-puf, --skip-source-impute flags; _build_puf_cloned_dataset() helper; pipeline: geography → source impute → PUF clone → double_geography_for_puf() → takeup rerandomize → matrix build → L0 calibrate.
  • 21 new tests for both modules (mock data, skip_qrf=True path).

Also fixes #539 — sets the environment variable DC_STATEHOOD=1 so that -us recognizes DC as a state, including it in every file that pulls all states throughout the data pipeline.

Notes

I'm seeing a $400M max AGI in ExtendedCPS_2024, per the manual test below.

Test plan

  • pytest policyengine_us_data/tests/test_calibration/ — 63/63 pass (21 new + 42 existing)
  • black . -l 79 --check — formatting clean
  • Manual: ExtendedCPS_2024().generate() and verify max AGI > $10M

🤖 Generated with Claude Code

@MaxGhenis
Copy link
Contributor

Why assign and impute from random states for the SCF and SIPP?

@MaxGhenis
Copy link
Contributor

Can you mark as draft until we can merge

@baogorek baogorek marked this pull request as draft February 16, 2026 15:55
@baogorek
Copy link
Collaborator Author

@MaxGhenis good call out on the random state simulation for SIPP and SCF. The argument was "predictor consistency" but they were already different predictor sets. So the state simulation is out now.

Another thing I added was stratified sampling of the top .5% incomes in the PUF (and uniformly sampling the rest) for the purposes of imputation. Ignoring the weights shouldn't be a problem statistically since we just want a good P(Y | X) and thus we just need a wide range of Xs (we're not sampling proportional to residual size or something).

To push things along, I also added you to as a reviewer to the prerequisite PR #531 alongside Pavel. With your development, my cherry picking and refinements, and then Pavel's feedback, I think it's probably pretty close. But it is a large and very important PR.

Base automatically changed from census-block-calibration to main February 17, 2026 15:25
baogorek and others added 2 commits February 17, 2026 11:08
…530)

Adds puf_impute.py and source_impute.py from PR #516 (by @MaxGhenis),
refactors extended_cps.py to delegate to the new modules, and integrates
both into the unified calibration pipeline. The core fix removes the
subsample(10_000) call that dropped high-income PUF records before QRF
training, which caused a hard AGI ceiling at ~$6.26M after uprating.

Co-Authored-By: Max Ghenis <mghenis@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ibration loader

Replace weight-proportional PUF subsample with stratified approach that
force-includes top 0.5% by AGI and randomly samples rest to 20K, preserving
the high-income tail the QRF needs. Remove random state assignment from SIPP
and SCF in source_impute.py since these surveys lack state identifiers. Fix
unified_calibration.py to handle TIME_PERIOD_ARRAYS dataset format. Add
`make calibrate` target.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test read extended_cps.py source and asserted string presence of
functions that moved to puf_impute.py. Behavioral tests retained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stratified subsample (not all records), ACS-only state predictor
(not SIPP/SCF), remove unverified AGI ceiling claims.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek and others added 2 commits February 17, 2026 18:46
The aca_ptc model produces suppressed values (~$70bn vs $98bn target),
so reweighting alone cannot close the gap. Worst state is ~53% error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DC has zero CPS households, so simulated values are always $0 (100%
error) regardless of reweighting. Skip it in both calibration tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek
Copy link
Collaborator Author

I'm putting this back in draft status until this issue is solved. #539

DC has no representation!

baogorek and others added 2 commits February 17, 2026 22:26
DC is entirely missing from block_cd_distributions.csv.gz due to a
bug in the us library integration (see #539). Removing the skip so
CI correctly surfaces the failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@juaristi22 juaristi22 marked this pull request as ready for review February 18, 2026 10:39
@baogorek
Copy link
Collaborator Author

baogorek commented Feb 18, 2026

Thanks @juaristi22 for getting this ready to go. I had claude already going on it and here's what it said about your fix:

Ha — she took a completely different (and arguably cleaner) approach. Instead of patching each script individually, she set DC_STATEHOOD=1 in
  storage/__init__.py, which tells the us library to include DC in STATES_AND_TERRITORIES globally. That means us.states.lookup("DC") works and DC shows up in
   the state loop — no per-script fixes needed. 

Nice work.

Update: I did add one more commit to make it explicit: + [us.states.DC] in the state lists and a getattr fallback in fetch_block_population. Works regardless of import order. (So at least my claude session wasn't totally wasted)

The `us` library v3.2.0 excludes DC from STATES_AND_TERRITORIES, so three
scripts silently skipped it. Add `+ [us.states.DC]` to the state lists in
make_block_cd_distributions.py and make_block_crosswalk.py, and add a
getattr fallback in fetch_block_population so DC lookups succeed.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants