Add calibration package checkpointing, target config, and hyperparameter CLI#538
…530) Adds puf_impute.py and source_impute.py from PR #516 (by @MaxGhenis), refactors extended_cps.py to delegate to the new modules, and integrates both into the unified calibration pipeline. The core fix removes the subsample(10_000) call that dropped high-income PUF records before QRF training, which caused a hard AGI ceiling at ~$6.26M after uprating. Co-Authored-By: Max Ghenis <mghenis@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ibration loader Replace weight-proportional PUF subsample with stratified approach that force-includes top 0.5% by AGI and randomly samples rest to 20K, preserving the high-income tail the QRF needs. Remove random state assignment from SIPP and SCF in source_impute.py since these surveys lack state identifiers. Fix unified_calibration.py to handle TIME_PERIOD_ARRAYS dataset format. Add `make calibrate` target. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ter CLI - Add build-only mode to save calibration matrix as pickle package - Add target config YAML for declarative target exclusion rules - Add CLI flags for beta, lambda_l2, learning_rate hyperparameters - Add streaming subprocess output in Modal runner - Add calibration pipeline documentation - Add tests for target config filtering and CLI arg parsing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
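For readers following the commit notes, here is a minimal sketch of the stratified subsample described above: keep every record in the top 0.5% by AGI, then fill the remainder with a uniform random draw. The function name, the `seed` argument, and reading "20K" as the total sample size are assumptions for illustration, not the code in `puf_impute.py`.

```python
import numpy as np


def stratified_subsample_idx(agi, target_n=20_000, top_pct=0.5, seed=0):
    """Return sorted row indices for a tail-preserving subsample.

    Hypothetical helper: all rows in the top `top_pct` percent by AGI are
    kept, and the rest are sampled uniformly without replacement until the
    total reaches `target_n`.
    """
    agi = np.asarray(agi, dtype=float)
    n = agi.size
    if n <= target_n:
        # Nothing to drop: the data already fits in the target size.
        return np.arange(n)

    # Rows at or above this cutoff are always included.
    cutoff = np.percentile(agi, 100 - top_pct)
    top_idx = np.flatnonzero(agi >= cutoff)
    rest_idx = np.flatnonzero(agi < cutoff)

    # Fill the remaining slots from the non-tail rows.
    n_rest = min(max(target_n - top_idx.size, 0), rest_idx.size)
    rng = np.random.default_rng(seed)
    sampled = rng.choice(rest_idx, size=n_rest, replace=False)
    return np.sort(np.concatenate([top_idx, sampled]))
```

Because the tail rows are selected deterministically, the highest-AGI records always reach QRF training regardless of the random seed, which is the property the plain `subsample(10_000)` call lacked.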
| raw_data = source_sim.dataset.load_dataset() | ||
| data_dict = {} | ||
| for var in raw_data: | ||
| data_dict[var] = {2024: raw_data[var][...]} |
this fails when trying to run calibration because load_dataset() returns dicts, not h5py datasets
| data_dict[var] = {2024: raw_data[var][...]} | |
| if isinstance(raw_data[var], dict): | |
|     vals = list(raw_data[var].values()) | |
|     data_dict[var] = {2024: vals[0]} | |
| else: | |
|     data_dict[var] = {2024: np.array(raw_data[var])} |
| dataset_path=dataset_path, | ||
| puf_dataset_path=puf_dataset_path, | ||
| state_fips=base_states, | ||
| time_period=2024, |
should we make this and other instances where time_period=2024 is hardcoded flexibly derive the time period from the dataset?
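One possible shape for this, as a sketch only: assuming the loaded data maps each variable to a dict keyed by period (as in the TIME_PERIOD_ARRAYS case discussed above), the period could be read off the keys instead of being hardcoded. `infer_time_period` is a hypothetical helper, not an existing function in the repo.

```python
import numpy as np


def infer_time_period(raw_data):
    """Return the single period shared by all period-keyed variables.

    Hypothetical helper. Assumes period-keyed variables look like
    {period: array}; variables stored as plain arrays are skipped.
    """
    periods = set()
    for values in raw_data.values():
        if isinstance(values, dict):
            # Keys may be ints or strings depending on how the file was read.
            periods.update(int(p) for p in values)
    if len(periods) != 1:
        raise ValueError(
            f"expected exactly one time period, found {sorted(periods)}"
        )
    return periods.pop()


# Usage with toy data (not the real dataset):
raw = {"age": {2024: np.zeros(3)}, "income": {"2024": np.ones(3)}}
time_period = infer_time_period(raw)
```

Raising when more than one period is present keeps the failure loud rather than silently picking the first key.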
| @@ -1 +1 @@ | |||
| """Non-PUF QRF imputations with state_fips as predictor. | |||
Claude recommends adding new files like source_impute.py and puf_impute.py to the __init__.py file; probably wouldn't hurt, though not urgent
| - `storage/calibration/unified_diagnostics.csv` --- per-target error report | ||
| - `storage/calibration/unified_run_config.json` --- full run configuration | ||
|
|
||
| ### 2. Build-then-fit (recommended for iteration) |
would we want to support this option for the Modal runner as well? I think the Modal runner is currently not wired to save the calibration package, so this could only be used for local / Kaggle notebook builds
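For context, the build-then-fit split comes down to a pickle round trip, sketched below. The package keys (`matrix`, `target_names`) and both helper names are assumptions for illustration; the actual package contents are not shown in this thread.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def save_package(path, package):
    """Write the (hypothetical) calibration package dict to disk."""
    with open(path, "wb") as f:
        pickle.dump(package, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_package(path):
    """Read a package back; only do this for files you produced yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with toy contents standing in for the expensive matrix build.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "package.pkl"
    save_package(path, {"matrix": np.eye(2), "target_names": ["a", "b"]})
    loaded = load_package(path)
```

Wiring this into the Modal runner would additionally need somewhere persistent to write the file, since `pickle.load` can execute arbitrary code and the package should only ever come from a trusted build step.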
| Person-level state FIPS array. | ||
| """ | ||
| hh_ids_person = data.get("person_household_id", {}).get(time_period) | ||
| if hh_ids_person is not None: |
will person_household_id ever not be available?
the fallback assumes every household has the same number of people and could lead to wrong state assignments, but we might be able to get rid of it altogether, if we can safely assume that person_household_id will always be in the data
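To illustrate the concern, here is a sketch of the lookup-based assignment next to the equal-size fallback. `person_household_id` is the name from the diff; `household_id`, `household_state_fips`, and the function name are assumed for the example.

```python
import numpy as np


def person_state_fips(household_id, household_state_fips, person_household_id):
    """Map each person to their household's state via an id lookup."""
    household_id = np.asarray(household_id)
    household_state_fips = np.asarray(household_state_fips)
    person_household_id = np.asarray(person_household_id)

    # Sort household ids once so each person can be located by binary search.
    order = np.argsort(household_id)
    sorted_ids = household_id[order]
    pos = np.searchsorted(sorted_ids, person_household_id)
    pos = np.clip(pos, 0, sorted_ids.size - 1)
    if not np.array_equal(sorted_ids[pos], person_household_id):
        raise ValueError("person_household_id has ids missing from household_id")
    return household_state_fips[order][pos]


# Three households with 2, 1, and 3 people respectively.
hh_ids = np.array([30, 10, 20])
hh_states = np.array([6, 36, 48])
person_hh = np.array([10, 10, 20, 30, 30, 30])

by_lookup = person_state_fips(hh_ids, hh_states, person_hh)
# Equal-size fallback: repeats each household's state a fixed number of times.
by_repeat = np.repeat(hh_states, person_hh.size // hh_ids.size)
```

With unequal household sizes the two arrays disagree, which is why dropping the fallback (or raising when `person_household_id` is absent) may be safer than keeping it.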
Minor comments, but generally LGTM. I was also able to run the calibration job in Modal (after removing the ellipsis in unified_calibration.py)!
Small note: if I'm not mistaken, this PR addresses issue #534. It seems #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?
Fixes #533
Fixes #534
Summary
- `--build-only` saves the expensive matrix build as a pickle; `--package-path` loads it for fast re-fitting with different hyperparameters or target sets
- Declarative target exclusion rules (`target_config.yaml`) replace hardcoded target filtering; the checked-in config reproduces the junkyard's 22 excluded groups
- `--beta`, `--lambda-l2`, and `--learning-rate` are now tunable from the command line and the Modal runner
- `docs/calibration.md` covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)

Test plan
- `pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py`: CLI arg parsing tests
- `pytest policyengine_us_data/tests/test_calibration/test_target_config.py`: target config filtering + package round-trip tests
- `make calibrate-build` produces the package; `--package-path` loads it and fits

🤖 Generated with Claude Code
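As a rough illustration of the CLI surface listed in the summary, the flags could be declared with `argparse` as below. The flag names come from this PR; the `None` defaults, help strings, and the `build_parser` name are assumptions, and the real parser in `unified_calibration.py` may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser covering the flags named in the PR summary."""
    parser = argparse.ArgumentParser(description="Calibration CLI sketch")
    parser.add_argument(
        "--build-only",
        action="store_true",
        help="build the calibration matrix, save the package, then exit",
    )
    parser.add_argument(
        "--package-path",
        default=None,
        help="load a previously saved package instead of rebuilding",
    )
    # Defaults are left as None here; the real defaults are not shown above.
    parser.add_argument("--beta", type=float, default=None)
    parser.add_argument("--lambda-l2", type=float, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    return parser


# argparse converts dashes to underscores in attribute names.
args = build_parser().parse_args(["--build-only", "--lambda-l2", "1e-8"])
```

Leaving the hyperparameter defaults as `None` lets downstream code tell "not supplied" apart from an explicit value, which is one way to keep the CLI and Modal runner defaults in a single place.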