Realistic synthetic data generation for social services analytical workflows
This repository generates completely fictional but realistic social services case data to support the development and validation of analytical workflows in the Strategic Data Analytics (SDA) unit. The synthetic data mirrors real-world complexity while maintaining complete privacy protection and supporting rigorous algorithm testing.
Primary Objectives:
- Validation Support: Create synthetic datasets with known characteristics to test risk flagging, sentiment analysis, and pattern detection algorithms
- Workflow Testing: Provide controlled synthetic data to benchmark AI agent performance in sda-casenote-reader
- Training Data: Generate diverse client scenarios for algorithm training and refinement
- Public Service Adaptability: Provide a reusable framework for other government organizations
Target Users:
- Strategic Data Analytics (SDA) research staff (primary)
- Government researchers and academic partners (secondary)
- Other public service organizations requiring synthetic social services data
- simulation/ - Synthetic data generation engine
  - input-specifications/ - YAML configuration files
  - generation-engine/ - R scripts for data generation
  - output-datasets/ - Generated synthetic datasets
- analysis/ - Analysis and reporting workflows
- ai/ - AI assistant configuration and memory
- data-public/ - Public datasets and metadata
- data-private/ - Private/derived datasets
- Start session: show_context_status()
- Load context: Choose appropriate persona or add specific files
- Generate data: Work with simulation specifications
- Analyze results: Use analysis workflows
- Log changes: log_change('file.R', 'description')
# Configure specifications in simulation/input-specifications/
# Run generation engine scripts in simulation/generation-engine/
source('simulation/generation-engine/client-generator.R')
# Execute analysis workflows
source('analysis/eda-1/eda-1.R')
# Render Quarto reports
quarto render analysis/eda-1/eda-1.qmd --to html

Domain experts define synthetic data parameters through human-readable YAML files:
simulation/input-specifications/
├── client-profiles.yml # Client demographic patterns & risk factors
├── case-complexity-levels.yml # Service intensity & documentation patterns
├── writing-style-guides.yml # Caseworker writing style variations
└── project-scenarios/ # Project-specific testing configurations
    ├── risk-assessment-validation.yml
    └── template-scenario.yml
Modular R scripts handle different aspects of synthetic data generation:
simulation/generation-engine/
├── client-generator.R # ✅ Demographic profile generation with risk factors
├── note-generator.R # 🚧 Case note text synthesis (planned)
├── complexity-controller.R # 🚧 Case complexity orchestration (planned)
└── validation-framework.R # 🚧 Quality assurance workflows (planned)
Generated synthetic data organized for easy access and validation:
simulation/output-datasets/
├── client-profiles/ # Generated demographic data
├── case-notes/ # Generated case note text
└── validation-reports/ # Quality metrics and authenticity checks
This project includes a dynamic AI assistant with specialized personas for different types of work:
- Default - General assistance with minimal context (activated by default)
- Developer - Technical implementation focus with minimal context
- Project Manager - Strategic oversight with full project context
- Case Note Analyst - Domain expertise with specialized social services context
# Switch between personas
activate_default() # General assistance
activate_developer() # Technical focus
activate_project_manager() # Strategic oversight
activate_casenote_analyst() # Domain expertise
# Check current status
show_context_status()

The AI assistant automatically loads with the Default persona when you open the project in VS Code, providing helpful general assistance while keeping specialized context available on demand.
- R (4.0+)
- RStudio (recommended)
- Git (for version control)
- Quarto CLI (for reports)
git clone https://github.com/andkov/case-note-simulator.git
cd case-note-simulator

# Run the package installer
source('utility/install-packages.R')

# Check project setup
source('scripts/check-setup.R')

# Load the AI context system
source('ai/scripts/ai-context-management.R')
# Start with full project context
activate_project_manager()
# Check status
show_context_status()

# Load the client generator
source("./simulation/generation-engine/client-generator.R")
# Generate a test population
test_clients <- generate_client_population(n_clients = 50)
# Review the results
head(test_clients)
validation <- validate_client_population(test_clients)
print(validation)

# Export to CSV for use in analytical workflows
export_client_population(
  test_clients,
  "./simulation/output-datasets/client-profiles/test_population.csv"
)

The system generates four primary client types reflecting real-world social services populations:
| Archetype | Description | Risk Profile | Typical Duration |
|---|---|---|---|
| Stable Employment Seeker | Low-barrier clients focused on employment services | Low complexity | 3-8 months |
| Moderate Multi-Barrier | Clients with 2-3 significant challenges | Moderate complexity | 8-18 months |
| High Complexity Intensive | Multiple severe barriers requiring intensive support | High complexity | 12-36 months |
| Elderly Support Needs | Older adults (65-80) with age-related requirements | Low-moderate complexity | 6-24 months |
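
As a rough illustration of how the duration ranges in the table could drive generation, the sketch below draws an integer service duration for each client from its archetype's range. The snake_case archetype labels, the `draw_durations()` helper, and the uniform distribution are assumptions made for this example; they are not taken from client-generator.R.

```r
# Duration ranges copied from the archetype table; labels are hypothetical.
archetype_durations <- data.frame(
  archetype = c("stable_employment_seeker", "moderate_multi_barrier",
                "high_complexity_intensive", "elderly_support_needs"),
  min_months = c(3, 8, 12, 6),
  max_months = c(8, 18, 36, 24),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one whole-month duration per client,
# drawn uniformly within the client's archetype range (an assumption).
draw_durations <- function(archetype_labels, spec = archetype_durations) {
  idx <- match(archetype_labels, spec$archetype)
  stopifnot(!anyNA(idx))  # every label must exist in the spec
  lo <- spec$min_months[idx]
  hi <- spec$max_months[idx]
  lo + floor(runif(length(idx)) * (hi - lo + 1))
}

set.seed(1)
labels <- sample(archetype_durations$archetype, size = 200, replace = TRUE)
durations <- draw_durations(labels)

# Observed minimum and maximum per archetype
tapply(durations, labels, range)
```

A skewed distribution may be more realistic than a uniform one; the uniform draw is used here only to keep the example short.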
The system generates realistic patterns for key risk factors:
- Housing Instability - Homelessness, overcrowding, frequent moves
- Substance Use - Alcohol/drug challenges affecting service engagement
- Mental Health Challenges - Conditions requiring service coordination
- Criminal History - Justice system involvement affecting opportunities
- Hospital Stays - Medical complexity requiring case management
- Dependents - Children/family affecting service planning
- Employment Barriers - Skills gaps, transportation, health limitations
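
One simple way to represent the risk factors above is as independent 0/1 flags per client. The sketch below assumes that representation; the prevalence rates are invented placeholders, and the column names beyond those shown in the sample client profile (housing_instability, substance_use, mental_health_challenges, has_dependents) are hypothetical.

```r
# Placeholder prevalence rates for illustration only;
# they are NOT taken from the project's YAML specifications.
risk_prevalence <- c(
  housing_instability      = 0.30,
  substance_use            = 0.20,
  mental_health_challenges = 0.35,
  criminal_history         = 0.15,
  hospital_stays           = 0.10,
  has_dependents           = 0.40,
  employment_barriers      = 0.50
)

# Hypothetical helper: one independent Bernoulli draw per client per factor.
draw_risk_flags <- function(n_clients, prevalence = risk_prevalence) {
  flags <- lapply(prevalence, function(p) rbinom(n_clients, size = 1, prob = p))
  as.data.frame(flags)
}

set.seed(2024)
risk_flags <- draw_risk_flags(500)

# Number of risk factors present for each client
risk_flags$risk_count <- rowSums(risk_flags[names(risk_prevalence)])

# Observed share of clients with each flag
colMeans(risk_flags[names(risk_prevalence)])
```

Independent draws ignore the correlations between risk factors that real caseloads show, so a production generator would likely condition these rates on archetype.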
Synthetic case notes reflect authentic caseworker documentation patterns:
- Formal Detailed (30%) - Comprehensive, policy-compliant documentation
- Efficient Bullet (35%) - Time-efficient bullet-point style
- Conversational Narrative (25%) - Story-like, informal approach
- Clinical Precise (10%) - Medical/clinical background terminology
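
The percentages above can be treated as sampling weights when assigning a writing style to each simulated caseworker. The following is a minimal sketch under that assumption; the identifier names and the `assign_writing_styles()` helper are hypothetical and do not come from the repository.

```r
# Style shares from the list above; identifiers are hypothetical.
style_shares <- c(
  formal_detailed          = 0.30,
  efficient_bullet         = 0.35,
  conversational_narrative = 0.25,
  clinical_precise         = 0.10
)

# Hypothetical helper: draw one writing style per caseworker,
# weighted by the configured shares.
assign_writing_styles <- function(n_caseworkers, shares = style_shares, seed = NULL) {
  stopifnot(abs(sum(shares) - 1) < 1e-8)  # shares must sum to 1
  if (!is.null(seed)) set.seed(seed)
  sample(names(shares), size = n_caseworkers, replace = TRUE, prob = shares)
}

styles <- assign_writing_styles(1000, seed = 42)

# Observed proportions should sit close to the configured shares
round(prop.table(table(styles)), 2)
```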
Create targeted synthetic datasets for specific analytical testing:
scenario_name: "Risk Assessment Algorithm Validation"
total_clients: 500
validation_targets:
  housing_risk_detection: 0.95
  substance_use_flagging: 0.88
  crisis_prediction: 0.85

See risk-assessment-validation.yml for a complete example.
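
Because the synthetic data carries known ground truth, a target such as housing_risk_detection can be checked directly against an algorithm's output. The sketch below assumes each target is a minimum detection rate (the share of true cases that get flagged); that reading, the `detection_rate()` helper, and the simulated flags are all assumptions for illustration.

```r
# Targets copied from the scenario snippet above.
validation_targets <- c(
  housing_risk_detection = 0.95,
  substance_use_flagging = 0.88,
  crisis_prediction      = 0.85
)

# Hypothetical helper: share of true cases that the algorithm flagged.
detection_rate <- function(truth, flagged) {
  stopifnot(length(truth) == length(flagged))
  sum(truth == 1 & flagged == 1) / sum(truth == 1)
}

# Simulated stand-ins: `truth` plays the role of the known synthetic label,
# `flagged` the role of an algorithm's output (both invented here).
set.seed(7)
truth   <- rbinom(500, size = 1, prob = 0.3)
flagged <- ifelse(truth == 1,
                  rbinom(500, size = 1, prob = 0.96),  # mostly caught
                  rbinom(500, size = 1, prob = 0.05))  # some false alarms

rate <- detection_rate(truth, flagged)
meets_target <- rate >= validation_targets[["housing_risk_detection"]]
```

A fuller check would also report the false-positive rate, since a detector that flags every client trivially reaches a detection rate of 1.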
All synthetic data is completely fictional with systematic privacy safeguards:
- Fictional Names: No correspondence to real individuals
- Geographic Obfuscation: Realistic but fictional Alberta-like locations
- Temporal Displacement: Dates preventing correlation with real service periods
- Demographic Noise: Statistical realism while eliminating identifiability
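
To make the temporal displacement idea concrete, the sketch below shifts each date by a random, non-zero number of days. This is one possible approach, not necessarily the one the project uses; `displace_dates()` and its `max_shift_days` parameter are hypothetical.

```r
# Hypothetical helper: move each date by a random non-zero offset
# of up to `max_shift_days` in either direction.
displace_dates <- function(dates, max_shift_days = 365, seed = NULL) {
  dates <- as.Date(dates)
  if (!is.null(seed)) set.seed(seed)
  offsets <- c(seq(-max_shift_days, -1), seq(1, max_shift_days))
  shift <- sample(offsets, size = length(dates), replace = TRUE)
  dates + shift
}

original <- as.Date(c("2023-03-15", "2023-07-01", "2024-01-20"))
shifted  <- displace_dates(original, max_shift_days = 365, seed = 11)

# Offsets in days actually applied to each record
as.numeric(shifted - original)
```

Shifting every record independently scrambles the intervals between a single client's events; applying one offset per client would preserve them.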
Designed for seamless integration with sda-casenote-reader:
# Generate data for specific SDA project
scenario_clients <- generate_client_population(
  n_clients = 500,
  scenario_file = "./simulation/input-specifications/project-scenarios/risk-assessment-validation.yml"
)

# Export in SDA-compatible format
export_client_population(
  scenario_clients,
  "./simulation/testing-harness/sda_test_data.csv"
)

- Commands Reference - Essential AI system commands
- Context System - AI context management and persona system
- MCP Setup - Model Context Protocol setup instructions
- Implementation Guide - Comprehensive architecture and workflow documentation
- Simulation Overview - Synthetic data generation system overview
- Project Mission - Project purpose and epistemic goals
- Project Method - Synthetic data generation methodology
- Glossary - Social services and technical terminology
| Component | Status | Description |
|---|---|---|
| Client Generator | ✅ Complete | Demographic profiles with realistic risk factors |
| Expert Specifications | ✅ Complete | YAML templates for all major archetypes |
| Project Scenarios | ✅ Template Ready | Risk assessment validation example |
| Note Generator | 🚧 Planned | Case note text synthesis with writing variations |
| Complexity Controller | 🚧 Planned | Service intensity orchestration |
| Validation Framework | 🚧 Planned | Automated quality assurance |
- Review and customize YAML specifications in simulation/input-specifications/
- Create new project scenarios using template-scenario.yml
- Validate generated synthetic data for realism and authenticity
- Extend generation engine with additional modules
- Implement quality validation frameworks
- Enhance SDA integration capabilities
Generated client profiles include comprehensive demographic and risk information:
# Sample synthetic client profile
client_id: "SYNTH_00123"
archetype: "moderate_complexity_multi_barrier"
age: 34
education_level: "high_school"
family_composition: "single_parent"
location: "Spruce Valley"
housing_instability: 1
substance_use: 0
mental_health_challenges: 1
has_dependents: 1
case_complexity: "moderate"
estimated_duration_months: 14
intake_date: "2023-03-15"

- Use Project Manager persona for strategic work and planning
- Use Developer persona for focused coding work
- Use Case Note Analyst persona for domain expertise work
- Project memory system tracks decisions and intentions automatically
- All synthetic data is completely fictional - no real client information
- Check AI system status with show_context_status()
- Log important changes with log_change('file.R', 'description')
This public repository supports the private sda-casenote-reader project while serving as a reusable framework for other public service organizations requiring synthetic social services data for analytical workflow development and validation.