skerk001/README.md

Samir Kerkar - Data Scientist | ML Engineer | Healthcare Analytics

M.S. Data Science @ Johns Hopkins University (incoming) · B.S. Mathematics, UC Irvine
📧 Samir2000VIP@gmail.com · 💼 LinkedIn


4+ years of production data science in healthcare: causal inference, predictive modeling, and NLP across 60,000+ patients and 20+ facilities. Built computer vision models at UCI, then moved into healthcare outcomes research: COPD cost-effectiveness, readmission modeling, and chronic disease analytics, with four ASHP conference presentations. Heart failure outcomes manuscript under peer review (n = 3,024, 11 primary care clinics).


🔬 Featured Projects

GenomicsGPT — ML + LLM pipeline for clinical variant interpretation. XGBoost/LightGBM ensemble on 1.69M ClinVar variants (AUC = 0.9949, leakage-corrected 0.985) with SHAP explainability and Llama 3 / Claude clinical report generation. Manuscript targeting Bioinformatics Advances.

ClinicalRAG — RAG system for clinical question answering over 220 discharge summaries with hallucination guardrails, citation tracking, and systematic chunking evaluation. 97.6% condition recall, 95.2% abstention accuracy.

CausalCare — Causal inference analysis of ICU beta-blocker treatment effects using propensity matching, IPW, doubly robust estimation, Double ML, and Causal Forest on eICU data via EconML/DoWhy.

REIGN NBA Analytics — Cross-era player impact metric with 4 era-specific regression models, playoff opponent adjustments, and interactive visualizations. 29,969 player-seasons across 80 years. (Research Paper)

Diabetic Retinopathy Classification — CNN-based 5-class severity grading from retinal fundus images (F1 = 0.94) with GradCAM interpretability. (Research Paper)

Gene Expression Cancer Prediction — ML classification of AML vs. ALL leukemia subtypes from 7,000+ gene expression features (F1 = 0.95).


📊 By the Numbers

  • 0.9949 AUC on 1.69M genetic variants (GenomicsGPT)
  • $83.50 PMPM cost reduction in COPD intervention analysis (p = 0.0027, n = 997)
  • 97.6% condition recall on clinical RAG system
  • 0.94 F1 on 5-class diabetic retinopathy classification
  • 0.95 F1 on gene expression cancer subtype prediction

🛠️ Tech Stack

ML/AI: Python · scikit-learn · XGBoost · LightGBM · TensorFlow/Keras · SHAP · LangChain · EconML/DoWhy · pandas · NumPy · R

LLM/NLP: Llama 3 (Ollama) · Claude API · ChromaDB · RAG pipelines · prompt engineering

Engineering: React · TypeScript · Vite · Flask · FastAPI · PostgreSQL · SQL · Git

Domain: EHR/clinical data · genomics · causal inference · healthcare analytics · sports analytics


In my free time: chess (2500+ rated), basketball, piano, and gaming.

Pinned

  1. diabetic-retinopathy-classification (Public)

    CNN-based 5-class diabetic retinopathy severity classification from retinal fundus images (F1 = 0.94)

  2. gene-cancer-prediction (Public)

    ML classification of AML vs. ALL leukemia subtypes from gene expression data (F1 = 0.95)

    Jupyter Notebook

  3. clinical-rag (Public)

    RAG system for clinical question answering over 220 discharge summaries with hallucination guardrails, citation tracking, and chunking strategy evaluation (97.6% condition recall)

    Python

  4. genomicsgpt (Public)

    ML + LLM pipeline for genetic variant pathogenicity prediction (AUC 0.9949, 1.69M ClinVar variants) with SHAP explainability and clinical report generation via Llama 3 / Claude

    Jupyter Notebook

  5. CausalCare (Public)

    Causal inference analysis of ICU beta-blocker treatment effects using propensity matching, IPW, doubly robust estimation, Double ML, and Causal Forest on eICU data

    Python

  6. reign-web (Public)

    NBA player impact analytics across 80 years. Era-specific composite models, playoff opponent adjustments, and interactive visualizations for 3,484 players (1946–2025).

    JavaScript