Predicting student academic performance using 7 ML models on 30+ environmental and personal features
A machine learning analysis that predicts student academic grades from environmental, demographic, and lifestyle factors. It compares 7 regression models, ranging from linear regression to deep neural networks, and uses K-Means clustering for outlier detection. The data is the Kaggle Student Performance dataset (395 students, 32 features).
- Can ML predict student performance based on environmental and personal factors?
- Do parental education levels significantly affect student performance?
- How do personal habits and health impact academic grades?
| Model | Description |
|---|---|
| Linear Regression | Baseline regression model |
| Ridge Regression | L2-regularized linear model |
| Decision Tree | Non-linear tree-based regression |
| SVR | Support Vector Regression (linear kernel) |
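As a rough illustration of the four classical models in the table above, the sketch below instantiates them with scikit-learn and fits each one on synthetic stand-in data. The hyperparameters (`alpha`, `max_depth`, `random_state`) and the synthetic data are assumptions made for this example; only the linear SVR kernel and the L2 penalty come from the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hyperparameters are illustrative assumptions, not the notebook's settings.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),           # L2-regularized
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "SVR": SVR(kernel="linear"),                    # linear kernel, as in the table
}

# Synthetic stand-in data: 100 samples, 5 features already in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = X @ np.array([4.0, 2.0, 0.0, 1.0, 3.0]) + rng.normal(0, 0.1, 100)

# Fit every model on the same data and collect its predictions.
predictions = {name: model.fit(X, y).predict(X) for name, model in models.items()}
```

Keeping the models in one dictionary makes it straightforward to run the same training and evaluation loop over all of them.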

| Model | Description |
|---|---|
| Basic Neural Network | 2-layer Dense with ReLU |
| Improved Neural Network | 3-layer with BatchNormalization |
| Advanced Neural Network | 3-layer with BatchNormalization + Dropout |

| Model | Description |
|---|---|
| K-Means Clustering | Outlier detection and student segmentation |
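One common way to use K-Means for outlier detection is to flag rows whose distance to their own cluster centroid exceeds a threshold. The sketch below shows that idea on synthetic data. The helper name `kmeans_outlier_mask`, the number of clusters, and the percentile fallback are assumptions for illustration. The value 1.956 is the threshold this README reports, reused here only as an example; how the notebook derives it is not stated.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outlier_mask(X, n_clusters=3, threshold=None, percentile=95.0, seed=0):
    """Return (mask, distances, labels); mask is True for flagged rows.

    Hypothetical helper: n_clusters and the percentile fallback are assumptions.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # Distance from every row to the centroid of the cluster it was assigned to.
    distances = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    if threshold is None:
        threshold = np.percentile(distances, percentile)
    return distances > threshold, distances, km.labels_

# Synthetic data: three tight blobs plus one planted outlier in the last row.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
blobs = np.vstack([c + rng.normal(0, 0.2, size=(50, 2)) for c in centers])
X = np.vstack([blobs, [[10.0, -10.0]]])

mask, distances, labels = kmeans_outlier_mask(X, threshold=1.956)
X_clean = X[~mask]  # rows kept after removing flagged outliers
```

A fixed threshold only makes sense for a given feature scaling, so on differently scaled data the percentile fallback may be the safer default.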
Source: Kaggle - Student Performance Data
Size: 395 students, 32 features
Key Feature Categories:
- Demographics — Age, sex, family size, parental status
- Education — Mother's/Father's education level, school support, study time
- Lifestyle — Free time, going out, alcohol consumption, health status
- Academic — Past failures, absences, travel time
Target: Average grade (mean of G1, G2, G3 grading periods)
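The target definition above can be sketched in pandas as a row-wise mean of the three grade columns. The column name `avg_grade` and the helper `add_average_grade` are hypothetical names chosen for this example, and the toy rows are made up.

```python
import pandas as pd

def add_average_grade(df, grade_cols=("G1", "G2", "G3"), target="avg_grade"):
    """Return a copy of df with the row-wise mean of the grade columns added."""
    out = df.copy()
    out[target] = out[list(grade_cols)].mean(axis=1)
    return out

# Toy rows standing in for the real dataset.
toy = pd.DataFrame({"G1": [10, 15, 6], "G2": [12, 14, 5], "G3": [11, 16, 7]})
result = add_average_grade(toy)
```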
- Preprocessing — Categorical to numerical conversion, MinMax normalization
- EDA — Distribution analysis, correlation heatmap, feature-target scatter plots
- Outlier Detection — K-Means clustering with distance-based threshold (1.956)
- Model Training — 80/20 train-test split across all 7 models
- Evaluation — MAE, MSE, RMSE, R² Score, Explained Variance
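The preprocessing, split, and evaluation steps above might look roughly like the following sketch. It runs on a small synthetic table and uses linear regression as a stand-in model. The helper names, the integer-code encoding, and `random_state=42` are assumptions. Fitting the scaler on the training split only is a choice made here to avoid leaking test statistics; the notebook may normalize before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def encode_categoricals(df):
    """Replace every non-numeric column with integer category codes."""
    out = df.copy()
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].astype("category").cat.codes
    return out

def evaluate(y_true, y_pred):
    """Compute the five regression metrics listed in the README."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
        "Explained Variance": explained_variance_score(y_true, y_pred),
    }

# Synthetic stand-in for the dataset (values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "studytime": rng.integers(1, 5, size=n),
    "absences": rng.integers(0, 30, size=n),
})
toy["avg_grade"] = (8 + 2 * toy["studytime"] - 0.1 * toy["absences"]
                    + rng.normal(0, 0.5, n))

X = encode_categoricals(toy.drop(columns="avg_grade"))
y = toy["avg_grade"].to_numpy()

# 80/20 train-test split, then MinMax scaling fitted on the training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = MinMaxScaler().fit(X_train)

model = LinearRegression().fit(scaler.transform(X_train), y_train)
metrics = evaluate(y_test, model.predict(scaler.transform(X_test)))
```

Returning the metrics as a dictionary makes it easy to collect one row per model and compare all seven side by side.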
- Correlation matrix heatmap
- Target distribution bar chart
- 30-feature scatter plot grid
- K-Means cluster visualization (PCA)
- Per-model prediction plots with training/validation loss curves
StudentPerformanceAI/
├── StudentPerformanceAI.ipynb # Main analysis notebook
├── StudentPerformance_Report_SolimanZakaria.pdf # Full report
├── Model Plots/ # Model performance visualizations
├── Plots/ # EDA visualizations
└── Student Performance Dataset/
    └── student_data.csv # Source dataset
git clone https://github.com/SolyZak/StudentPerformanceAI.git
cd StudentPerformanceAI
pip install pandas numpy scikit-learn tensorflow matplotlib seaborn
jupyter notebook StudentPerformanceAI.ipynb

This project is for educational purposes.