This project analyses demographic and household statistics across Malaysian states using Python. The dataset includes variables such as population size, age distribution, household structure, and urbanisation rates.
Using libraries such as pandas, matplotlib, and scikit-learn, the project performs data cleaning, correlation analysis, regression, K-Means clustering, and Principal Component Analysis (PCA). The goal is to identify patterns and relationships between demographic factors and household characteristics, and to visualise how states differ in their demographic profiles.
The dataset was retrieved from the Department of Statistics Malaysia and contains demographic and household statistics for Malaysian states.
Key variables include:
- Population (thousands)
- Age distribution (0–14, 15–64, 65+)
- Total, urban, and rural households
- Average household size
- Urbanisation rate
The data was cleaned and processed using pandas before analysis.
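The exact cleaning steps are not listed here, so the following is only a minimal sketch of what a pandas cleaning pass might look like. The column names, the `clean` helper, and the inline rows are assumptions made for illustration; the rows are made-up placeholders, not real statistics.

```python
import io

import pandas as pd

# Made-up placeholder rows standing in for pop_stats.csv (not real statistics).
# Column names are assumed for illustration only.
RAW_CSV = """State,Population Thousands,Avg Household Size,Urbanisation Rate
 State A ,1000,3.9,75.0
State B,500,4.2,
State C,n/a,3.5,60.0
State A,1000,3.9,75.0
State D,2000,3.6,90.0
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning pass: tidy names, coerce numbers, drop bad rows."""
    df = df.copy()
    # Normalise column names: trim, lowercase, spaces to underscores.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Trim stray whitespace around state names.
    df["state"] = df["state"].str.strip()
    # Coerce every non-state column to numeric; unparseable values become NaN.
    numeric_cols = df.columns.drop("state")
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    # Keep the first row per state, then drop rows with missing values.
    df = df.drop_duplicates(subset="state").dropna(subset=numeric_cols)
    return df.reset_index(drop=True)


cleaned = clean(pd.read_csv(io.StringIO(RAW_CSV)))
print(cleaned)
```

With the real file, the last step would read `pd.read_csv("pop_stats.csv")` instead of the inline string. Dropping incomplete rows is one option among several; imputing missing values may suit a small dataset better.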
The analysis applies the following methods:
- Correlation analysis
- Regression
- K-Means clustering
- Principal Component Analysis (PCA)
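The four methods above can be sketched with pandas and scikit-learn as follows. This is an illustrative outline, not the project's actual script: the column names, the choice of three clusters, the standardisation step, and the randomly generated data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (random values, not real statistics).
rng = np.random.default_rng(0)
n_states = 12
youth = rng.uniform(18, 32, n_states)
df = pd.DataFrame({
    "youth_share": youth,
    "elderly_share": rng.uniform(4, 10, n_states),
    "urbanisation_rate": rng.uniform(40, 100, n_states),
    "avg_household_size": 2.5 + 0.05 * youth + rng.normal(0, 0.1, n_states),
})

# 1. Correlation analysis: pairwise Pearson correlations between all variables.
corr = df.corr()

# 2. Regression: household size as a linear function of the youth share.
X = df[["youth_share"]]
y = df["avg_household_size"]
reg = LinearRegression().fit(X, y)
slope, intercept, r2 = reg.coef_[0], reg.intercept_, reg.score(X, y)

# Standardise features so that no variable dominates because of its scale.
scaled = StandardScaler().fit_transform(df)

# 3. K-Means clustering (k=3 is an assumed value, not taken from the project).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(scaled)

# 4. PCA: project the standardised features onto two components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(corr.round(2))
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Standardising before K-Means and PCA is a common choice because both are sensitive to feature scale, but whether the project does so is not stated.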
The analysis generates several plots, including:
- Youth population vs household size
- Elderly population vs household size
- Urbanisation vs household size
- Cluster visualisation with regression lines
- PCA plot of Malaysian states
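As an illustration of the first plot type, the sketch below draws a scatter of youth share against household size with a fitted line. The data is randomly generated, and the variable names, output filename, and styling are assumptions rather than the project's actual code.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values for illustration only (not real statistics).
rng = np.random.default_rng(1)
youth_share = rng.uniform(18, 32, 12)
household_size = 2.5 + 0.05 * youth_share + rng.normal(0, 0.1, 12)

# Least-squares line; np.polyfit returns coefficients from highest degree down.
slope, intercept = np.polyfit(youth_share, household_size, deg=1)
x_line = np.linspace(youth_share.min(), youth_share.max(), 50)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(youth_share, household_size, label="States (synthetic)")
ax.plot(x_line, slope * x_line + intercept, color="tab:red", label="Linear fit")
ax.set_xlabel("Population aged 0-14 (%)")
ax.set_ylabel("Average household size")
ax.set_title("Youth population vs household size")
ax.legend()
fig.tight_layout()

# Written to the system temp directory here; change the path as needed.
out_path = os.path.join(tempfile.gettempdir(), "youth_vs_household_size.png")
fig.savefig(out_path, dpi=150)
```

The other scatter plots would follow the same pattern with a different x-axis variable; the cluster plot could additionally pass cluster labels to the `c` argument of `ax.scatter`.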
The project uses the following libraries:
- pandas
- Matplotlib
- scikit-learn
- NumPy
To run the analysis:
- Install dependencies: `pip install pandas matplotlib scikit-learn numpy`
- Place the dataset file `pop_stats.csv` in the project folder.
- Run the script: `python analysis.py`
Possible future improvements:
- Add interactive visualisations
- Include more demographic variables
- Apply additional clustering evaluation methods
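One candidate for the clustering evaluation item is the silhouette score, which could be used to compare different numbers of clusters. The sketch below is only a suggestion: the feature matrix is random, and the range of k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Random stand-in for the state feature matrix (14 rows, 4 features; not real data).
rng = np.random.default_rng(2)
features = rng.normal(size=(14, 4))
scaled = StandardScaler().fit_transform(features)

# Fit K-Means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

# A higher silhouette score suggests better-separated clusters.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

The silhouette score needs at least two clusters and fewer clusters than samples, which limits the usable range of k for a dataset with only a handful of states.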