diff --git a/Cheatsheats/plotting.md b/Cheatsheats/plotting.md
new file mode 100644
index 00000000..96b6cd07
--- /dev/null
+++ b/Cheatsheats/plotting.md
@@ -0,0 +1,12 @@
+| Feature Type | Target Type | What you want to see | Plot to use | What it tells you (in simple words) | Python example |
+| ------------ | ------------------- | -------------------------------------- | ------------------- | ------------------------------------ | ----------------------------------------------- |
+| Number | Number | Does feature increase/decrease target? | Scatter | Shows upward/downward pattern | `sns.scatterplot(x='age', y='salary', data=df)` |
+| Number | Number | Are there extreme values? | Scatter | Dots far away = outliers | `sns.scatterplot(x='age', y='salary', data=df)` |
+| Number | Number | Is data spread normal or skewed? | Histogram | Shows shape of data | `sns.histplot(df['age'], kde=True)` |
+| Number | Number | Compare many numeric features at once | Heatmap | Which feature relates most to target | `sns.heatmap(df.corr(numeric_only=True), annot=True)` |
+| Number | Class (0/1, Yes/No) | Do classes look different? | Box plot | If boxes separate → feature useful | `sns.boxplot(x='class', y='age', data=df)` |
+| Number | Class | How dense values are per class | Violin plot | Shows distribution per class | `sns.violinplot(x='class', y='age', data=df)` |
+| Category | Number | Which category has higher target? | Bar plot (mean) | Shows average target per category | `sns.barplot(x='city', y='sales', data=df)` |
+| Category | Number | Which category appears most? | Count plot | Shows frequency | `sns.countplot(x='city', data=df)` |
+| Category | Class | Relation between two categories | Count plot with hue | Shows class split per category | `sns.countplot(x='gender', hue='buy', data=df)` |
+| Many Numbers | Number/Class | Overall relationship view | Pairplot | Quick scan of all relations | `sns.pairplot(df, hue='class')` |
\ No newline at end of file
diff --git a/Cheatsheats/testing.md b/Cheatsheats/testing.md
new file mode 100644
index 00000000..fa4c50e5
--- /dev/null
+++ b/Cheatsheats/testing.md
@@ -0,0 +1,21 @@
+| Test | Purpose (What it checks) | Data Type | Key Assumptions | Python (scipy/statsmodels) |
+| ------------------------ | ------------------------------------------------- | ----------------------------------- | ----------------------------------------------------- | ------------------------------------------------------------------ |
+| **One-sample t-test** | Compare sample mean to a known value | Numeric | Normal data | `stats.ttest_1samp(x, popmean=mu)` |
+| **Independent t-test** | Compare means of 2 independent groups | Numeric (2 groups) | Normal, equal variance | `stats.ttest_ind(a, b)` |
+| **Paired t-test** | Compare same group before vs after | Numeric (paired) | Normal differences | `stats.ttest_rel(a, b)` |
+| **One-way ANOVA** | Compare means of 3+ groups | Numeric (3+ groups) | Normal, equal variance | `stats.f_oneway(g1, g2, g3)` |
+| **One-way ANOVA (OLS)** | Check if **mean age differs across classes** | Numeric (age) + categorical (class) | Normal residuals, equal variance, independent samples | `model = ols('age ~ C(Q("class"))', data=df).fit()` + `sm.stats.anova_lm(model)` (`class` is a Python keyword, so it must be quoted with `Q()`) |
+| **Two-way ANOVA** | Effect of 2 factors (and their interaction) on mean | Numeric + 2 categorical | Normal residuals, equal variance | `model = ols('y ~ C(a) * C(b)', data=df).fit()` + `sm.stats.anova_lm(model, typ=2)` |
+| **Mann–Whitney U** | 2 groups, non-normal | Ordinal/Numeric | Independent | `stats.mannwhitneyu(a, b)` |
+| **Wilcoxon test** | Paired, non-normal | Ordinal/Numeric | Paired | `stats.wilcoxon(a, b)` |
+| **Kruskal–Wallis** | 3+ groups, non-normal | Ordinal/Numeric | Independent | `stats.kruskal(g1, g2, g3)` |
+| **Chi-square test** | Relationship between categories | Categorical | Expected freq ≥ 5 in each cell | `stats.chi2_contingency(table)` |
+| **Fisher’s Exact** | Small categorical samples | Categorical | 2×2 table | `stats.fisher_exact(table)` |
+| **Pearson correlation** | Linear relation between 2 vars | Numeric | Normal, linear | `stats.pearsonr(x, y)` |
+| **Spearman correlation** | Rank-based relation | Ordinal/Numeric | Monotonic | `stats.spearmanr(x, y)` |
+| **Linear regression** | Predict Y from X | Numeric | Linearity, normal errors | `stats.linregress(x, y)` |
+| **Logistic regression** | Predict binary outcome | Numeric + categorical | Independent | `sklearn.linear_model.LogisticRegression()` |
+| **Shapiro-Wilk** | Test for normality | Numeric | Random sample | `stats.shapiro(x)` |
+| **Kolmogorov–Smirnov** | Compare a sample to a fully specified distribution | Numeric | Continuous; `'norm'` alone means standard normal N(0, 1) | `stats.kstest(x, 'norm')` |
+| **Levene’s test** | Test equal variance | Numeric | Independent | `stats.levene(a, b)` |
+| **Tukey’s HSD** | Find **which specific groups differ** after ANOVA | Numeric + categorical (3+ groups) | Normal data, equal variance, independent groups | `statsmodels.stats.multicomp.pairwise_tukeyhsd(endog=y, groups=g)` |
\ No newline at end of file
diff --git a/ML/10_svm/Exercise/solution_for_understanding.ipynb b/ML/10_svm/Exercise/solution_for_understanding.ipynb
new file mode 100644
index 00000000..af95bcea
--- /dev/null
+++ b/ML/10_svm/Exercise/solution_for_understanding.ipynb
@@ -0,0 +1,1692 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "75076c8d",
+ "metadata": {},
+ "source": [
+ "# SVM with Hyperparameter Tuning"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "57037e1e",
+ "metadata": {},
+ "source": [
+ "## Imports and Data Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "d737904f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from sklearn.datasets import load_digits\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.svm import SVC"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "d30e15d4",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "digits = load_digits()\n",
+ "dir(digits)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3a429a95",
+ "metadata": {},
+ "source": [
+ "### Data Exploration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "647c530e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[ 0., 0., 5., ..., 0., 0., 0.],\n",
+ " [ 0., 0., 0., ..., 10., 0., 0.],\n",
+ " [ 0., 0., 0., ..., 16., 9., 0.],\n",
+ " ...,\n",
+ " [ 0., 0., 1., ..., 6., 0., 0.],\n",
+ " [ 0., 0., 2., ..., 12., 0., 0.],\n",
+ " [ 0., 0., 10., ..., 12., 1., 0.]], shape=(1797, 64))"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "digits.data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "35c0e99e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['pixel_0_0',\n",
+ " 'pixel_0_1',\n",
+ " 'pixel_0_2',\n",
+ " 'pixel_0_3',\n",
+ " 'pixel_0_4',\n",
+ " 'pixel_0_5',\n",
+ " 'pixel_0_6',\n",
+ " 'pixel_0_7',\n",
+ " 'pixel_1_0',\n",
+ " 'pixel_1_1',\n",
+ " 'pixel_1_2',\n",
+ " 'pixel_1_3',\n",
+ " 'pixel_1_4',\n",
+ " 'pixel_1_5',\n",
+ " 'pixel_1_6',\n",
+ " 'pixel_1_7',\n",
+ " 'pixel_2_0',\n",
+ " 'pixel_2_1',\n",
+ " 'pixel_2_2',\n",
+ " 'pixel_2_3',\n",
+ " 'pixel_2_4',\n",
+ " 'pixel_2_5',\n",
+ " 'pixel_2_6',\n",
+ " 'pixel_2_7',\n",
+ " 'pixel_3_0',\n",
+ " 'pixel_3_1',\n",
+ " 'pixel_3_2',\n",
+ " 'pixel_3_3',\n",
+ " 'pixel_3_4',\n",
+ " 'pixel_3_5',\n",
+ " 'pixel_3_6',\n",
+ " 'pixel_3_7',\n",
+ " 'pixel_4_0',\n",
+ " 'pixel_4_1',\n",
+ " 'pixel_4_2',\n",
+ " 'pixel_4_3',\n",
+ " 'pixel_4_4',\n",
+ " 'pixel_4_5',\n",
+ " 'pixel_4_6',\n",
+ " 'pixel_4_7',\n",
+ " 'pixel_5_0',\n",
+ " 'pixel_5_1',\n",
+ " 'pixel_5_2',\n",
+ " 'pixel_5_3',\n",
+ " 'pixel_5_4',\n",
+ " 'pixel_5_5',\n",
+ " 'pixel_5_6',\n",
+ " 'pixel_5_7',\n",
+ " 'pixel_6_0',\n",
+ " 'pixel_6_1',\n",
+ " 'pixel_6_2',\n",
+ " 'pixel_6_3',\n",
+ " 'pixel_6_4',\n",
+ " 'pixel_6_5',\n",
+ " 'pixel_6_6',\n",
+ " 'pixel_6_7',\n",
+ " 'pixel_7_0',\n",
+ " 'pixel_7_1',\n",
+ " 'pixel_7_2',\n",
+ " 'pixel_7_3',\n",
+ " 'pixel_7_4',\n",
+ " 'pixel_7_5',\n",
+ " 'pixel_7_6',\n",
+ " 'pixel_7_7']"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "digits.feature_names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "a548c600",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, ..., 8, 9, 8], shape=(1797,))"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "digits.target"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "968d0907",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "digits.target_names"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "06121aac",
+ "metadata": {},
+ "source": [
+ "### Making Data Frames"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "1825b11b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 pixel_0_4 pixel_0_5 \\\n",
+ "0 0.0 0.0 5.0 13.0 9.0 1.0 \n",
+ "1 0.0 0.0 0.0 12.0 13.0 5.0 \n",
+ "2 0.0 0.0 0.0 4.0 15.0 12.0 \n",
+ "3 0.0 0.0 7.0 15.0 13.0 1.0 \n",
+ "4 0.0 0.0 0.0 1.0 11.0 0.0 \n",
+ "\n",
+ " pixel_0_6 pixel_0_7 pixel_1_0 pixel_1_1 ... pixel_6_6 pixel_6_7 \\\n",
+ "0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
+ "1 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
+ "2 0.0 0.0 0.0 0.0 ... 5.0 0.0 \n",
+ "3 0.0 0.0 0.0 8.0 ... 9.0 0.0 \n",
+ "4 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
+ "\n",
+ " pixel_7_0 pixel_7_1 pixel_7_2 pixel_7_3 pixel_7_4 pixel_7_5 \\\n",
+ "0 0.0 0.0 6.0 13.0 10.0 0.0 \n",
+ "1 0.0 0.0 0.0 11.0 16.0 10.0 \n",
+ "2 0.0 0.0 0.0 3.0 11.0 16.0 \n",
+ "3 0.0 0.0 7.0 13.0 13.0 9.0 \n",
+ "4 0.0 0.0 0.0 2.0 16.0 4.0 \n",
+ "\n",
+ " pixel_7_6 pixel_7_7 \n",
+ "0 0.0 0.0 \n",
+ "1 0.0 0.0 \n",
+ "2 9.0 0.0 \n",
+ "3 0.0 0.0 \n",
+ "4 0.0 0.0 \n",
+ "\n",
+ "[5 rows x 64 columns]"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.DataFrame(digits.data, columns=digits.feature_names)\n",
+ "\n",
+ "x = df.copy()  # copy the features so the target column added to df later does not leak into x\n",
+ "\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "d6f44e1d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(1797, 65)"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df[\"target\"] = digits.target\n",
+ "df.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "d95c7b34",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 0\n",
+ "1 1\n",
+ "2 2\n",
+ "3 3\n",
+ "4 4\n",
+ "Name: target, dtype: int64"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "y = df.target\n",
+ "y.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fac154c9",
+ "metadata": {},
+ "source": [
+ "## Model Fitting"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "5a102519",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "SVC()"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)\n",
+ "\n",
+ "model = SVC()\n",
+ "model.fit(x_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2d5022de",
+ "metadata": {},
+ "source": [
+ "### Calculating Score"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "cc4faeb5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9916666666666667"
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f490bb56",
+ "metadata": {},
+ "source": [
+ "With default settings already scoring above 99%, is tuning even needed? 😂"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da52b92b",
+ "metadata": {},
+ "source": [
+ "## Hyperparameter Tuning"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "20323446",
+ "metadata": {},
+ "source": [
+ "### Regularization (C)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "1f3e1001",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1.0"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = SVC(C=10)\n",
+ "model.fit(x_train, y_train)\n",
+ "model.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "id": "bdbed8e1",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9972222222222222"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = SVC(C=2)\n",
+ "model.fit(x_train, y_train)\n",
+ "model.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "70a2d421",
+ "metadata": {},
+ "source": [
+ "### Kernel"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "id": "f765f825",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9972222222222222"
+ ]
+ },
+ "execution_count": 38,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model_kernel = SVC(kernel=\"poly\")\n",
+ "model_kernel.fit(x_train, y_train)\n",
+ "model_kernel.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "75690032",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.875"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model_kernel = SVC(kernel=\"sigmoid\")\n",
+ "model_kernel.fit(x_train, y_train)\n",
+ "model_kernel.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0cdadc79",
+ "metadata": {},
+ "source": [
+ "### Gamma"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "id": "a567ce08",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.07222222222222222"
+ ]
+ },
+ "execution_count": 43,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model_gamma = SVC(gamma=100000)\n",
+ "model_gamma.fit(x_train, y_train)\n",
+ "model_gamma.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b0f83f09",
+ "metadata": {},
+ "source": [
+ "### Optimized Parameters\n",
+ "\n",
+ "1. **C** = 10\n",
+ "2. **kernel** = poly\n",
+ "3. **gamma**: every value tried made the score worse, so the default is kept"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1765c1f6",
+ "metadata": {},
+ "source": [
+ "## Final Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "id": "5938f026",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9972222222222222"
+ ]
+ },
+ "execution_count": 44,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model_final = SVC(kernel=\"poly\", C=10)\n",
+ "model_final.fit(x_train, y_train)\n",
+ "model_final.score(x_test, y_test)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/11_random_forest/Exercise/solution_for_understanding.ipynb b/ML/11_random_forest/Exercise/solution_for_understanding.ipynb
new file mode 100644
index 00000000..10402bba
--- /dev/null
+++ b/ML/11_random_forest/Exercise/solution_for_understanding.ipynb
@@ -0,0 +1,1522 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "e75365b4",
+ "metadata": {},
+ "source": [
+ "# Iris Classification using Random Forest"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "71da0be6",
+ "metadata": {},
+ "source": [
+ "## Imports and Data Loading"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "6166ae9f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.datasets import load_iris\n",
+ "from sklearn.ensemble import RandomForestClassifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "54275439",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['DESCR',\n",
+ " 'data',\n",
+ " 'data_module',\n",
+ " 'feature_names',\n",
+ " 'filename',\n",
+ " 'frame',\n",
+ " 'target',\n",
+ " 'target_names']"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "iris = load_iris()\n",
+ "dir(iris)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "f9f9074f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array(['setosa', 'versicolor', 'virginica'], dtype='\n",
+ "\n",
+ "\n",
+ " \n",
+ " \n",
+ " \n",
+ " sepal length (cm) \n",
+ " sepal width (cm) \n",
+ " petal length (cm) \n",
+ " petal width (cm) \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 5.1 \n",
+ " 3.5 \n",
+ " 1.4 \n",
+ " 0.2 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 4.9 \n",
+ " 3.0 \n",
+ " 1.4 \n",
+ " 0.2 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 4.7 \n",
+ " 3.2 \n",
+ " 1.3 \n",
+ " 0.2 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 4.6 \n",
+ " 3.1 \n",
+ " 1.5 \n",
+ " 0.2 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 5.0 \n",
+ " 3.6 \n",
+ " 1.4 \n",
+ " 0.2 \n",
+ " \n",
+ " \n",
+ " ... \n",
+ " ... \n",
+ " ... \n",
+ " ... \n",
+ " ... \n",
+ " \n",
+ " \n",
+ " 145 \n",
+ " 6.7 \n",
+ " 3.0 \n",
+ " 5.2 \n",
+ " 2.3 \n",
+ " \n",
+ " \n",
+ " 146 \n",
+ " 6.3 \n",
+ " 2.5 \n",
+ " 5.0 \n",
+ " 1.9 \n",
+ " \n",
+ " \n",
+ " 147 \n",
+ " 6.5 \n",
+ " 3.0 \n",
+ " 5.2 \n",
+ " 2.0 \n",
+ " \n",
+ " \n",
+ " 148 \n",
+ " 6.2 \n",
+ " 3.4 \n",
+ " 5.4 \n",
+ " 2.3 \n",
+ " \n",
+ " \n",
+ " 149 \n",
+ " 5.9 \n",
+ " 3.0 \n",
+ " 5.1 \n",
+ " 1.8 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "150 rows × 4 columns
\n",
+ ""
+ ],
+ "text/plain": [
+ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
+ "0 5.1 3.5 1.4 0.2\n",
+ "1 4.9 3.0 1.4 0.2\n",
+ "2 4.7 3.2 1.3 0.2\n",
+ "3 4.6 3.1 1.5 0.2\n",
+ "4 5.0 3.6 1.4 0.2\n",
+ ".. ... ... ... ...\n",
+ "145 6.7 3.0 5.2 2.3\n",
+ "146 6.3 2.5 5.0 1.9\n",
+ "147 6.5 3.0 5.2 2.0\n",
+ "148 6.2 3.4 5.4 2.3\n",
+ "149 5.9 3.0 5.1 1.8\n",
+ "\n",
+ "[150 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.DataFrame(iris.data, columns=iris.feature_names)\n",
+ "x = df.copy()  # copy the features so the target column added to df later does not leak into x\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "074f7c65",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
+ "0 5.1 3.5 1.4 0.2 \n",
+ "1 4.9 3.0 1.4 0.2 \n",
+ "2 4.7 3.2 1.3 0.2 \n",
+ "3 4.6 3.1 1.5 0.2 \n",
+ "4 5.0 3.6 1.4 0.2 \n",
+ ".. ... ... ... ... \n",
+ "145 6.7 3.0 5.2 2.3 \n",
+ "146 6.3 2.5 5.0 1.9 \n",
+ "147 6.5 3.0 5.2 2.0 \n",
+ "148 6.2 3.4 5.4 2.3 \n",
+ "149 5.9 3.0 5.1 1.8 \n",
+ "\n",
+ " target \n",
+ "0 0 \n",
+ "1 0 \n",
+ "2 0 \n",
+ "3 0 \n",
+ "4 0 \n",
+ ".. ... \n",
+ "145 2 \n",
+ "146 2 \n",
+ "147 2 \n",
+ "148 2 \n",
+ "149 2 \n",
+ "\n",
+ "[150 rows x 5 columns]"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df[\"target\"] = iris.target\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "735e480a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 0\n",
+ "1 0\n",
+ "2 0\n",
+ "3 0\n",
+ "4 0\n",
+ " ..\n",
+ "145 2\n",
+ "146 2\n",
+ "147 2\n",
+ "148 2\n",
+ "149 2\n",
+ "Name: target, Length: 150, dtype: int64"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "y = df.target\n",
+ "y"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b77e1757",
+ "metadata": {},
+ "source": [
+ "## Model Building"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "cb353203",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "RandomForestClassifier()"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = RandomForestClassifier()\n",
+ "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)\n",
+ "model.fit(x_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "8a31ea7f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1.0"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5b1c6a02",
+ "metadata": {},
+ "source": [
+ "## Hyperparameter tuning (not needed here)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "e40ec0a2",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1.0"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = RandomForestClassifier(n_estimators=50)\n",
+ "model.fit(x_train, y_train)\n",
+ "model.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e77b16e3",
+ "metadata": {},
+ "source": [
+ "The test accuracy is already 1.0, the maximum possible score, so tuning cannot improve it."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/13_kmeans/Exercise/solution_for_understanding.ipynb b/ML/13_kmeans/Exercise/solution_for_understanding.ipynb
new file mode 100644
index 00000000..b5a5f789
--- /dev/null
+++ b/ML/13_kmeans/Exercise/solution_for_understanding.ipynb
@@ -0,0 +1,610 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "1b744bc9",
+ "metadata": {},
+ "source": [
+ "# K-Means Clustering and the Elbow Method"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07d4d403",
+ "metadata": {},
+ "source": [
+ "## Imports and Data Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "182cfb80",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.datasets import load_iris\n",
+ "from sklearn.cluster import KMeans"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "5b189a77",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['DESCR',\n",
+ " 'data',\n",
+ " 'data_module',\n",
+ " 'feature_names',\n",
+ " 'filename',\n",
+ " 'frame',\n",
+ " 'target',\n",
+ " 'target_names']"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "iris = load_iris()\n",
+ "dir(iris)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "54a5bfe7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
+ "0 5.1 3.5 1.4 0.2\n",
+ "1 4.9 3.0 1.4 0.2\n",
+ "2 4.7 3.2 1.3 0.2\n",
+ "3 4.6 3.1 1.5 0.2\n",
+ "4 5.0 3.6 1.4 0.2\n",
+ ".. ... ... ... ...\n",
+ "145 6.7 3.0 5.2 2.3\n",
+ "146 6.3 2.5 5.0 1.9\n",
+ "147 6.5 3.0 5.2 2.0\n",
+ "148 6.2 3.4 5.4 2.3\n",
+ "149 5.9 3.0 5.1 1.8\n",
+ "\n",
+ "[150 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.DataFrame(iris.data, columns=iris.feature_names)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "38f00279",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " petal length (cm) petal width (cm)\n",
+ "0 1.4 0.2\n",
+ "1 1.4 0.2\n",
+ "2 1.3 0.2\n",
+ "3 1.5 0.2\n",
+ "4 1.4 0.2\n",
+ ".. ... ...\n",
+ "145 5.2 2.3\n",
+ "146 5.0 1.9\n",
+ "147 5.2 2.0\n",
+ "148 5.4 2.3\n",
+ "149 5.1 1.8\n",
+ "\n",
+ "[150 rows x 2 columns]"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# dropping the sepal columns because we do not need them for this exercise\n",
+ "df.drop([\"sepal length (cm)\", \"sepal width (cm)\"], axis=\"columns\", inplace=True)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e283c46a",
+ "metadata": {},
+ "source": [
+ "## Checking the Data Scatter"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "3712422f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.scatter(df[\"petal length (cm)\"], df[\"petal width (cm)\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc103165",
+ "metadata": {},
+ "source": [
+ "*The scales on x and y are comparable, so we do not need to scale the features.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "455f087c",
+ "metadata": {},
+ "source": [
+ "## Grouping flowers of the same type through clustering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "9591f9e3",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
+ " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,\n",
+ " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "km = KMeans(n_clusters=3)\n",
+ "y_predicted = km.fit_predict(df)\n",
+ "y_predicted"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "8ad393f3",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1.462 , 0.246 ],\n",
+ " [5.59583333, 2.0375 ],\n",
+ " [4.26923077, 1.34230769]])"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "km.cluster_centers_"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e072e9a9",
+ "metadata": {},
+ "source": [
+ "## Visualizing Prediction"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "ee052114",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# appending the prediction column\n",
+ "df[\"clusters\"] = y_predicted\n",
+ "\n",
+ "#plotting different clusters\n",
+ "plt.scatter(df[\"petal length (cm)\"][df.clusters==0], df[\"petal width (cm)\"][df.clusters==0], color=\"blue\")\n",
+ "plt.scatter(df[\"petal length (cm)\"][df.clusters==1], df[\"petal width (cm)\"][df.clusters==1], color=\"red\")\n",
+ "plt.scatter(df[\"petal length (cm)\"][df.clusters==2], df[\"petal width (cm)\"][df.clusters==2], color=\"yellow\")\n",
+ "\n",
+ "#plotting centroids\n",
+ "plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color=\"black\", marker=\"+\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50275a8d",
+ "metadata": {},
+ "source": [
+ "😘"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5fdcd33f",
+ "metadata": {},
+ "source": [
+ "## Elbow Method"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "192ffea5",
+ "metadata": {},
+ "source": [
+ "### Calculating SSE"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "5fba887a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[550.8953333333334,\n",
+ " 86.39021984551397,\n",
+ " 31.37135897435897,\n",
+ " 19.48300089968511,\n",
+ " 13.983213141025638,\n",
+ " 11.147086299967423,\n",
+ " 10.99333075619227,\n",
+ " 8.099060606060608,\n",
+ " 7.800877025394053,\n",
+ " 6.0986768476621425]"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "k_rng = range(1,11)\n",
+ "sse = []\n",
+ "\n",
+ "for k in k_rng:\n",
+ "    km = KMeans(n_clusters=k)\n",
+ "    km.fit(df.drop([\"clusters\"], axis=\"columns\"))\n",
+ "    sse.append(km.inertia_)  # inertia_ is the sum of squared errors (SSE) for this k; the elbow method looks for the k where it stops dropping sharply\n",
+ "\n",
+ "sse"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f22d0911",
+ "metadata": {},
+ "source": [
+ "### Plotting"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "2ac77490",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(k_rng, sse)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38e82828",
+ "metadata": {},
+ "source": [
+ "The elbow is at k = 3, as we guessed."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/14_naive_bayes/Exercise/Solution_for_understanding.ipynb b/ML/14_naive_bayes/Exercise/Solution_for_understanding.ipynb
new file mode 100644
index 00000000..573d9625
--- /dev/null
+++ b/ML/14_naive_bayes/Exercise/Solution_for_understanding.ipynb
@@ -0,0 +1,337 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ef96190f",
+ "metadata": {},
+ "source": [
+ "# Wine Classification"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9deba4e1",
+ "metadata": {},
+ "source": [
+ "## Imports and Data Preparation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "2118753b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from sklearn.datasets import load_wine\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.naive_bayes import MultinomialNB\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "import seaborn as sns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "fc2dd7d0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "wine = load_wine()\n",
+ "dir(wine)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "501b19a3",
+ "metadata": {},
+ "source": [
+ "### Data Exploration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "d30b802c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,\n",
+ " 1.065e+03],\n",
+ " [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,\n",
+ " 1.050e+03],\n",
+ " [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,\n",
+ " 1.185e+03],\n",
+ " ...,\n",
+ " [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,\n",
+ " 8.350e+02],\n",
+ " [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,\n",
+ " 8.400e+02],\n",
+ " [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,\n",
+ " 5.600e+02]], shape=(178, 13))"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "wine.data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "1968878c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['alcohol',\n",
+ " 'malic_acid',\n",
+ " 'ash',\n",
+ " 'alcalinity_of_ash',\n",
+ " 'magnesium',\n",
+ " 'total_phenols',\n",
+ " 'flavanoids',\n",
+ " 'nonflavanoid_phenols',\n",
+ " 'proanthocyanins',\n",
+ " 'color_intensity',\n",
+ " 'hue',\n",
+ " 'od280/od315_of_diluted_wines',\n",
+ " 'proline']"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "wine.feature_names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "e45ce5e1",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,\n",
+ " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
+ " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
+ " 2, 2])"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "wine.target"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "259075ac",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array(['class_0', 'class_1', 'class_2'], dtype='"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sns.histplot(wine.data, kde=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "eba8670b",
+ "metadata": {},
+ "source": [
+ "The data is neither discrete nor normally distributed, so either model could perform better; we have to check both."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1e227ac",
+ "metadata": {},
+ "source": [
+ ">If you are having trouble plotting, check this [cheatsheet](../../../Cheatsheats/plotting.md) in preview mode"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38d670d5",
+ "metadata": {},
+ "source": [
+ "## Model Training"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "93b1d6a4",
+ "metadata": {},
+ "source": [
+ "### GaussianNB"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "755b16ac",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, train_size=0.8)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d24724e6",
+ "metadata": {},
+ "source": [
+ "*We don't have to convert the data to DataFrames: `train_test_split()` and the scikit-learn estimators accept NumPy ndarrays directly.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "bc79bc74",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1.0"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model_gauss = GaussianNB()\n",
+ "model_gauss.fit(x_train, y_train)\n",
+ "model_gauss.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d4073060",
+ "metadata": {},
+ "source": [
+ "### MultinomialNB"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "f5e9311d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9444444444444444"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model_multi = MultinomialNB()\n",
+ "model_multi.fit(x_train, y_train)\n",
+ "model_multi.score(x_test, y_test)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/15_gridsearch/Exercise/Solution_for_understanding.ipynb b/ML/15_gridsearch/Exercise/Solution_for_understanding.ipynb
new file mode 100644
index 00000000..90d5459c
--- /dev/null
+++ b/ML/15_gridsearch/Exercise/Solution_for_understanding.ipynb
@@ -0,0 +1,362 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6c37d4c9",
+ "metadata": {},
+ "source": [
+ "# Best Model and Parameter Selection for Digits Classification"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "550b1066",
+ "metadata": {},
+ "source": [
+ "## Imports and data load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "c6492d0a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from sklearn.datasets import load_digits\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "from sklearn.svm import SVC\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.naive_bayes import MultinomialNB\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "from sklearn.model_selection import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "7b3fc717",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "digits = load_digits()\n",
+ "dir(digits)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dc5e91a0",
+ "metadata": {},
+ "source": [
+ "## Train Test Split"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "351965ce",
+ "metadata": {},
+ "source": [
+ "Not needed, because the models also accept NumPy arrays."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1c4c9227",
+ "metadata": {},
+ "source": [
+ "## Defining the models dictionary"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b2629ab9",
+ "metadata": {},
+ "source": [
+ ">The following dictionary was generated by ChatGPT, which is better at 'guessing' reasonable parameter ranges. The grid is large and takes a lot of CPU time if run locally."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "504d2c6e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "models = {\n",
+ "\n",
+ " 'svm': {\n",
+ " 'model': SVC(),\n",
+ " 'params': {\n",
+ " 'C': [0.1, 1, 10],\n",
+ " 'kernel': ['linear', 'rbf', 'poly'],\n",
+ " 'gamma': ['scale', 0.1, 1]\n",
+ " }\n",
+ " },\n",
+ "\n",
+ " 'LogisticRegression': {\n",
+ " 'model': LogisticRegression(max_iter=1000),\n",
+ " 'params': {\n",
+ " 'C': [0.1, 1, 10],\n",
+ " 'penalty': ['l2'],\n",
+ " 'solver': ['lbfgs']\n",
+ " }\n",
+ " },\n",
+ "\n",
+ " 'MultinomialNB': {\n",
+ " 'model': MultinomialNB(),\n",
+ " 'params': {\n",
+ " 'alpha': [0.1, 0.5, 1.0]\n",
+ " }\n",
+ " },\n",
+ "\n",
+ " 'GaussianNB': {\n",
+ " 'model': GaussianNB(),\n",
+ " 'params': {\n",
+ " 'var_smoothing': [1e-9, 1e-8, 1e-7]\n",
+ " }\n",
+ " },\n",
+ "\n",
+ " 'DecisionTree': {\n",
+ " 'model': DecisionTreeClassifier(),\n",
+ " 'params': {\n",
+ " 'max_depth': [None, 5, 10, 20],\n",
+ " 'min_samples_split': [2, 5, 10],\n",
+ " 'criterion': ['gini', 'entropy']\n",
+ " }\n",
+ " },\n",
+ "\n",
+ " 'RandomForest': {\n",
+ " 'model': RandomForestClassifier(),\n",
+ " 'params': {\n",
+ " 'n_estimators': [100, 200],\n",
+ " 'max_depth': [None, 10, 20],\n",
+ " 'min_samples_split': [2, 5],\n",
+ " 'max_features': ['sqrt', 'log2']\n",
+ " }\n",
+ " }\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fd706afb",
+ "metadata": {},
+ "source": [
+ "## Training every possible model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1be00403",
+ "metadata": {},
+ "source": [
+ "And checking the scores."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "6b5adc7a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1135: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "# GridSearchCV tries every parameter combination for ONE model.\n",
+ "# To compare several models, we still need a for loop.\n",
+ "\n",
+ "scores = []\n",
+ "\n",
+ "for model_name, model_dict in models.items():\n",
+ " grid_search_obj = GridSearchCV(model_dict['model'], model_dict['params'], cv=5, return_train_score=False)\n",
+ " grid_search_obj.fit(digits.data, digits.target)\n",
+ " scores.append({\n",
+ " 'model' : model_name,\n",
+ " 'best parameters' : grid_search_obj.best_params_,\n",
+ " 'best score' : grid_search_obj.best_score_\n",
+ " })"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "496d0647",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " model best parameters \\\n",
+ "0 svm {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'} \n",
+ "1 LogisticRegression {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'} \n",
+ "2 MultinomialNB {'alpha': 0.1} \n",
+ "3 GaussianNB {'var_smoothing': 1e-07} \n",
+ "4 DecisionTree {'criterion': 'entropy', 'max_depth': 20, 'min... \n",
+ "5 RandomForest {'max_depth': 20, 'max_features': 'log2', 'min... \n",
+ "\n",
+ " best score \n",
+ "0 0.973850 \n",
+ "1 0.918217 \n",
+ "2 0.870907 \n",
+ "3 0.832518 \n",
+ "4 0.814708 \n",
+ "5 0.943254 "
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "scores = pd.DataFrame(scores)\n",
+ "scores"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd566dae",
+ "metadata": {},
+ "source": [
+ "**For me, the winner is SVM (C=10, gamma='scale', kernel='rbf'), with the best mean cross-validation score of about 0.974.**"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/17_knn_classification/Exercise/Solution_for_understanding.ipynb b/ML/17_knn_classification/Exercise/Solution_for_understanding.ipynb
new file mode 100644
index 00000000..6980191d
--- /dev/null
+++ b/ML/17_knn_classification/Exercise/Solution_for_understanding.ipynb
@@ -0,0 +1,293 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "e182be46",
+ "metadata": {},
+ "source": [
+ "# Digits Classification using KNN"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "252230bf",
+ "metadata": {},
+ "source": [
+ "## Imports and Data Loading"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "dc23c89d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix, classification_report\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.datasets import load_digits\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "import seaborn as sns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "50546ce1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "digits = load_digits()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "93f59c0d",
+ "metadata": {},
+ "source": [
+ "## Train Test Split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "50617f47",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.8)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "82ed02b4",
+ "metadata": {},
+ "source": [
+ "No DataFrame is needed here: we are already familiar with the dataset, and scikit-learn models accept NumPy arrays directly."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4070bd78",
+ "metadata": {},
+ "source": [
+ "## Model Training"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19a6f7a4",
+ "metadata": {},
+ "source": [
+ "### Trial and Error (as the dataset is very small)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "88f65886",
+ "metadata": {},
+ "source": [
+ "#### N=5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "eaafb061",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9861111111111112"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "knn = KNeighborsClassifier(n_neighbors=5)\n",
+ "knn.fit(x_train, y_train)\n",
+ "knn.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a3d3d0a8",
+ "metadata": {},
+ "source": [
+ "#### N=10"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "50b34362",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9833333333333333"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "knn = KNeighborsClassifier(n_neighbors=10)\n",
+ "knn.fit(x_train, y_train)\n",
+ "knn.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07011b48",
+ "metadata": {},
+ "source": [
+ "#### N=3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "9442ef43",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9944444444444445"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "knn = KNeighborsClassifier(n_neighbors=3)\n",
+ "knn.fit(x_train, y_train)\n",
+ "knn.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e3a5885e",
+ "metadata": {},
+ "source": [
+ "**Lovely 😘**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3c023ac",
+ "metadata": {},
+ "source": [
+ "## Model Evaluation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5abc6d1",
+ "metadata": {},
+ "source": [
+ "### Confusion Matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "148b7959",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "y_predicted = knn.predict(x_test)\n",
+ "\n",
+ "cm = confusion_matrix(y_test, y_predicted)\n",
+ "sns.heatmap(cm, annot=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d8ea0540",
+ "metadata": {},
+ "source": [
+ "### Classification Report"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "e8fa67f0",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " precision recall f1-score support\n",
+ "\n",
+ " 0 1.00 1.00 1.00 40\n",
+ " 1 0.98 1.00 0.99 45\n",
+ " 2 1.00 1.00 1.00 35\n",
+ " 3 0.96 1.00 0.98 23\n",
+ " 4 1.00 1.00 1.00 42\n",
+ " 5 1.00 1.00 1.00 39\n",
+ " 6 1.00 1.00 1.00 35\n",
+ " 7 1.00 1.00 1.00 28\n",
+ " 8 1.00 0.98 0.99 41\n",
+ " 9 1.00 0.97 0.98 32\n",
+ "\n",
+ " accuracy 0.99 360\n",
+ " macro avg 0.99 0.99 0.99 360\n",
+ "weighted avg 0.99 0.99 0.99 360\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "cr = classification_report(y_test, y_predicted)\n",
+ "print(cr)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/1_linear_reg/Exercise/solution.ipynb b/ML/1_linear_reg/Exercise/solution.ipynb
new file mode 100644
index 00000000..2910cd32
--- /dev/null
+++ b/ML/1_linear_reg/Exercise/solution.ipynb
@@ -0,0 +1,1098 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "e9ce1328",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "from sklearn import linear_model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "77641018",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " year per capita income (US$)\n",
+ "0 1970 3399.299037\n",
+ "1 1971 3768.297935\n",
+ "2 1972 4251.175484\n",
+ "3 1973 4804.463248\n",
+ "4 1974 5576.514583\n",
+ "5 1975 5998.144346\n",
+ "6 1976 7062.131392\n",
+ "7 1977 7100.126170\n",
+ "8 1978 7247.967035\n",
+ "9 1979 7602.912681\n",
+ "10 1980 8355.968120\n",
+ "11 1981 9434.390652\n",
+ "12 1982 9619.438377\n",
+ "13 1983 10416.536590\n",
+ "14 1984 10790.328720\n",
+ "15 1985 11018.955850\n",
+ "16 1986 11482.891530\n",
+ "17 1987 12974.806620\n",
+ "18 1988 15080.283450\n",
+ "19 1989 16426.725480\n",
+ "20 1990 16838.673200\n",
+ "21 1991 17266.097690\n",
+ "22 1992 16412.083090\n",
+ "23 1993 15875.586730\n",
+ "24 1994 15755.820270\n",
+ "25 1995 16369.317250\n",
+ "26 1996 16699.826680\n",
+ "27 1997 17310.757750\n",
+ "28 1998 16622.671870\n",
+ "29 1999 17581.024140\n",
+ "30 2000 18987.382410\n",
+ "31 2001 18601.397240\n",
+ "32 2002 19232.175560\n",
+ "33 2003 22739.426280\n",
+ "34 2004 25719.147150\n",
+ "35 2005 29198.055690\n",
+ "36 2006 32738.262900\n",
+ "37 2007 36144.481220\n",
+ "38 2008 37446.486090\n",
+ "39 2009 32755.176820\n",
+ "40 2010 38420.522890\n",
+ "41 2011 42334.711210\n",
+ "42 2012 42665.255970\n",
+ "43 2013 42676.468370\n",
+ "44 2014 41039.893600\n",
+ "45 2015 35175.188980\n",
+ "46 2016 34229.193630"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv(\"canada_per_capita_income.csv\")\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "1e7d0d60",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.xlabel(\"Year\")\n",
+ "plt.ylabel(\"Per Capita Income (in USD)\")\n",
+ "plt.scatter(df[\"year\"], df[\"per capita income (US$)\"], marker='x', color='red')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "6473cb0d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "LinearRegression()"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = linear_model.LinearRegression()\n",
+ "model.fit(df[[\"year\"]], df[[\"per capita income (US$)\"]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "b647dce5",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\site-packages\\sklearn\\utils\\validation.py:2749: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[41288.69409442]])"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.predict([[2020]])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f3f15d7a",
+ "metadata": {},
+ "source": [
+ "### The following code just plots the regression line."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a3ada5b1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\site-packages\\sklearn\\utils\\validation.py:2749: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Predicted income for 2020: 41288.69409441762\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\site-packages\\sklearn\\utils\\validation.py:2749: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n",
+ "C:\\Users\\HC\\AppData\\Local\\Temp\\ipykernel_8084\\4051086270.py:10: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)\n",
+ " print('Predicted income for 2020:', float(model.predict([[2020]])))\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Plot data with regression line and show prediction for 2020\n",
+ "plt.scatter(df['year'], df['per capita income (US$)'], marker='x', color='red')\n",
+ "preds = model.predict(df[[\"year\"]])\n",
+ "plt.plot(df['year'], preds, color='blue')\n",
+ "plt.xlabel('Year')\n",
+ "plt.ylabel('Per Capita Income (US$)')\n",
+ "plt.title('Canada Per Capita Income with Linear Regression')\n",
+ "plt.show()\n",
+ "print('Predicted income for 2020:', model.predict([[2020]]).item())"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.13.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/1_linear_reg/flashcards.md b/ML/1_linear_reg/flashcards.md
new file mode 100644
index 00000000..c3b673a9
--- /dev/null
+++ b/ML/1_linear_reg/flashcards.md
@@ -0,0 +1,75 @@
+### **Card 1**
+**Front**: What is the mathematical equation for linear regression?
+**Back**:
+`y = mx + b`
+Where: y = dependent variable, x = independent variable, m = slope/coefficient, b = intercept
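
To make the equation concrete, here is a minimal sketch of how `m` and `b` can be computed by ordinary least squares in plain Python. The helper name `fit_line` and the toy data are illustrative assumptions, not part of the course material; scikit-learn's `LinearRegression` solves the same least-squares problem for you.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b


# Toy data (assumed for illustration) generated from y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, b = fit_line(xs, ys)
print(m, b)
```

Because the toy points lie exactly on a line, the recovered slope and intercept match the values used to generate them.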
+
+---
+
+### **Card 2**
+**Front**: Which sklearn module contains the LinearRegression class?
+**Back**:
+`sklearn.linear_model`, imported with `from sklearn import linear_model`
+
+---
+
+### **Card 3**
+**Front**: What method trains a linear regression model on data?
+**Back**:
+`model.fit(X, y)`
+Where X = feature matrix, y = target vector
+
+---
+
+### **Card 4**
+**Front**: How do you make predictions with a trained linear regression model?
+**Back**:
+`model.predict(X_new)`
+Where X_new = new feature values to predict
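
A short sketch of the fit/predict round trip, assuming pandas and scikit-learn are installed. The column names and numbers are made up for illustration. Passing `X_new` as a DataFrame with the same column name used in `fit` avoids the "X does not have valid feature names" warning that appears when a model fitted on a DataFrame is asked to predict from a plain list.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data (assumed): income grows by exactly 10 per year
train = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "income": [100.0, 110.0, 120.0, 130.0],
})

model = LinearRegression()
model.fit(train[["year"]], train["income"])  # X must be 2-D, y can be 1-D

# Use the same column name as in training so feature names match
X_new = pd.DataFrame({"year": [2005]})
predicted = float(model.predict(X_new)[0])
print(predicted)
```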
+
+---
+
+### **Card 5**
+**Front**: What do the coef_ and intercept_ attributes store in a linear regression model?
+**Back**:
+- `coef_`: Slope coefficients (m values)
+- `intercept_`: Y-intercept (b value)
+
+---
+
+### **Card 6**
+**Front**: How can you manually calculate predictions using coef_ and intercept_?
+**Back**:
+`prediction = X @ model.coef_ + model.intercept_` (each feature is multiplied by its coefficient, and the products are summed)
+For a single feature: `prediction = x * model.coef_[0] + model.intercept_`
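
The same calculation written out in plain Python may help when there are several features. This is only a sketch: the function name `predict_manually` and the coefficient values are hypothetical stand-ins for `model.coef_` and `model.intercept_` of a model fitted with a 1-D target.

```python
def predict_manually(features, coef, intercept):
    """Weighted sum of the features plus the intercept."""
    return sum(x * w for x, w in zip(features, coef)) + intercept


# Hypothetical values standing in for model.coef_ and model.intercept_
coef = [2000.0, 1500.0, 1000.0]
intercept = 10000.0

# One candidate: experience, test score, interview score
candidate = [2, 9, 6]
prediction = predict_manually(candidate, coef, intercept)
print(prediction)  # 2*2000 + 9*1500 + 6*1000 + 10000
```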
+
+---
+
+### **Card 7**
+**Front**: What matplotlib function creates a scatter plot?
+**Back**:
+`plt.scatter(x, y)`
+Plots x values against y values as points
+
+---
+
+### **Card 8**
+**Front**: How do you remove a column from a pandas DataFrame?
+**Back**:
+`df.drop('column_name', axis=1)`
+or `df.drop('column_name', axis='columns')`
+
+---
+
+### **Card 9**
+**Front**: How do you add a new column with predictions to a DataFrame?
+**Back**:
+`df['new_column_name'] = predictions_array`
+
+---
+
+### **Card 10**
+**Front**: What method exports a DataFrame to a CSV file?
+**Back**:
+`df.to_csv("filename.csv")`
+Saves DataFrame as comma-separated values file
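
One detail worth seeing in practice: `to_csv` also writes the row index by default, and `index=False` leaves it out. The sketch below assumes pandas is installed and uses a made-up two-row DataFrame; calling `to_csv` without a path returns the CSV text instead of writing a file.

```python
import pandas as pd

# Made-up predictions table for illustration
df = pd.DataFrame({
    "year": [2019, 2020],
    "predicted_income": [40000, 41000],
})

# No path given, so to_csv returns the CSV content as a string
with_index = df.to_csv().splitlines()
without_index = df.to_csv(index=False).splitlines()

print(with_index)
print(without_index)
```

With the default settings the first column of the file is the unnamed index, which is often unwanted when the file is read back later.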
\ No newline at end of file
diff --git a/ML/2_linear_reg_multivariate/Exercise/solution.ipynb b/ML/2_linear_reg_multivariate/Exercise/solution.ipynb
new file mode 100644
index 00000000..641e179f
--- /dev/null
+++ b/ML/2_linear_reg_multivariate/Exercise/solution.ipynb
@@ -0,0 +1,1076 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "3dac90a3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn import linear_model\n",
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "b07f09c4",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " experience test_score(out of 10) interview_score(out of 10) salary($)\n",
+ "0 NaN 8.0 9 50000\n",
+ "1 NaN 8.0 6 45000\n",
+ "2 five 6.0 7 60000\n",
+ "3 two 10.0 10 65000\n",
+ "4 seven 9.0 6 70000\n",
+ "5 three 7.0 10 62000\n",
+ "6 ten NaN 7 72000\n",
+ "7 eleven 7.0 8 80000"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv(\"hiring.csv\")\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "96c3ae09",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " experience test_score(out of 10) interview_score(out of 10) salary($)\n",
+ "0 0 8.0 9 50000\n",
+ "1 0 8.0 6 45000\n",
+ "2 5 6.0 7 60000\n",
+ "3 2 10.0 10 65000\n",
+ "4 7 9.0 6 70000\n",
+ "5 3 7.0 10 62000\n",
+ "6 10 NaN 7 72000\n",
+ "7 11 7.0 8 80000"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from word2number import w2n\n",
+ "df.experience = df.experience.fillna(\"zero\")\n",
+ "df.experience = df.experience.apply(w2n.word_to_num)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "33ad39f6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\HC\\AppData\\Local\\Temp\\ipykernel_6756\\2839352622.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.\n",
+ "The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.\n",
+ "\n",
+ "For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.\n",
+ "\n",
+ "\n",
+ " df[r'test_score(out of 10)'].fillna(df[r'test_score(out of 10)'].mean(), inplace=True)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " experience test_score(out of 10) interview_score(out of 10) salary($)\n",
+ "0 0 8.000000 9 50000\n",
+ "1 0 8.000000 6 45000\n",
+ "2 5 6.000000 7 60000\n",
+ "3 2 10.000000 10 65000\n",
+ "4 7 9.000000 6 70000\n",
+ "5 3 7.000000 10 62000\n",
+ "6 10 7.857143 7 72000\n",
+ "7 11 7.000000 8 80000"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df[r'test_score(out of 10)'] = df[r'test_score(out of 10)'].fillna(df[r'test_score(out of 10)'].mean())\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c92483e7",
+ "metadata": {},
+ "source": [
+ "### Visualising patterns and dependencies\n",
+ "The pair plot did not turn out very informative, I guess 😅"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "366541b7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import seaborn as sns\n",
+ "sns.pairplot(df)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "3184c055",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "LinearRegression()"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = linear_model.LinearRegression()\n",
+ "model.fit(df[['experience', 'test_score(out of 10)', 'interview_score(out of 10)']], df['salary($)'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "1acba9d9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[53290.89255945]\n",
+ "[92268.07227784]\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\site-packages\\sklearn\\utils\\validation.py:2749: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n",
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python313\\Lib\\site-packages\\sklearn\\utils\\validation.py:2749: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(model.predict([[2, 9, 6]]))\n",
+ "print(model.predict([[12, 10, 10]]))"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.13.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/2_linear_reg_multivariate/flashcards.md b/ML/2_linear_reg_multivariate/flashcards.md
new file mode 100644
index 00000000..d86e87a6
--- /dev/null
+++ b/ML/2_linear_reg_multivariate/flashcards.md
@@ -0,0 +1,70 @@
+### **Card 1**
+**Front**: What is multiple linear regression?
+**Back**:
+A regression model with multiple independent variables predicting one dependent variable:
+`y = b + m₁x₁ + m₂x₂ + ... + mₙxₙ`
+
+---
+
+### **Card 2**
+**Front**: How do you handle missing values in a DataFrame column using median?
+**Back**:
+`df['column'] = df['column'].fillna(df['column'].median())`
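A minimal sketch of the median fill on a small, made-up column (the `test_score` name and its values are assumptions for illustration). Assigning the result back to the column avoids the chained-assignment `FutureWarning` that `fillna(..., inplace=True)` on a column selection triggers in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing test score.
df = pd.DataFrame({"test_score": [8.0, 6.0, np.nan, 10.0, 7.0]})

# median() skips NaN by default, so it is computed from 6, 7, 8 and 10.
median_score = df["test_score"].median()

# Assign the filled column back instead of using inplace=True.
df["test_score"] = df["test_score"].fillna(median_score)
```

After this runs, the formerly missing row holds the median of the remaining four values.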
+
+---
+
+### **Card 3**
+**Front**: How to select all columns except one as features for training?
+**Back**:
+`df.drop('target_column', axis=1)`
+Returns DataFrame without the specified column
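As a rough illustration, the feature/target split can be sketched like this. The column names and values are hypothetical; the point is only that `drop` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data with two features and one target column.
df = pd.DataFrame({
    "experience": [0, 5, 2],
    "test_score": [8.0, 6.0, 10.0],
    "salary": [50000, 60000, 65000],
})

X = df.drop("salary", axis=1)   # every column except the target
y = df["salary"]                # the target column

# drop() does not modify df, so df still has all three columns here.
```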
+
+---
+
+### **Card 4**
+**Front**: What does `reg.coef_` return in multiple linear regression?
+**Back**:
+An array of coefficients - one for each feature in the order they were provided
+
+---
+
+### **Card 5**
+**Front**: How to manually calculate predictions using coefficients and intercept?
+**Back**:
+`prediction = intercept + (feature1 * coef1) + (feature2 * coef2) + ...`
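To make the formula concrete, here is a small sketch that compares the manual calculation with `model.predict`. The data is made up and generated from `y = 3 + 2*x1 + 5*x2` without noise, so the fitted coefficients should come out close to 2 and 5 and the intercept close to 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data generated from y = 3 + 2*x1 + 5*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 3 + 2 * X[:, 0] + 5 * X[:, 1]

model = LinearRegression().fit(X, y)

new_row = np.array([6.0, 2.0])

# intercept + sum(coef_i * feature_i), written as a dot product.
manual = model.intercept_ + np.dot(model.coef_, new_row)

# predict() expects a 2D array: one row per sample.
from_model = model.predict(new_row.reshape(1, -1))[0]
```

Both values agree up to floating-point error, which is a quick way to check that the coefficients are being read in the right feature order.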
+
+---
+
+### **Card 6**
+**Front**: What's the correct format for prediction input with multiple features?
+**Back**:
+`model.predict([[feature1, feature2, feature3]])`
+Double brackets for 2D array format
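When the model was fitted on a DataFrame, passing a plain nested list to `predict` works but makes scikit-learn emit the "X does not have valid feature names" warning. One way around it, sketched here with hypothetical rows and column names, is to wrap the new sample in a DataFrame that uses the same columns.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training rows; the column names are assumptions for this sketch.
features = ["experience", "test_score", "interview_score"]
train = pd.DataFrame(
    [[0, 8.0, 9], [5, 6.0, 7], [2, 10.0, 10], [7, 9.0, 6], [11, 7.0, 8]],
    columns=features,
)
salary = [50000, 60000, 65000, 70000, 80000]

model = LinearRegression().fit(train, salary)

# Still 2D (one row, three columns), but with matching feature names.
new_candidate = pd.DataFrame([[2, 9.0, 6]], columns=features)
prediction = model.predict(new_candidate)
```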
+
+---
+
+### **Card 7**
+**Front**: What are independent variables called in machine learning?
+**Back**:
+Features or predictors
+
+---
+
+### **Card 8**
+**Front**: What is the dependent variable called in machine learning?
+**Back**:
+Target or response variable
+
+---
+
+### **Card 9**
+**Front**: How to get the median of a DataFrame column?
+**Back**:
+`df['column_name'].median()`
+
+---
+
+### **Card 10**
+**Front**: What does data preprocessing typically involve?
+**Back**:
+Handling missing values, feature scaling, encoding categorical variables
diff --git a/ML/3_gradient_descent/Exercise/solution.ipynb b/ML/3_gradient_descent/Exercise/solution.ipynb
new file mode 100644
index 00000000..f546084b
--- /dev/null
+++ b/ML/3_gradient_descent/Exercise/solution.ipynb
@@ -0,0 +1,940 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "50084964",
+ "metadata": {},
+ "source": [
+ "## The Process of Gradient Descent\n",
+ "\n",
+ "1. Initialize the parameters (slope and intercept) with random values or zeros\n",
+ "2. Use the current parameters to make predictions\n",
+ "3. Compute the cost function (e.g., MSE)\n",
+ "4. Compute the partial derivatives of the cost with respect to the slope and intercept\n",
+ "5. Update the parameters using the learning rate and the gradients\n",
+ "6. Repeat steps 2–5 until the cost converges or a set number of iterations is reached"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "69df8424",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np #np also has an isclose function\n",
+ "from sklearn.linear_model import LinearRegression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "15fab6d0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df = pd.read_csv(\"test_scores.csv\")\n",
+ "x = np.array(df.math) #arrays are efficient, series give errors in function defined below\n",
+ "y = np.array(df.cs)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "45600db6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def gradient_desc(x, y):\n",
+ "    m_current = 0\n",
+ "    b_current = 0\n",
+ "    learning_rate = 0.0002\n",
+ "    iterations = 1000000\n",
+ "    n = len(x)\n",
+ "    mse_previous = 0\n",
+ "\n",
+ "    for i in range(iterations):\n",
+ "        y_predicted = m_current * x + b_current\n",
+ "\n",
+ "        if i > 0:\n",
+ "            mse_previous = mse\n",
+ "        mse = (1 / n) * np.sum((y - y_predicted) ** 2)\n",
+ "\n",
+ "        # Stop once the cost no longer changes between iterations\n",
+ "        if np.isclose(mse, mse_previous, rtol=1e-20):\n",
+ "            return m_current, b_current\n",
+ "\n",
+ "        md = -(2 / n) * np.sum(x * (y - y_predicted))\n",
+ "        bd = -(2 / n) * np.sum(y - y_predicted)\n",
+ "\n",
+ "        m_current = m_current - learning_rate * md\n",
+ "        b_current = b_current - learning_rate * bd\n",
+ "\n",
+ "    # Not converged within the iteration budget: return the last estimate\n",
+ "    return m_current, b_current\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "dfa06b83",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1.0186053631193346 1.8536266598224878\n"
+ ]
+ }
+ ],
+ "source": [
+ "slope, intercept = gradient_desc(x, y)\n",
+ "print(slope, intercept)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da5d1965",
+ "metadata": {},
+ "source": [
+ "*We have just implemented linear regression from scratch in Python with gradient descent. It found the slope and the intercept, so the fitted equation is:*\n",
+ "$y = 1.0186053631193346x + 1.8536266598224878$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9d1db445",
+ "metadata": {},
+ "source": [
+ "Checking the result against scikit-learn's `LinearRegression`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "34765020",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "LinearRegression()"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = LinearRegression()\n",
+ "model.fit(df[[\"math\"]], y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "b06185f9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[1.01773624] 1.9152193111568891\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(model.coef_, model.intercept_)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "291620d5",
+ "metadata": {},
+ "source": [
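+ "They match closely 😎: slope 1.0186 vs 1.0177 and intercept 1.8536 vs 1.9152. The small gap is expected because gradient descent stops once the MSE change falls within the `np.isclose` tolerance instead of solving for the exact minimum."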
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/3_gradient_descent/flashcards.md b/ML/3_gradient_descent/flashcards.md
new file mode 100644
index 00000000..7c09db4d
--- /dev/null
+++ b/ML/3_gradient_descent/flashcards.md
@@ -0,0 +1,88 @@
+### **Card 1**
+**Front**: What is the Mean Squared Error (MSE) formula?
+**Back**:
+`MSE = (1/n) * Σ(y_actual - y_predicted)²`
+Where n = number of samples
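A tiny worked example of the formula, using made-up values. The errors are 1, 0 and -2, so the squared errors are 1, 0 and 4 and the MSE is 5/3.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.0, 5.0, 9.0])

errors = y_actual - y_predicted     # [1, 0, -2]
mse = np.mean(errors ** 2)          # (1 + 0 + 4) / 3
```

Note that the positive and negative errors would partly cancel if they were averaged without squaring.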
+
+---
+
+### **Card 2**
+**Front**: What is a cost function in machine learning?
+**Back**:
+A function that measures how wrong the model's predictions are.
+MSE is a common cost function for regression.
+
+---
+
+### **Card 3**
+**Front**: What is gradient descent?
+**Back**:
+An optimization algorithm that minimizes the cost function by iteratively adjusting parameters in the direction of steepest descent.
+
+---
+
+### **Card 4**
+**Front**: What is the learning rate (α) in gradient descent?
+**Back**:
+A hyperparameter that controls the step size during parameter updates.
+Too small = slow convergence, too large = may overshoot minimum.
+
+---
+
+### **Card 5**
+**Front**: What are the gradient descent update rules for linear regression?
+**Back**:
+`m_new = m - α * ∂J/∂m`
+`b_new = b - α * ∂J/∂b`
+Where J is the cost function
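The update rules can be sketched as a short loop. The data (points on the line `y = 2x + 1`) and the learning rate of 0.01 are assumptions picked so that the cost visibly drops over a handful of steps; this is not the notebook's dataset.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

m, b = 0.0, 0.0
alpha = 0.01          # learning rate, chosen by hand for this data
n = len(x)
costs = []

for _ in range(5):
    y_pred = m * x + b
    costs.append(np.mean((y - y_pred) ** 2))     # J at the current m, b
    dJ_dm = (-2 / n) * np.sum(x * (y - y_pred))  # partial derivative w.r.t. m
    dJ_db = (-2 / n) * np.sum(y - y_pred)        # partial derivative w.r.t. b
    m = m - alpha * dJ_dm                        # m_new = m - alpha * dJ/dm
    b = b - alpha * dJ_db                        # b_new = b - alpha * dJ/db
```

Each entry in `costs` is smaller than the previous one, which is the behaviour the update rules are meant to produce when the learning rate is small enough.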
+
+---
+
+### **Card 6**
+**Front**: What do partial derivatives represent in gradient descent?
+**Back**:
+- `∂J/∂m` = rate of cost change with respect to slope
+- `∂J/∂b` = rate of cost change with respect to intercept
+
+---
+
+### **Card 7**
+**Front**: What are the partial derivative formulas for MSE cost function?
+**Back**:
+`∂J/∂m = (-2/n) * Σ x(y - (mx + b))`
+`∂J/∂b = (-2/n) * Σ (y - (mx + b))`
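One way to gain confidence in these formulas is to compare them with a numerical derivative. The sketch below does that with central differences on made-up data; the helper names `mse` and `gradients` are hypothetical and exist only for this check.

```python
import numpy as np

def mse(m, b, x, y):
    """Cost J(m, b): mean squared error of the line m*x + b."""
    return np.mean((y - (m * x + b)) ** 2)

def gradients(m, b, x, y):
    """Analytic partial derivatives of the MSE with respect to m and b."""
    n = len(x)
    residual = y - (m * x + b)
    dJ_dm = (-2 / n) * np.sum(x * residual)
    dJ_db = (-2 / n) * np.sum(residual)
    return dJ_dm, dJ_db

# Hypothetical data and an arbitrary point (m, b) to evaluate at.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.5, 5.5, 8.0])
m, b, eps = 0.5, 0.1, 1e-6

dJ_dm, dJ_db = gradients(m, b, x, y)

# Central differences: (J(p + eps) - J(p - eps)) / (2 * eps)
numeric_dm = (mse(m + eps, b, x, y) - mse(m - eps, b, x, y)) / (2 * eps)
numeric_db = (mse(m, b + eps, x, y) - mse(m, b - eps, x, y)) / (2 * eps)
```

The analytic and numerical values should agree to several decimal places; a sign error or a missing factor of `x` would show up immediately.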
+
+---
+
+### **Card 8**
+**Front**: What happens if learning rate is too large?
+**Back**:
+Overshooting the minimum, causing divergence or oscillation around the optimal values.
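The effect of the step size is easy to see on a toy problem. In this sketch (made-up data on `y = 2x + 1`, hypothetical helper `final_cost`), a learning rate of 0.01 reduces the cost, while 0.2 is large enough for this data that every step overshoots and the cost grows.

```python
import numpy as np

def final_cost(alpha, steps=50):
    """Run `steps` gradient descent updates and return the resulting MSE."""
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 2 * x + 1
    m, b, n = 0.0, 0.0, len(x)
    for _ in range(steps):
        residual = y - (m * x + b)
        m -= alpha * (-2 / n) * np.sum(x * residual)
        b -= alpha * (-2 / n) * np.sum(residual)
    return np.mean((y - (m * x + b)) ** 2)

start_cost = final_cost(alpha=0.01, steps=0)   # cost at m = 0, b = 0
small_step_cost = final_cost(alpha=0.01)       # converging
large_step_cost = final_cost(alpha=0.2)        # diverging
```

Which values count as "too large" depends on the scale of the features, so the thresholds here only apply to this toy data.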
+
+---
+
+### **Card 9**
+**Front**: What happens if learning rate is too small?
+**Back**:
+Very slow convergence, requiring many iterations to reach the minimum.
+
+---
+
+### **Card 10**
+**Front**: How do you know when gradient descent has converged?
+**Back**:
+When the cost function stops decreasing significantly between iterations.
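A convergence check can be sketched as an early exit from the loop. The tolerance `tol`, the iteration cap `max_iter` and the data are assumptions for illustration; the function also returns the number of iterations used so the early stop is visible.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, tol=1e-12, max_iter=100_000):
    """Fit y = m*x + b, stopping when the cost barely changes."""
    m, b, n = 0.0, 0.0, len(x)
    previous_cost = None
    for i in range(max_iter):
        residual = y - (m * x + b)
        cost = np.mean(residual ** 2)
        # Converged: the cost changed by less than tol since the last step.
        if previous_cost is not None and abs(previous_cost - cost) < tol:
            break
        previous_cost = cost
        m -= alpha * (-2 / n) * np.sum(x * residual)
        b -= alpha * (-2 / n) * np.sum(residual)
    return m, b, i

# Hypothetical data lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1
m, b, iterations = gradient_descent(x, y)
```

A tighter `tol` gives estimates closer to the exact least-squares solution at the price of more iterations, which is the same trade-off behind the small gap between a gradient descent result and scikit-learn's closed-form fit.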
+
+---
+
+### **Card 11**
+**Front**: What is the typical range for learning rate values?
+**Back**:
+Usually between 0.001 and 0.1, but depends on the specific problem and data scale.
+
+---
+
+### **Card 12**
+**Front**: Why do we square the errors in MSE?
+**Back**:
+So that positive and negative errors do not cancel each other out, and so that larger errors are penalized more heavily.
\ No newline at end of file
diff --git a/ML/3_gradient_descent/image-1.png b/ML/3_gradient_descent/image-1.png
new file mode 100644
index 00000000..ae2d09eb
Binary files /dev/null and b/ML/3_gradient_descent/image-1.png differ
diff --git a/ML/3_gradient_descent/image-2.png b/ML/3_gradient_descent/image-2.png
new file mode 100644
index 00000000..215ddede
Binary files /dev/null and b/ML/3_gradient_descent/image-2.png differ
diff --git a/ML/3_gradient_descent/image-3.png b/ML/3_gradient_descent/image-3.png
new file mode 100644
index 00000000..467bc89c
Binary files /dev/null and b/ML/3_gradient_descent/image-3.png differ
diff --git a/ML/3_gradient_descent/image-4.png b/ML/3_gradient_descent/image-4.png
new file mode 100644
index 00000000..85bc797c
Binary files /dev/null and b/ML/3_gradient_descent/image-4.png differ
diff --git a/ML/3_gradient_descent/image-5.png b/ML/3_gradient_descent/image-5.png
new file mode 100644
index 00000000..31d1119e
Binary files /dev/null and b/ML/3_gradient_descent/image-5.png differ
diff --git a/ML/3_gradient_descent/image.png b/ML/3_gradient_descent/image.png
new file mode 100644
index 00000000..7586a145
Binary files /dev/null and b/ML/3_gradient_descent/image.png differ
diff --git a/ML/4_save_model/4_save_and_load_model_using_pickle.ipynb b/ML/4_save_model/4_save_and_load_model_using_pickle.ipynb
index b1bcf5ac..cc8410d4 100644
--- a/ML/4_save_model/4_save_and_load_model_using_pickle.ipynb
+++ b/ML/4_save_model/4_save_and_load_model_using_pickle.ipynb
@@ -9,7 +9,7 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -20,7 +20,7 @@
},
{
"cell_type": "code",
- "execution_count": 2,
+ "execution_count": 3,
"metadata": {},
"outputs": [
{
@@ -87,7 +87,7 @@
"4 4000 725000"
]
},
- "execution_count": 2,
+ "execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@@ -99,17 +99,774 @@
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
- " normalize=False)"
+ "LinearRegression()"
]
},
- "execution_count": 3,
+ "execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@@ -121,7 +878,7 @@
},
{
"cell_type": "code",
- "execution_count": 4,
+ "execution_count": 5,
"metadata": {},
"outputs": [
{
@@ -130,7 +887,7 @@
"array([135.78767123])"
]
},
- "execution_count": 4,
+ "execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -141,16 +898,16 @@
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "180616.43835616432"
+ "np.float64(180616.43835616432)"
]
},
- "execution_count": 5,
+ "execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -164,6 +921,14 @@
"execution_count": 7,
"metadata": {},
"outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\utils\\validation.py:2691: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ },
{
"data": {
"text/plain": [
@@ -201,8 +966,8 @@
"metadata": {},
"outputs": [],
"source": [
- "with open('model_pickle','wb') as file:\n",
- " pickle.dump(model,file)"
+ "with open(\"model_pickle\", \"wb\") as f:\n",
+ " pickle.dump(model, f)"
]
},
{
@@ -218,8 +983,8 @@
"metadata": {},
"outputs": [],
"source": [
- "with open('model_pickle','rb') as file:\n",
- " mp = pickle.load(file)"
+ "with open(\"model_pickle\", \"rb\") as f:\n",
+ " mp = pickle.load(f)"
]
},
{
@@ -252,7 +1017,7 @@
{
"data": {
"text/plain": [
- "180616.43835616432"
+ "np.float64(180616.43835616432)"
]
},
"execution_count": 12,
@@ -269,6 +1034,14 @@
"execution_count": 13,
"metadata": {},
"outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\utils\\validation.py:2691: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ },
{
"data": {
"text/plain": [
@@ -293,16 +1066,16 @@
},
{
"cell_type": "code",
- "execution_count": 14,
+ "execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
- "from sklearn.externals import joblib"
+ "import joblib"
]
},
{
"cell_type": "code",
- "execution_count": 15,
+ "execution_count": 17,
"metadata": {},
"outputs": [
{
@@ -311,13 +1084,13 @@
"['model_joblib']"
]
},
- "execution_count": 15,
+ "execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "joblib.dump(model, 'model_joblib')"
+ "joblib.dump(model, \"model_joblib\")"
]
},
{
@@ -329,16 +1102,16 @@
},
{
"cell_type": "code",
- "execution_count": 16,
+ "execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
- "mj = joblib.load('model_joblib')"
+ "mj = joblib.load(\"model_joblib\")"
]
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -347,7 +1120,7 @@
"array([135.78767123])"
]
},
- "execution_count": 17,
+ "execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
@@ -358,7 +1131,7 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 20,
"metadata": {
"scrolled": true
},
@@ -366,10 +1139,10 @@
{
"data": {
"text/plain": [
- "180616.43835616432"
+ "np.float64(180616.43835616432)"
]
},
- "execution_count": 18,
+ "execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
@@ -380,16 +1153,24 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 21,
"metadata": {},
"outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\utils\\validation.py:2691: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ },
{
"data": {
"text/plain": [
"array([859554.79452055])"
]
},
- "execution_count": 19,
+ "execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
@@ -415,7 +1196,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.3"
+ "version": "3.14.2"
}
},
"nbformat": 4,
diff --git a/ML/4_save_model/model_joblib b/ML/4_save_model/model_joblib
index 7f359b25..f5fc707d 100644
Binary files a/ML/4_save_model/model_joblib and b/ML/4_save_model/model_joblib differ
diff --git a/ML/4_save_model/model_pickle b/ML/4_save_model/model_pickle
index a46585e2..9d98c79e 100644
Binary files a/ML/4_save_model/model_pickle and b/ML/4_save_model/model_pickle differ
diff --git a/ML/5_one_hot_encoding/Exercise/exercise_one_hot_encoding.ipynb b/ML/5_one_hot_encoding/Exercise/exercise_one_hot_encoding.ipynb
deleted file mode 100644
index 53a78aab..00000000
--- a/ML/5_one_hot_encoding/Exercise/exercise_one_hot_encoding.ipynb
+++ /dev/null
@@ -1,1000 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " Car Model \n",
- " Mileage \n",
- " Sell Price($) \n",
- " Age(yrs) \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " BMW X5 \n",
- " 69000 \n",
- " 18000 \n",
- " 6 \n",
- " \n",
- " \n",
- " 1 \n",
- " BMW X5 \n",
- " 35000 \n",
- " 34000 \n",
- " 3 \n",
- " \n",
- " \n",
- " 2 \n",
- " BMW X5 \n",
- " 57000 \n",
- " 26100 \n",
- " 5 \n",
- " \n",
- " \n",
- " 3 \n",
- " BMW X5 \n",
- " 22500 \n",
- " 40000 \n",
- " 2 \n",
- " \n",
- " \n",
- " 4 \n",
- " BMW X5 \n",
- " 46000 \n",
- " 31500 \n",
- " 4 \n",
- " \n",
- " \n",
- " 5 \n",
- " Audi A5 \n",
- " 59000 \n",
- " 29400 \n",
- " 5 \n",
- " \n",
- " \n",
- " 6 \n",
- " Audi A5 \n",
- " 52000 \n",
- " 32000 \n",
- " 5 \n",
- " \n",
- " \n",
- " 7 \n",
- " Audi A5 \n",
- " 72000 \n",
- " 19300 \n",
- " 6 \n",
- " \n",
- " \n",
- " 8 \n",
- " Audi A5 \n",
- " 91000 \n",
- " 12000 \n",
- " 8 \n",
- " \n",
- " \n",
- " 9 \n",
- " Mercedez Benz C class \n",
- " 67000 \n",
- " 22000 \n",
- " 6 \n",
- " \n",
- " \n",
- " 10 \n",
- " Mercedez Benz C class \n",
- " 83000 \n",
- " 20000 \n",
- " 7 \n",
- " \n",
- " \n",
- " 11 \n",
- " Mercedez Benz C class \n",
- " 79000 \n",
- " 21000 \n",
- " 7 \n",
- " \n",
- " \n",
- " 12 \n",
- " Mercedez Benz C class \n",
- " 59000 \n",
- " 33000 \n",
- " 5 \n",
- " \n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " Car Model Mileage Sell Price($) Age(yrs)\n",
- "0 BMW X5 69000 18000 6\n",
- "1 BMW X5 35000 34000 3\n",
- "2 BMW X5 57000 26100 5\n",
- "3 BMW X5 22500 40000 2\n",
- "4 BMW X5 46000 31500 4\n",
- "5 Audi A5 59000 29400 5\n",
- "6 Audi A5 52000 32000 5\n",
- "7 Audi A5 72000 19300 6\n",
- "8 Audi A5 91000 12000 8\n",
- "9 Mercedez Benz C class 67000 22000 6\n",
- "10 Mercedez Benz C class 83000 20000 7\n",
- "11 Mercedez Benz C class 79000 21000 7\n",
- "12 Mercedez Benz C class 59000 33000 5"
- ]
- },
- "execution_count": 1,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import pandas as pd\n",
- "df = pd.read_csv(\"carprices.csv\")\n",
- "df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " Audi A5 \n",
- " BMW X5 \n",
- " Mercedez Benz C class \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 1 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 2 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 3 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 4 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 5 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 6 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 7 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 8 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 9 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 10 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 11 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 12 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " Audi A5 BMW X5 Mercedez Benz C class\n",
- "0 0 1 0\n",
- "1 0 1 0\n",
- "2 0 1 0\n",
- "3 0 1 0\n",
- "4 0 1 0\n",
- "5 1 0 0\n",
- "6 1 0 0\n",
- "7 1 0 0\n",
- "8 1 0 0\n",
- "9 0 0 1\n",
- "10 0 0 1\n",
- "11 0 0 1\n",
- "12 0 0 1"
- ]
- },
- "execution_count": 2,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "dummies = pd.get_dummies(df['Car Model'])\n",
- "dummies"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " Car Model \n",
- " Mileage \n",
- " Sell Price($) \n",
- " Age(yrs) \n",
- " Audi A5 \n",
- " BMW X5 \n",
- " Mercedez Benz C class \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " BMW X5 \n",
- " 69000 \n",
- " 18000 \n",
- " 6 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 1 \n",
- " BMW X5 \n",
- " 35000 \n",
- " 34000 \n",
- " 3 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 2 \n",
- " BMW X5 \n",
- " 57000 \n",
- " 26100 \n",
- " 5 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 3 \n",
- " BMW X5 \n",
- " 22500 \n",
- " 40000 \n",
- " 2 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 4 \n",
- " BMW X5 \n",
- " 46000 \n",
- " 31500 \n",
- " 4 \n",
- " 0 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 5 \n",
- " Audi A5 \n",
- " 59000 \n",
- " 29400 \n",
- " 5 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 6 \n",
- " Audi A5 \n",
- " 52000 \n",
- " 32000 \n",
- " 5 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 7 \n",
- " Audi A5 \n",
- " 72000 \n",
- " 19300 \n",
- " 6 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 8 \n",
- " Audi A5 \n",
- " 91000 \n",
- " 12000 \n",
- " 8 \n",
- " 1 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 9 \n",
- " Mercedez Benz C class \n",
- " 67000 \n",
- " 22000 \n",
- " 6 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 10 \n",
- " Mercedez Benz C class \n",
- " 83000 \n",
- " 20000 \n",
- " 7 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 11 \n",
- " Mercedez Benz C class \n",
- " 79000 \n",
- " 21000 \n",
- " 7 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 12 \n",
- " Mercedez Benz C class \n",
- " 59000 \n",
- " 33000 \n",
- " 5 \n",
- " 0 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " Car Model Mileage Sell Price($) Age(yrs) Audi A5 BMW X5 \\\n",
- "0 BMW X5 69000 18000 6 0 1 \n",
- "1 BMW X5 35000 34000 3 0 1 \n",
- "2 BMW X5 57000 26100 5 0 1 \n",
- "3 BMW X5 22500 40000 2 0 1 \n",
- "4 BMW X5 46000 31500 4 0 1 \n",
- "5 Audi A5 59000 29400 5 1 0 \n",
- "6 Audi A5 52000 32000 5 1 0 \n",
- "7 Audi A5 72000 19300 6 1 0 \n",
- "8 Audi A5 91000 12000 8 1 0 \n",
- "9 Mercedez Benz C class 67000 22000 6 0 0 \n",
- "10 Mercedez Benz C class 83000 20000 7 0 0 \n",
- "11 Mercedez Benz C class 79000 21000 7 0 0 \n",
- "12 Mercedez Benz C class 59000 33000 5 0 0 \n",
- "\n",
- " Mercedez Benz C class \n",
- "0 0 \n",
- "1 0 \n",
- "2 0 \n",
- "3 0 \n",
- "4 0 \n",
- "5 0 \n",
- "6 0 \n",
- "7 0 \n",
- "8 0 \n",
- "9 1 \n",
- "10 1 \n",
- "11 1 \n",
- "12 1 "
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "merged = pd.concat([df,dummies],axis='columns')\n",
- "merged"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " Mileage \n",
- " Sell Price($) \n",
- " Age(yrs) \n",
- " Audi A5 \n",
- " BMW X5 \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 69000 \n",
- " 18000 \n",
- " 6 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 1 \n",
- " 35000 \n",
- " 34000 \n",
- " 3 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 2 \n",
- " 57000 \n",
- " 26100 \n",
- " 5 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 3 \n",
- " 22500 \n",
- " 40000 \n",
- " 2 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 4 \n",
- " 46000 \n",
- " 31500 \n",
- " 4 \n",
- " 0 \n",
- " 1 \n",
- " \n",
- " \n",
- " 5 \n",
- " 59000 \n",
- " 29400 \n",
- " 5 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 6 \n",
- " 52000 \n",
- " 32000 \n",
- " 5 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 7 \n",
- " 72000 \n",
- " 19300 \n",
- " 6 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 8 \n",
- " 91000 \n",
- " 12000 \n",
- " 8 \n",
- " 1 \n",
- " 0 \n",
- " \n",
- " \n",
- " 9 \n",
- " 67000 \n",
- " 22000 \n",
- " 6 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 10 \n",
- " 83000 \n",
- " 20000 \n",
- " 7 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 11 \n",
- " 79000 \n",
- " 21000 \n",
- " 7 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- " 12 \n",
- " 59000 \n",
- " 33000 \n",
- " 5 \n",
- " 0 \n",
- " 0 \n",
- " \n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " Mileage Sell Price($) Age(yrs) Audi A5 BMW X5\n",
- "0 69000 18000 6 0 1\n",
- "1 35000 34000 3 0 1\n",
- "2 57000 26100 5 0 1\n",
- "3 22500 40000 2 0 1\n",
- "4 46000 31500 4 0 1\n",
- "5 59000 29400 5 1 0\n",
- "6 52000 32000 5 1 0\n",
- "7 72000 19300 6 1 0\n",
- "8 91000 12000 8 1 0\n",
- "9 67000 22000 6 0 0\n",
- "10 83000 20000 7 0 0\n",
- "11 79000 21000 7 0 0\n",
- "12 59000 33000 5 0 0"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "final = merged.drop([\"Car Model\",\"Mercedez Benz C class\"],axis='columns')\n",
- "final"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " Mileage Age(yrs) Audi A5 BMW X5\n",
- "0 69000 6 0 1\n",
- "1 35000 3 0 1\n",
- "2 57000 5 0 1\n",
- "3 22500 2 0 1\n",
- "4 46000 4 0 1\n",
- "5 59000 5 1 0\n",
- "6 52000 5 1 0\n",
- "7 72000 6 1 0\n",
- "8 91000 8 1 0\n",
- "9 67000 6 0 0\n",
- "10 83000 7 0 0\n",
- "11 79000 7 0 0\n",
- "12 59000 5 0 0"
- ]
- },
- "execution_count": 6,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "X = final.drop('Sell Price($)',axis='columns')\n",
- "X"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 18000\n",
- "1 34000\n",
- "2 26100\n",
- "3 40000\n",
- "4 31500\n",
- "5 29400\n",
- "6 32000\n",
- "7 19300\n",
- "8 12000\n",
- "9 22000\n",
- "10 20000\n",
- "11 21000\n",
- "12 33000\n",
- "Name: Sell Price($), dtype: int64"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "y = final['Sell Price($)']\n",
- "y"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "from sklearn.linear_model import LinearRegression\n",
- "model = LinearRegression()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "model.fit(X,y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0.94170509372810818"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "model.score(X,y)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Price of mercedez benz that is 4 yr old with mileage 45000**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([ 36991.31721061])"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "model.predict([[45000,4,0,0]])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Price of BMW X5 that is 7 yr old with mileage 86000**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([ 11080.74313219])"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "model.predict([[86000,7,0,1]])"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.1"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/ML/5_one_hot_encoding/Exercise/solution.ipynb b/ML/5_one_hot_encoding/Exercise/solution.ipynb
new file mode 100644
index 00000000..1fdc3687
--- /dev/null
+++ b/ML/5_one_hot_encoding/Exercise/solution.ipynb
@@ -0,0 +1,2709 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "090478b0",
+ "metadata": {},
+ "source": [
+ "## Using the get_dummies method"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "74161224",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "from sklearn.linear_model import LinearRegression\n",
+ "from sklearn.preprocessing import OneHotEncoder\n",
+ "from sklearn.preprocessing import LabelEncoder\n",
+ "from sklearn.compose import ColumnTransformer"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "de8f1462",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Car Model Mileage Sell Price($) Age(yrs)\n",
+ "0 BMW X5 69000 18000 6\n",
+ "1 BMW X5 35000 34000 3\n",
+ "2 BMW X5 57000 26100 5\n",
+ "3 BMW X5 22500 40000 2\n",
+ "4 BMW X5 46000 31500 4\n",
+ "5 Audi A5 59000 29400 5\n",
+ "6 Audi A5 52000 32000 5\n",
+ "7 Audi A5 72000 19300 6\n",
+ "8 Audi A5 91000 12000 8\n",
+ "9 Mercedez Benz C class 67000 22000 6\n",
+ "10 Mercedez Benz C class 83000 20000 7\n",
+ "11 Mercedez Benz C class 79000 21000 7\n",
+ "12 Mercedez Benz C class 59000 33000 5"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv(\"carprices.csv\")\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "dad124e4",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Audi A5 BMW X5 Mercedez Benz C class\n",
+ "0 False True False\n",
+ "1 False True False\n",
+ "2 False True False\n",
+ "3 False True False\n",
+ "4 False True False\n",
+ "5 True False False\n",
+ "6 True False False\n",
+ "7 True False False\n",
+ "8 True False False\n",
+ "9 False False True\n",
+ "10 False False True\n",
+ "11 False False True\n",
+ "12 False False True"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dummies = pd.get_dummies(df[\"Car Model\"])\n",
+ "dummies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "eeadb9dd",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Car Model Mileage Sell Price($) Age(yrs) Audi A5 BMW X5 \\\n",
+ "0 BMW X5 69000 18000 6 False True \n",
+ "1 BMW X5 35000 34000 3 False True \n",
+ "2 BMW X5 57000 26100 5 False True \n",
+ "3 BMW X5 22500 40000 2 False True \n",
+ "4 BMW X5 46000 31500 4 False True \n",
+ "5 Audi A5 59000 29400 5 True False \n",
+ "6 Audi A5 52000 32000 5 True False \n",
+ "7 Audi A5 72000 19300 6 True False \n",
+ "8 Audi A5 91000 12000 8 True False \n",
+ "9 Mercedez Benz C class 67000 22000 6 False False \n",
+ "10 Mercedez Benz C class 83000 20000 7 False False \n",
+ "11 Mercedez Benz C class 79000 21000 7 False False \n",
+ "12 Mercedez Benz C class 59000 33000 5 False False \n",
+ "\n",
+ " Mercedez Benz C class \n",
+ "0 False \n",
+ "1 False \n",
+ "2 False \n",
+ "3 False \n",
+ "4 False \n",
+ "5 False \n",
+ "6 False \n",
+ "7 False \n",
+ "8 False \n",
+ "9 True \n",
+ "10 True \n",
+ "11 True \n",
+ "12 True "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "merged = pd.concat([df, dummies], axis=\"columns\")\n",
+ "merged"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "d534beb3",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Mileage Sell Price($) Age(yrs) BMW X5 Mercedez Benz C class\n",
+ "0 69000 18000 6 True False\n",
+ "1 35000 34000 3 True False\n",
+ "2 57000 26100 5 True False\n",
+ "3 22500 40000 2 True False\n",
+ "4 46000 31500 4 True False\n",
+ "5 59000 29400 5 False False\n",
+ "6 52000 32000 5 False False\n",
+ "7 72000 19300 6 False False\n",
+ "8 91000 12000 8 False False\n",
+ "9 67000 22000 6 False True\n",
+ "10 83000 20000 7 False True\n",
+ "11 79000 21000 7 False True\n",
+ "12 59000 33000 5 False True"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
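+ "# Drop the original text column and one dummy column (Audi A5) to avoid the dummy variable trap\n",
+ "final = merged.drop([\"Car Model\", \"Audi A5\"], axis=\"columns\")\n",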
+ "final"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "7edda3d2",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "LinearRegression()"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x = final.drop([\"Sell Price($)\"], axis=\"columns\")\n",
+ "y = final[\"Sell Price($)\"]\n",
+ "\n",
+ "model = LinearRegression()\n",
+ "model.fit(x, y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "2ed408a8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\utils\\validation.py:2691: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([36991.31721061])"
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
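+ "# Mercedez Benz C class, 4 years old, mileage 45000 (columns: Mileage, Age(yrs), BMW X5, Mercedez Benz C class)\n",
+ "model.predict([[45000, 4, 0, 1]])"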
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "bb3aa9fa",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\HC\\AppData\\Local\\Programs\\Python\\Python314\\Lib\\site-packages\\sklearn\\utils\\validation.py:2691: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([11080.74313219])"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# BMW X5, 7 years old, mileage 86000\n",
+ "model.predict([[86000, 7, 1, 0]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "4247e464",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9417050937281082"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.score(x,y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "94d96c39",
+ "metadata": {},
+ "source": [
+ "## Using the One Hot Encoder"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "3fb4ea2f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Car Model Mileage Sell Price($) Age(yrs)\n",
+ "0 BMW X5 69000 18000 6\n",
+ "1 BMW X5 35000 34000 3\n",
+ "2 BMW X5 57000 26100 5\n",
+ "3 BMW X5 22500 40000 2\n",
+ "4 BMW X5 46000 31500 4\n",
+ "5 Audi A5 59000 29400 5\n",
+ "6 Audi A5 52000 32000 5\n",
+ "7 Audi A5 72000 19300 6\n",
+ "8 Audi A5 91000 12000 8\n",
+ "9 Mercedez Benz C class 67000 22000 6\n",
+ "10 Mercedez Benz C class 83000 20000 7\n",
+ "11 Mercedez Benz C class 79000 21000 7\n",
+ "12 Mercedez Benz C class 59000 33000 5"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "ef406607",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1.00e+00, 0.00e+00, 6.90e+04, 1.80e+04, 6.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 3.50e+04, 3.40e+04, 3.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 5.70e+04, 2.61e+04, 5.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 2.25e+04, 4.00e+04, 2.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 4.60e+04, 3.15e+04, 4.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 5.90e+04, 2.94e+04, 5.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 5.20e+04, 3.20e+04, 5.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 7.20e+04, 1.93e+04, 6.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 9.10e+04, 1.20e+04, 8.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 6.70e+04, 2.20e+04, 6.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 8.30e+04, 2.00e+04, 7.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 7.90e+04, 2.10e+04, 7.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 5.90e+04, 3.30e+04, 5.00e+00]])"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
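+ "# One-hot encode column 0 (Car Model); drop=\"first\" removes the first category (Audi A5) to avoid the dummy variable trap\n",
+ "ct = ColumnTransformer([(\"Model\", OneHotEncoder(drop=\"first\"), [0])], remainder=\"passthrough\")\n",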
+ "x = ct.fit_transform(df)\n",
+ "x"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "00912847",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[1.00e+00, 0.00e+00, 6.90e+04, 6.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 3.50e+04, 3.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 5.70e+04, 5.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 2.25e+04, 2.00e+00],\n",
+ " [1.00e+00, 0.00e+00, 4.60e+04, 4.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 5.90e+04, 5.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 5.20e+04, 5.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 7.20e+04, 6.00e+00],\n",
+ " [0.00e+00, 0.00e+00, 9.10e+04, 8.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 6.70e+04, 6.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 8.30e+04, 7.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 7.90e+04, 7.00e+00],\n",
+ " [0.00e+00, 1.00e+00, 5.90e+04, 5.00e+00]])"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Keep the two dummy columns, Mileage and Age(yrs); column 3 is Sell Price($), the target\n",
+ "x = x[:, [0, 1, 2, 4]]\n",
+ "x"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "801d0d49",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 18000\n",
+ "1 34000\n",
+ "2 26100\n",
+ "3 40000\n",
+ "4 31500\n",
+ "5 29400\n",
+ "6 32000\n",
+ "7 19300\n",
+ "8 12000\n",
+ "9 22000\n",
+ "10 20000\n",
+ "11 21000\n",
+ "12 33000\n",
+ "Name: Sell Price($), dtype: int64"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "y = df[\"Sell Price($)\"]\n",
+ "y"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "6977136d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "LinearRegression()"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "modelohe = LinearRegression()\n",
+ "modelohe.fit(x,y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "8c0389b9",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
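+ "array([36991.31721061])"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Mercedez Benz C class, 4 years old, mileage 45000 (columns: BMW X5, Mercedez Benz C class, Mileage, Age(yrs))\n",
+ "modelohe.predict([[0,1,45000,4]])"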
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "594e81ba",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([11080.74313219])"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# BMW X5, 7 years old, mileage 86000\n",
+ "modelohe.predict([[1,0,86000,7]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "d67ff510",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9417050937281082"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "modelohe.score(x,y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07bbf366",
+ "metadata": {},
+ "source": [
+ "Answers Matched😎 "
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/8_logistic_reg_multiclass/Exercise/Solution.ipynb b/ML/8_logistic_reg_multiclass/Exercise/Solution.ipynb
new file mode 100644
index 00000000..a109a67d
--- /dev/null
+++ b/ML/8_logistic_reg_multiclass/Exercise/Solution.ipynb
@@ -0,0 +1,1270 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "0c112487",
+ "metadata": {},
+ "source": [
+ "### Imports and data loading"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "31beb396",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import seaborn as sns\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.datasets import load_iris\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import confusion_matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "b5650cef",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "iris = load_iris()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4436209e",
+ "metadata": {},
+ "source": [
+ "### Exploring the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "9545fd69",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['DESCR',\n",
+ " 'data',\n",
+ " 'data_module',\n",
+ " 'feature_names',\n",
+ " 'filename',\n",
+ " 'frame',\n",
+ " 'target',\n",
+ " 'target_names']"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dir(iris)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "c69ae022",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([5.1, 3.5, 1.4, 0.2])"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "iris.data[0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "31b4e2b8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['sepal length (cm)',\n",
+ " 'sepal width (cm)',\n",
+ " 'petal length (cm)',\n",
+ " 'petal width (cm)']"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "iris.feature_names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "7cacc24b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'iris.csv'"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "iris.filename"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e7d5f3f3",
+ "metadata": {},
+ "source": [
+ "`iris.filename` only names the bundled CSV file, so it is of no use for modeling."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "1c8b6725",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+ " 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
+ " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
+ " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "iris.target"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "c89ad2b4",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array(['setosa', 'versicolor', 'virginica'], dtype='<U10')"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "iris.target_names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "LogisticRegression()"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.8)\n",
+ "\n",
+ "model = LogisticRegression()\n",
+ "model.fit(x_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "d8098877",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.9666666666666667"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.score(x_test, y_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "49aea81d",
+ "metadata": {},
+ "source": [
+ "### Confusion Matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "b54f661d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y_predicted = model.predict(x_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "2390b059",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "cm = confusion_matrix(y_test, y_predicted)\n",
+ "sns.heatmap(cm, annot=True)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/ML/9_decision_tree/Exercise/Solution_for_understanding.ipynb b/ML/9_decision_tree/Exercise/Solution_for_understanding.ipynb
new file mode 100644
index 00000000..f1443e60
--- /dev/null
+++ b/ML/9_decision_tree/Exercise/Solution_for_understanding.ipynb
@@ -0,0 +1,1745 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "9e1bc565",
+ "metadata": {},
+ "source": [
+ "# Titanic Survival Prediction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e87270cb",
+ "metadata": {},
+ "source": [
+ "## Imports and Data Loading"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "0f3b85cf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from sklearn.compose import ColumnTransformer\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.preprocessing import OneHotEncoder\n",
+ "import seaborn as sns\n",
+ "from sklearn.tree import DecisionTreeClassifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "5e270d2b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " PassengerId Survived Pclass \\\n",
+ "0 1 0 3 \n",
+ "1 2 1 1 \n",
+ "2 3 1 3 \n",
+ "3 4 1 1 \n",
+ "4 5 0 3 \n",
+ "\n",
+ " Name Sex Age SibSp \\\n",
+ "0 Braund, Mr. Owen Harris male 22.0 1 \n",
+ "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
+ "2 Heikkinen, Miss. Laina female 26.0 0 \n",
+ "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
+ "4 Allen, Mr. William Henry male 35.0 0 \n",
+ "\n",
+ " Parch Ticket Fare Cabin Embarked \n",
+ "0 0 A/5 21171 7.2500 NaN S \n",
+ "1 0 PC 17599 71.2833 C85 C \n",
+ "2 0 STON/O2. 3101282 7.9250 NaN S \n",
+ "3 0 113803 53.1000 C123 S \n",
+ "4 0 373450 8.0500 NaN S "
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv(\"titanic.csv\")\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "963be91a",
+ "metadata": {},
+ "source": [
+ "## Preprocessing and Preparing x and y"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "0b3c8359",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Pclass Sex Age Fare\n",
+ "0 3 male 22.0 7.2500\n",
+ "1 1 female 38.0 71.2833\n",
+ "2 3 female 26.0 7.9250\n",
+ "3 1 female 35.0 53.1000\n",
+ "4 3 male 35.0 8.0500"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x = df[[\"Pclass\",\"Sex\",\"Age\",\"Fare\"]].copy()  # copy so later column assignments do not act on a slice of df\n",
+ "x.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "a7062760",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Pclass Age Fare\n",
+ "count 891.000000 714.000000 891.000000\n",
+ "mean 2.308642 29.699118 32.204208\n",
+ "std 0.836071 14.526497 49.693429\n",
+ "min 1.000000 0.420000 0.000000\n",
+ "25% 2.000000 20.125000 7.910400\n",
+ "50% 3.000000 28.000000 14.454200\n",
+ "75% 3.000000 38.000000 31.000000\n",
+ "max 3.000000 80.000000 512.329200"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x.describe()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "70079661",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(891, 4)"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "ff1db4db",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x.Sex.count()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0fa55a7f",
+ "metadata": {},
+ "source": [
+ "The Age column has missing values (only 714 of 891 rows are non-null), so we need to fill them with a suitable value."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "61207335",
+ "metadata": {},
+ "source": [
+ "### Choosing a value to fill the missing Age entries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "5c5d7166",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sns.histplot(x.Age, kde=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d903de8b",
+ "metadata": {},
+ "source": [
+ "The Age distribution looks roughly normal, so filling the missing values with the mean is reasonable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "9d6f439e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Pclass Age Fare\n",
+ "count 891.000000 891.000000 891.000000\n",
+ "mean 2.308642 29.699118 32.204208\n",
+ "std 0.836071 13.002015 49.693429\n",
+ "min 1.000000 0.420000 0.000000\n",
+ "25% 2.000000 22.000000 7.910400\n",
+ "50% 3.000000 29.699118 14.454200\n",
+ "75% 3.000000 35.000000 31.000000\n",
+ "max 3.000000 80.000000 512.329200"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "x.Age = x.Age.fillna(x.Age.mean())\n",
+ "x.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7de16e02",
+ "metadata": {},
+ "source": [
+ "### Encoding the Sex column"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "cb332b65",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[ 1. , 3. , 22. , 7.25 ],\n",
+ " [ 0. , 1. , 38. , 71.2833 ],\n",
+ " [ 0. , 3. , 26. , 7.925 ],\n",
+ " ...,\n",
+ " [ 0. , 3. , 29.69911765, 23.45 ],\n",
+ " [ 1. , 1. , 26. , 30. ],\n",
+ " [ 1. , 3. , 32. , 7.75 ]],\n",
+ " shape=(891, 4))"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "y = df.Survived\n",
+ "\n",
+ "# Encode Sex with OneHotEncoder; the older practice of label-encoding the column first is no longer needed\n",
+ "ct = ColumnTransformer([(\"Sex\", OneHotEncoder(drop=\"first\"), [1])], remainder=\"passthrough\")\n",
+ "x = ct.fit_transform(x)\n",
+ "x"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "30d65fee",
+ "metadata": {},
+ "source": [
+ "### Making a dataframe with column names"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "19a273c3",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Sex__Sex_male remainder__Pclass remainder__Age remainder__Fare\n",
+ "0 1.0 3.0 22.0 7.2500\n",
+ "1 0.0 1.0 38.0 71.2833\n",
+ "2 0.0 3.0 26.0 7.9250\n",
+ "3 0.0 1.0 35.0 53.1000\n",
+ "4 1.0 3.0 35.0 8.0500"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# To see the column names, rebuild a DataFrame using the transformer's output feature names\n",
+ "feature_columns = ct.get_feature_names_out()\n",
+ "features = pd.DataFrame(x, columns=feature_columns)\n",
+ "features.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5f0da0f",
+ "metadata": {},
+ "source": [
+ "## Training the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f97f5b34",
+ "metadata": {},
+ "source": [
+ "### Train Test Split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "e830f752",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "255a1893",
+ "metadata": {},
+ "source": [
+ "### Model fitting"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "199b2197",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DecisionTreeClassifier()"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model = DecisionTreeClassifier()\n",
+ "model.fit(x_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b883285",
+ "metadata": {},
+ "source": [
+ "### Checking the Score"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "89e3e334",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.8324022346368715"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.score(x_test, y_test)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}