\documentclass{article}
% Input Encoding and LaTeX formatting
\usepackage[utf8]{inputenc}
\usepackage[parfill]{parskip}
% General Formatting
\usepackage{titlesec}
\usepackage{framed}
\usepackage[title,titletoc,toc]{appendix}
\usepackage[hidelinks]{hyperref}
% Math Shenanigans
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsmath}
\usepackage{amsthm}
% For graphics
\usepackage{graphicx}
\usepackage{placeins}
\usepackage{float}
% For macros
\usepackage{ifthen}
% For including code
\usepackage{listings}
\usepackage[noend]{algpseudocode}
% For rendering diagrams
\usepackage{tikz}
\usetikzlibrary{arrows}
\usepackage{pgfplots}
\pgfplotsset{compat=newest}
\usetikzlibrary{patterns}
\usepackage{complexity}
\newtheorem{thm}{Theorem}
\newtheorem*{lemma}{Lemma}
\DeclarePairedDelimiter\abs{\lvert}{\rvert}
\newcommand{\LL}{\ensuremath{\mathcal{L}}}
\newcommand{\ee}{\ensuremath{\epsilon}}
% Redefine section
\titleformat{\section}{\normalfont\Large\bfseries}{\thesection. }{0em}{}[{\titlerule[0.5pt]}]
% parskip sets \topsep to 0. Fixes vertical spacing with amsthm.
\makeatletter
\def\thm@space@setup{%
\thm@preskip=\parskip \thm@postskip=0pt
}
\makeatother
\graphicspath{{images/}}
\title{CIS 625: Introduction to Computational Learning Theory \\ Lecture Notes}
\author{JT Cho, Trevin Gandhi, and Brady Neal \\
Lectures by Professor Michael Kearns}
\date{}
\begin{document}
\maketitle
\tableofcontents
\clearpage
\section{A Theoretical Model for Classification}
We define $X$, $\mathcal{D}$, $\mathcal{C}$, $c_t$, and EX here:
\begin{itemize}
\item
an input space $X$ (e.g. $X = \mathbb{R}^n$, $X = \{0, 1\}^n$)
\item
unknown and arbitrary distribution $\mathcal{D}$ over $X$
\item
known class $\mathcal{C}$ of binary functions
\begin{itemize}
\item
examples: $c \in \mathcal{C}$, $c : X \rightarrow \{1, -1\}$ or $c : X \rightarrow \{0, 1\}$
\end{itemize}
\item
unknown target function $c_t \in \mathcal{C}$
\item
access to procedure (oracle) EX$(c_t, \mathcal{D})$ that returns $\langle x, c_t(x) \rangle$ pairs where $x \sim \mathcal{D}$
\begin{itemize}
\item
This oracle essentially gives you training data.
\end{itemize}
\item
The goal is not to fit the training data from the oracle well but to learn a function $h \in \mathcal{C}$ that generalizes well. In other words, we'd like to learn a function $h$ that is close to the target function $c_t \in \mathcal{C}$. Notice that in this model, the function we learn and the target function are both in the same class of binary functions.
\item
The learning algorithm is constrained by ``sample efficiency'' and ``computational efficiency.'' These are formalized below.
\end{itemize}
\subsection{Probably Approximately Correct (PAC) Learning Definition}
A class $\mathcal{C}$ is PAC-learnable if: \\
$\exists$ an algorithm $L$ such that \\
$\forall \, c_t \in \mathcal{C}$, $\forall \, \mathcal{D}$ over $X$, $\forall \, \epsilon, \delta > 0$, \\
given $\epsilon$, $\delta$, and EX$(c_t, \mathcal{D})$, $L$ outputs $h \in \mathcal{C}$ that with probability at least $1 - \delta$ (``Probably'') satisfies the following conditions:
\begin{enumerate}
\item Learning Condition (``Approximately Correct''):
\[\Pr_{x \sim \mathcal{D}}[h(x) \neq c_t(x)] \leq \epsilon\]
\item Sample Efficiency:
\[\text{number of calls to EX} (c_t, \mathcal{D}) \text{ is } \poly\left(\frac{1}{\epsilon}, \frac{1}{\delta}, n\right)\]
\item Computational Efficiency:
\[\text{running time of } L \text{ is } \poly\left(\frac{1}{\epsilon}, \frac{1}{\delta}, n\right)\]
Note that with the model of computation where EX takes constant time, this $3$\textsuperscript{rd} condition implies the $2$\textsuperscript{nd} condition.
\end{enumerate}
\subsection{Rectangle Example}
Consider the problem of classifying people of medium build. It might make sense to classify a person as having medium build if they're inside some rectangle. Similarly, all people outside the rectangle would be classified as not having medium build.
\begin{itemize}
\item
Input Space: $X = \mathbb{R}^2$
\item
Known Class of Binary Functions: $\mathcal{C} =$ all axis-aligned (edges parallel to coordinate axes) rectangles in $\mathbb{R}^2$
\end{itemize}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
title=Example Scatter Plot with some Classification Rectangle:,
axis lines=middle,
xmin=0, xmax=10,
ymin=0, ymax=10,
xlabel=height, xlabel near ticks,
ylabel=weight, ylabel near ticks,
xtick=\empty, ytick=\empty,
legend pos=outer north east,
]
\addplot[only marks,blue] plot coordinates {
(3.5, 6.5)
(4, 5)
(5, 4.5)
(5, 7)
(6, 5.5)
(6.7, 7.3)
(7.2, 4.8)
};
\addplot[only marks,red] plot coordinates {
(1, 1.5)
(1.2, 7)
(1.5, 8)
(1.8, 5)
%(2, 2)
(2.5, 3)
(2.5, 9.5)
(4, 2.5)
(4.2, 8.7)
(5.5, 9.8)
(6, 2.7)
(6, 9)
(8, 2)
(8, 9.5)
(9, 3)
(9, 10)
(9.5, 6)
(9.8, 9)
};
\draw (3,4) rectangle (8,8);
\legend{Medium Build, Not Medium Build}
\end{axis}
\end{tikzpicture}
\end{center}
Because the rectangle ($h$) above correctly classifies all the data points, we say $h$ is consistent. More formally, $c \in \mathcal{C}$ is consistent with $S$ if $c(x_i) = b_i$ for all $1 \leq i \leq m$, where $S = \{\langle x_1, b_1 \rangle, \dots, \langle x_m, b_m \rangle\}$ is the training data obtained from the oracle. There are many different consistent $c \in \mathcal{C}$. For example, of the two shown below, one draws the rectangle according to the extremal values of the \emph{positive} examples (left) and the other according to the extremal values of the \emph{negative} examples (right).
\begin{tikzpicture}[scale = .8]
\begin{axis}[
title=Example Scatter Plot with Positive Extremal Values:,
axis lines=middle,
xmin=0, xmax=10,
ymin=0, ymax=10,
xlabel=height, xlabel near ticks,
ylabel=weight, ylabel near ticks,
xtick=\empty, ytick=\empty,
]
\addplot[only marks,blue] plot coordinates {
(3.5, 6.5)
(4, 5)
(5, 4.5)
(5, 7)
(6, 5.5)
(6.7, 7.3)
(7.2, 4.8)
};
\addplot[only marks,red] plot coordinates {
(1, 1.5)
(1.2, 7)
(1.5, 8)
(1.8, 5)
%(2, 2)
(2.5, 3)
(2.5, 9.5)
(4, 2.5)
(4.2, 8.7)
(5.5, 9.8)
(6, 2.7)
(6, 9)
(8, 2)
(8, 9.5)
(9, 3)
(9, 10)
(9.5, 6)
(9.8, 9)
};
\draw (3.5, 4.5) rectangle (7.2, 7.3);
\end{axis}
\end{tikzpicture}
\hspace{1cm}
\begin{tikzpicture}[scale = .8]
\begin{axis}[
title=Example Scatter Plot with Negative Extremal Values:,
axis lines=middle,
xmin=0, xmax=10,
ymin=0, ymax=10,
xlabel=height, xlabel near ticks,
ylabel=weight, ylabel near ticks,
xtick=\empty, ytick=\empty,
]
\addplot[only marks,blue] plot coordinates {
(3.5, 6.5)
(4, 5)
(5, 4.5)
(5, 7)
(6, 5.5)
(6.7, 7.3)
(7.2, 4.8)
};
\addplot[only marks,red] plot coordinates {
(1, 1.5)
(1.2, 7)
(1.5, 8)
(1.8, 5)
%(2, 2)
(2.5, 3)
(2.5, 9.5)
(4, 2.5)
(4.2, 8.7)
(5.5, 9.8)
(6, 2.7)
(6, 9)
(8, 2)
(8, 9.5)
(9, 3)
(9, 10)
(9.5, 6)
(9.8, 9)
};
\draw (2.5, 3) rectangle (9, 8.7);
\end{axis}
\end{tikzpicture}
Although these two rectangles are consistent, they might not generalize as well to data not in $S$ because they tightly fit either the positive training examples (left) or the negative training examples (right).
Let's consider the rectangle on the left, the one that is drawn tightly around the extremal values of the positive training examples. Let that rectangle be the function we have learned so far: $h$. It is the solid rectangle in the graph below. The dotted rectangle is the target function $c_t$.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
xmin=0, xmax=10,
ymin=0, ymax=10,
xlabel=height, xlabel near ticks,
ylabel=weight, ylabel near ticks,
xtick=\empty, ytick=\empty,
]
\draw (3.5, 4.5) rectangle (7.2, 7.3);
\draw[dashed] (2.5, 3.5) rectangle (8.2, 8.3);
\end{axis}
\end{tikzpicture}
\end{center}
We want to bound the probability that $h$ differs from $c_t$. To do this, first consider just the topmost strip between the edge of $h$ and the edge of $c_t$. We'll call this region $T$.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
hide axis,
xmin=0, xmax=10,
ymin=0, ymax=10,
xtick=\empty, ytick=\empty,
]
\draw (3.5, 4.5) rectangle (7.2, 7.3);
\draw[dashed] (2.5, 3.5) rectangle (8.2, 8.3);
\fill[pattern=north east lines, pattern color=gray] (2.5, 7.3) rectangle (8.2, 8.3);
\end{axis}
\end{tikzpicture}
\end{center}
We can bound the size of $T$. Suppose the probability mass of $T$ under $\mathcal{D}$ is $\frac{\epsilon}{4}$ (this choice will make sense soon). Then, when sampling a single example $x_i$ from EX, we know the following:
\[\Pr_{x_i \sim \mathcal{D}}[x_i \text{ not in } T] = 1 - \frac{\epsilon}{4}\]
Since the examples are drawn independently, for $m$ examples from EX, we know:
\[\Pr[\text{all of } x_1 \dots x_m \text{ are not in } T] = \left(1 - \frac{\epsilon}{4}\right)^m\]
This is important: the probability that the strip $T$ contributes error (i.e., that no example ever lands in $T$) decays exponentially in the number of training examples. Now that we've examined a single strip, let's look at the whole region that is outside of $h$ and inside of $c_t$. Let this region be $B$.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
hide axis,
xmin=0, xmax=10,
ymin=0, ymax=10,
xtick=\empty, ytick=\empty,
]
\draw (3.5, 4.5) rectangle (7.2, 7.3);
\draw[dashed] (2.5, 3.5) rectangle (8.2, 8.3);
\fill[pattern=north east lines, pattern color=gray] (2.5, 3.5) rectangle (8.2,8.3);
\fill[white] (3.5, 4.5) rectangle (7.2, 7.3);
\end{axis}
\end{tikzpicture}
\end{center}
If each of the four strips has probability mass $\frac{\epsilon}{4}$, we can use the union bound to bound the probability that some strip receives no example:
\[\Pr[\text{some strip of } B \text{ contains none of } x_1 \dots x_m] \leq 4 \left(1 - \frac{\epsilon}{4}\right)^m\]
We have been making the shaded rectangular border $B$ as small as necessary to make $h$ close to $c_t$. This is the learning condition, or the ``Approximately Correct'' (AC) part of PAC learning. To get the ``Probably'' (P) part of PAC learning, we need to bound this probability:
\[4 \left(1 - \frac{\epsilon}{4}\right)^m \leq \delta\]
Using the inequality $1 - x \leq e^{-x}$, we have:
\[4 \left(1 - \frac{\epsilon}{4}\right)^m \leq 4e^{-\frac{\epsilon}{4} m}\]
so it suffices to require:
\[4e^{-\frac{\epsilon}{4} m} \leq \delta\]
\[\frac{-\epsilon}{4} m \leq \ln \frac{\delta}{4}\]
\[m \geq \frac{4}{\epsilon} \ln \frac{4}{\delta}\]
Because $m$ is polynomial in $\frac{1}{\epsilon}$ and logarithmic in $\frac{1}{\delta}$, we have satisfied the Sample Efficiency condition. Although we did this analysis for 2D, the number of ``strips" grows linearly with the number of dimensions, so that's not a problem. Also, the analysis doesn't use the distribution $\mathcal{D}$ at all. It only uses independence of training examples.
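The tightest-fit learner and the sample bound derived above can be sketched in Python. This is a hedged illustration only: the function names are our own, and the learner is the positive-extremal-fit rule from the left plot, not anything prescribed by the notes.

```python
import math

def learn_rectangle(samples):
    """Tightest-fit hypothesis: the smallest axis-aligned rectangle
    enclosing all positive examples (the left plot above)."""
    pos = [x for x, label in samples if label == 1]
    if not pos:
        return None  # no positive examples: predict negative everywhere
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    """Classify a point as positive iff it lies inside the rectangle."""
    if rect is None:
        return 0
    x0, x1, y0, y1 = rect
    return int(x0 <= point[0] <= x1 and y0 <= point[1] <= y1)

def sample_bound(epsilon, delta):
    """m >= (4/eps) * ln(4/delta), from the four-strip argument."""
    return math.ceil(4.0 / epsilon * math.log(4.0 / delta))
```

For instance, $\epsilon = \delta = 0.1$ requires only a few hundred examples, and the bound is independent of the distribution $\mathcal{D}$, exactly as the analysis promises.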
\section{PAC Learning Model}
Let's first review the definition of a PAC-learnable class of functions:
Given the instance space $\mathcal{X}$, we say that a \textbf{concept}
$c$ is a subset $c \subseteq \mathcal{X}$. A concept may be considered a
set of all positive instances that satisfy a particular rule. Therefore,
we can write $c : \mathcal{X} \rightarrow \{0, 1\}$. A
\textbf{concept class} is a collection of concepts over $\mathcal{X}$.
Given a model's hypothesis $h(x)$ and the target concept $c(x)$, we
define the \textbf{general error} or \textbf{true error} of the model as
follows:
$$err_D(h) = P_{x\sim D}(h(x) \neq c(x))$$
The true error is the probability that the hypothesis
misclassifies an arbitrary data point drawn from $\mathcal{D}$. That is, it
characterizes how well our hypothesis performs on unseen data, not
just the training set.
\emph{PAC-learnability.} A concept class $\mathcal{C}$ is PAC-learnable
if $\exists$ a learning algorithm
$\mathcal{L}$ such that $\forall c_t \in \mathcal{C}$, $\forall$
distributions $\mathcal{D}$ over $\mathcal{X}$, and
$\forall \ee, \delta > 0$, and given some $\ee$, $\delta$, and
an oracle $EX(c_t, \mathcal{D})$, $\mathcal{L}$ outputs an
$h \in \mathcal{C}$ such that with probability at least $1-\delta$ it
satisfies the following conditions:
\begin{enumerate}
\item (Learning Condition) $err_D(h) \leq
\ee$. That is, the probability of a misclassification is
at most $\ee$.
\item (Sample Efficiency) The number of calls to $EX(c, D)$ is
polynomial with respect to $\frac{1}{\ee}$,
$\frac{1}{\delta}$, and $n$.
\item (Computational Efficiency) The running time is
polynomial with respect to $\frac{1}{\ee}$,
$\frac{1}{\delta}$, and $n$.
\end{enumerate}
\emph{Consistency.} Given labeled
samples $S = \{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \dots, \langle x_m, y_m \rangle \mid y_i = c(x_i)\
\forall\, 1 \leq i \leq m\}$, we say some $h \in \mathcal{C}$ is
\textbf{consistent} with $S$ if $\forall\, 1 \leq i \leq m$,
$h(x_i) = y_i$. That is, a function $h \in \mathcal{C}$ is consistent
with our sample data if it correctly classifies every sample in $S$.
\section{Generalizing the Sample Bound}
Suppose, then, that we have a learning algorithm $\mathcal{L}$ that always
finds a hypothesis $h \in \mathcal{C}$ consistent with a sample $S$,
where $\mathcal{C}$ is a finite concept class. We can upper bound the
probability that the general error of $h$ exceeds some $\epsilon$
and use this bound to compute a prudent choice of $m = |S|$. We can then say that
the learning algorithm $\mathcal{L}$ satisfies the learning condition from
our PAC model.
\emph{$\epsilon$-badness.} We say that some hypothesis $h \in \mathcal{C}$
is $\epsilon$-bad if $err_D(h) > \epsilon$. Inversely, $h$ is
$\epsilon$-good if $err_D(h) \leq \epsilon$.
This means that for an $\epsilon$-bad hypothesis $h$ and a sample $S$ of
$m$ points drawn from the distribution $\mathcal{D}$, the probability
that $h$ is consistent over $S$ is
$$P_{S \sim \mathcal{D}, c}(\text{$\epsilon$-bad $h$ consistent with $S$}) \leq
(1-\epsilon)^m$$
What if we wanted to find the same probability, except \textit{for some}
$h \in \mathcal{C}$? That is, what if we wanted to find the probability
that some $h \in \mathcal{C}$ is consistent with $S$ and $\epsilon$-bad?
We could assume that the probability that each $h$ is consistent with
$S$ is independent for the purpose of defining an upper bound. Thus we
can say that
\[ P_{S \sim D, c} (\text{$\exists\;\epsilon$-bad $h$ that is consistent
with $S$}) \leq \abs{\mathcal{C}}(1-\epsilon)^m \] by using a union bound.
Now we want to show that this probability is less than some $\delta$ so
that we can meet the guarantee that with probability at least $1-\delta$
our algorithm doesn't output an $\epsilon$-bad function.
\begin{gather*}
|\mathcal{C}|(1-\epsilon)^m \leq |\mathcal{C}|e^{-\epsilon m} \leq \delta\\
\implies m \geq \frac{c_0}{\epsilon}\ln\left(\frac{\abs{\mathcal{C}}}{\delta}\right)
\end{gather*}
where $c_0$ is some constant. Thus, we have shown that if we have
$m \geq \frac{c_0}{\epsilon} \ln(\frac{\abs{\mathcal{C}}}{\delta})$
samples, the learning algorithm $\mathcal{L}$ satisfies the learning
condition.
Then, given $m$ examples, the error we can guarantee, with
probability at least $1-\delta$, is
$$err_D(h) \leq \frac{c_0}{m}\ln\left(\frac{|\mathcal{C}|}{\delta}\right)$$
We can formalize this as a theorem:
\begin{framed}
\begin{thm}
Let $\mathcal{C}$ be a finite concept class defined on the domain
$X$. Let $\mathcal{L}$ be an algorithm such that for any $m$ and for
any $c \in \mathcal{C}$, if $S$ is a labeled training set
$\{\langle x_i, c(x_i) \rangle\}_{i=1}^m$, then the hypothesis $h$
generated by $\mathcal{L}(S)$ is consistent with $S$. Then, with
sample complexity
$m \geq \frac{c_0}{\epsilon}\ln(\frac{\abs{\mathcal{C}}}{\delta})$,
$\mathcal{L}$ satisfies the learning condition with
probability $1-\delta$.
\end{thm}
\end{framed}
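The theorem is easy to evaluate numerically. The sketch below is our own illustration: the constant $c_0$ is a free parameter of the bound (taken as $1$ here, which is what the $1 - x \leq e^{-x}$ derivation yields), and the function names are not from the notes.

```python
import math

def finite_class_sample_bound(class_size, epsilon, delta, c0=1.0):
    """m >= (c0/eps) * ln(|C|/delta): samples sufficient for any
    consistent learner over a finite class C to output an
    epsilon-good hypothesis with probability at least 1 - delta."""
    return math.ceil(c0 / epsilon * math.log(class_size / delta))

# Boolean conjunctions over n variables have |C| = 3^n, so the
# bound grows only linearly in n (the log eats the exponent).
def conjunction_sample_bound(n, epsilon, delta):
    return finite_class_sample_bound(3 ** n, epsilon, delta)
```

Note that the dependence on $\abs{\mathcal{C}}$ is logarithmic, which is why even very large finite classes remain sample-efficient to learn.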
\section{Application 1: Learning Boolean Conjunctions}
We can define the problem of learning boolean conjunctions. Here, our instance
space is $X = \{0, 1\}^n$, and our class of functions $\mathcal{C}$ are the
conjunctions over $x_1, x_2, \dots, x_n$. For example, one such $c \in \mathcal{C}$
could be $x_3 \wedge \neg x_5 \wedge x_6 \wedge \neg x_{10}$. We can see that
$\mathcal{C}$ is finite, with $\abs{\mathcal{C}} = 3^n$, since for
each $x_i$ we can either include it, include its negation, or not include it.
One simple algorithm to solve this problem is shown below.
\begin{figure}[H]
\begin{framed}
\begin{algorithmic}
\State $h \leftarrow x_1 \wedge \neg x_1 \wedge x_2 \wedge \neg x_2 \wedge \dots \wedge x_n \wedge \neg x_n$
\For{$\langle x, y \rangle \in S$}
\If{$y$ is $false$}
\State ignore this example
\Else
\For{$i = 1, \dots, n$}
\If{$x_i = 1$}
\State delete $\neg x_i$ from $h$
\Else
\State delete $x_i$ from $h$
\EndIf
\EndFor
\EndIf
\EndFor
\end{algorithmic}
\end{framed}
\begin{center}
\emph{Simple learning algorithm for boolean conjunctions.}
\end{center}
\end{figure}
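The elimination algorithm above can be written out concretely. This is a minimal sketch under our own representation choices (a literal is a pair \texttt{(index, negated)} and the hypothesis is a set of literals); none of these names come from the notes.

```python
def learn_conjunction(samples, n):
    """Elimination algorithm: start with all 2n literals and, for each
    positive example, delete every literal that example contradicts.
    Negative examples are ignored entirely."""
    h = {(i, neg) for i in range(n) for neg in (False, True)}
    for x, y in samples:
        if not y:
            continue  # ignore negative examples
        for i in range(n):
            if x[i] == 1:
                h.discard((i, True))   # x_i = 1 contradicts literal not-x_i
            else:
                h.discard((i, False))  # x_i = 0 contradicts literal x_i
    return h

def evaluate(h, x):
    """True iff every remaining literal in h is satisfied by x."""
    return all((x[i] == 0) if neg else (x[i] == 1) for i, neg in h)
```

By construction the output is consistent with the sample, so the finite-class theorem applies with $\abs{\mathcal{C}} = 3^n$.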
While there could be more specific algorithms that, for example, use the
false samples, we can see that this one is consistent with our samples
and thus, by the theorem above, will satisfy the learning condition
given enough samples.
Some things to note about this algorithm:
\begin{itemize}
\item This algorithm works even if all samples are negative:
it never modifies its initial value of $h$, which always
evaluates to false, so $h$ is consistent with the samples.
This still satisfies the learning condition because an
all-negative sample suggests the probability mass of positive
examples under $\mathcal{D}$ is very low.
\item If this were an online learning algorithm, it would make
at most $2n$ mistakes: each mistake occurs on a positive
example and deletes at least one contradicting literal from
the initial $2n$.
\end{itemize}
\section{Application 2: Learning Decision Lists}
Like before, we have $X = \{0, 1\}^n$, except this time $\mathcal{C}$
is the set of all decision lists over $x_1, \dots, x_n$. A decision list
is like a ``chain'' version of a decision tree: at each rule, if the
test succeeds, the decision list halts and returns a specified output;
if the test fails, the decision list moves on to the next rule. We can
see that $\abs{\mathcal{C}}$ is large, but still finite (and its log is
still polynomial in $n$).
Unlike the last algorithm, where we started with a ``full''
conjunction, for this one we'll start with an empty decision
list and slowly build it up. So how do we find the first
rule in our decision list? One approach is greedy: find a bit $x_i$ and a
value $j \in \{0, 1\}$ such that every remaining sample with $x_i = j$ has the
same label (either true or false), remove those samples from our set, and turn
the test $x_i = j$ together with that common label into a new rule in our
decision list. We can keep repeating this until no samples are left. To
formalize this:
\begin{figure}[H]
\begin{framed}
\begin{algorithmic}
\State $B \leftarrow \{f(x) \mid \langle x, f(x) \rangle \in S\}$
\While{$\abs{B} > 1$}
\State Pick a pair $(x_i, j)$ with $j \in \{0, 1\}$ that has no rule in the decision list yet
\State $Q \leftarrow \{f(x) \mid \langle x, f(x) \rangle \in S \text{ and } x_i = j \}$
\If{$\abs{Q} = 1$}
\State Make new rule: if $x_i = j$ then output the single label in $Q$
\State $S \leftarrow S \setminus \{ \langle x, f(x) \rangle \mid x_i = j \}$
\EndIf
\State $B \leftarrow \{f(x) \mid \langle x, f(x) \rangle \in S\}$
\EndWhile
\State Output the single remaining label in $B$ as the default value
\end{algorithmic}
\end{framed}
\begin{center}
\emph{Simple learning algorithm for decision lists.}
\end{center}
\end{figure}
% TODO: Add worked-out example
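A worked sketch of the greedy learner in Python, under our own conventions: a rule is a triple \texttt{(i, j, output)}, the last list entry is the default label, and the search order over $(i, j)$ pairs is an arbitrary choice of ours, not something fixed by the notes.

```python
def learn_decision_list(samples, n):
    """Greedy learner: repeatedly find a test (x_i == j) under which
    every remaining sample has the same label; emit that rule and
    remove the covered samples. Returns rules plus a default label,
    or None if no consistent decision list is found this way."""
    S = list(samples)
    rules = []
    while len({y for _, y in S}) > 1:
        for i in range(n):
            for j in (0, 1):
                labels = {y for x, y in S if x[i] == j}
                if len(labels) == 1:  # nonempty, and all one label
                    rules.append((i, j, labels.pop()))
                    S = [(x, y) for x, y in S if x[i] != j]
                    break
            else:
                continue
            break
        else:
            return None  # no usable rule remains
    rules.append(S[0][1] if S else 0)  # default output
    return rules

def predict(rules, x):
    """Walk the list; fall through to the default label at the end."""
    for i, j, out in rules[:-1]:
        if x[i] == j:
            return out
    return rules[-1]
```

Each accepted rule removes at least one sample, so the loop terminates, and the returned list is consistent with $S$ by construction.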
\section{Intractability of Learning 3-Term DNF}
We now return to the concept class of boolean conjunctions and instead consider
a slight generalization: the class of disjunctions of three boolean conjunctions
(\textbf{3-term disjunctive normal form formulae}). We will show that this concept
class is \emph{not} efficiently PAC-learnable unless every problem in $NP$ can be
solved efficiently in a worst-case sense by a randomized algorithm!
\textbf{3-Term DNF Formulae.} The representation class $\mathcal{C}_n$ of 3-term
DNF formulae is the set of all disjunctions $T_1 \vee T_2 \vee T_3$, where each
$T_i$ is a conjunction of literals over the boolean variables $x_1, \dots, x_n$.
We define the maximum size of an instance of the 3-term DNF formula to be
$size(c) \leq 6n$, as there can be at most $2n$ literals in each of the three
terms. An efficient learning algorithm for this problem therefore must run in
time polynomial in $n$, $1/\epsilon$, and $1/\delta$.\\
\begin{framed}
\begin{thm}
If $RP \neq NP$, the representation class of 3-term DNF formulae is not
efficiently PAC learnable when restricting the output hypothesis to the
class of 3-term DNF formulae.
\end{thm}
\end{framed}
\begin{proof}
The core of this proof is to reduce an $NP$-complete language $A$ to the problem
of PAC learning 3-term DNF formulae. To be precise, suppose we have any string
$\alpha$ for which we want to determine whether $\alpha \in A$. We will find a
mapping from $\alpha$ to some set $S_\alpha$ of labeled examples, where
$|S_\alpha|$ is polynomially bounded by $|\alpha|$. We will show that given a
PAC learning algorithm $\mathcal{L}$ for 3-term DNF formula, we can run
$\mathcal{L}$ on $S_\alpha$ to determine with
high probability whether $\alpha \in A$.
The mapping we want to find will demonstrate the key property: $\alpha \in A$
iff $S_\alpha$ is \textbf{consistent} with some concept $c \in \mathcal{C}$.
\emph{Determining Existence of a Concept Consistent with $S_\alpha$}. Let us
assume that we have found our desired mapping such that $\alpha \in A$ iff
$S_\alpha$ is consistent with some $c \in \mathcal{C}$. We then demonstrate how
a PAC learning algorithm $\mathcal{L}$ may be used to determine whether there
exists a concept $c \in \mathcal{C}$ consistent with $S_\alpha$ (and thus
whether $\alpha \in A$) with high probability.
Set $\epsilon = \frac{1}{2|S_\alpha|}$. When running $\mathcal{L}$ on
$S_\alpha$, whenever $\mathcal{L}$ requests a new sample, select a pair
$\langle x_i, b_i \rangle$ uniformly at random from $S_\alpha$. If there does in
fact exist a concept $c \in \mathcal{C}$, then this process simulates the oracle
$EX(c, \mathcal{D})$, where $\mathcal{D}$ is uniform over the instances
appearing in $S_\alpha$.
Given the choice of $\epsilon$, we can guarantee that any hypothesis $h$ with
error less than $\epsilon$ must be consistent with $S_\alpha$: if $h$ were to
incorrectly classify even a single example in $S_\alpha$, its error with respect
to $c$ and $\mathcal{D}$ would be at least $\frac{1}{|S_\alpha|} = 2\epsilon > \epsilon$.
Of course, if there is no concept in $\mathcal{C}$ consistent with $S_\alpha$,
$\mathcal{L}$ cannot find one. Therefore, we can simply verify the output of
$\mathcal{L}$ for consistency with $S_\alpha$ to determine with confidence
$1 - \delta$ whether such a consistent concept exists in $\mathcal{C}$.
We can therefore use this process combined with the assumed mapping from
$\alpha \rightarrow S_\alpha$ to determine with probability at least
$1-\delta$ whether $\alpha \in A$ by simulating $\mathcal{L}$ on $S_\alpha$.\\
\begin{framed}
\textbf{The Graph 3-Coloring Problem.} Given an undirected graph $G = (V, E)$
with vertex set $V = \{1, \dots, n\}$ and edge set $E \subseteq V \times V$ and
three different colors $c_1, c_2, c_3$, determine whether there is such a
mapping $c : V \rightarrow \{c_1, c_2, c_3\}$ such that for every edge
$(i, j) \in E$, $c(i) \neq c(j)$.
\end{framed}
\emph{Mapping from Graph 3-Coloring to 3-Term DNF Formulae}. We now define the
mapping from an instance $G = (V,E)$ of Graph 3-Coloring to a set $S_G$ of
labeled examples. We say $S_G = S_G^{+} \cup S_G^{-}$, where $S_G^{+}$ is the
set of positively labeled examples and $S_G^{-}$ the set of negatively labeled
ones.
For each $1 \leq i \leq n$, $S_G^{+}$ contains the labeled example
$\langle v(i), 1 \rangle$, where $v(i) \in \{0, 1\}^n$ is the vector of all
$1$'s except with a $0$ in the $i$th position.
For each edge $(i, j) \in E$, the set $S_G^{-}$ contains the labeled example
$\langle e(i,j), 0 \rangle$, where $e(i,j) \in \{0, 1\}^n$ is the vector of all
$1$'s except with $0$'s in the $i$th and $j$th positions.
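The construction of $S_G$ is purely mechanical; the following Python sketch (function names are our own) builds it exactly as defined above, with vertices indexed from $0$.

```python
def reduction_sample(n, edges):
    """Build S_G = S_G+ union S_G- for the 3-coloring reduction:
    v(i) is all 1's with a 0 in position i (labeled positive);
    e(i, j) is all 1's with 0's in positions i and j (labeled negative)."""
    def v(i):
        x = [1] * n
        x[i] = 0
        return tuple(x)

    def e(i, j):
        x = [1] * n
        x[i] = 0
        x[j] = 0
        return tuple(x)

    positives = [(v(i), 1) for i in range(n)]
    negatives = [(e(i, j), 0) for i, j in edges]
    return positives + negatives
```

Note $|S_G| = |V| + |E|$, which is polynomial in the size of $G$, as the reduction requires.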
\begin{figure}[t]
\begin{center}
\includegraphics[width=\textwidth*7/8]{threecoloring.png}
\emph{A graph $G$ with a legal 3-coloring, the associated sample, and the terms defined by the coloring.}
\end{center}
\end{figure}
We now show that $G$ is 3-colorable iff $S_G$ is consistent with some $3$-term
DNF formula. Let us first consider the forward direction. Suppose $G$ is
$3$-colorable and fix a $3$-coloring of $G$. Let $R$ be the set of all red
vertices, and let $T_R$ be the conjunction of all variables in $x_1, \dots, x_n$
whose index does \emph{not} appear in $R$. For each $i \in R$, the only $0$ in
$v(i)$ is in position $i$ and $x_i$ does not appear in $T_R$, so $v(i)$
satisfies $T_R$. Additionally, no $e(i,j) \in S_G^{-}$ can satisfy $T_R$: since
$i$ and $j$ can't both be colored red, at least one of $x_i$ and $x_j$ must
appear in $T_R$, and $e(i,j)$ is $0$ in both of those positions. We can define
the terms $T_B$ and $T_Y$ (blue, yellow) in a similar fashion, with no negative
example being accepted by any term.
We now show the other direction. Assume that the formula $T_R \vee T_B \vee T_Y$
is consistent with $S_G$. We define a 3-coloring of $G$: the color of the vertex
$i$ is red if $v(i)$ satisfies $T_R$, blue if $v(i)$ satisfies $T_B$, and yellow
if $v(i)$ satisfies $T_Y$. Break ties arbitrarily if $v(i)$ satisfies more than
one term. Since the formula is consistent with $S_G$, then every $v(i)$ must
satisfy some term and every vertex must be assigned a color.
We now prove that
this is a legal 3-coloring. Assume WLOG that two vertices $i$ and $j$ with
$i \neq j$ are given the same color, red. Then both $v(i)$ and $v(j)$ satisfy
$T_R$. By the definition of $v(\cdot)$, the $i$th bit of $v(i)$ is 0 and the
$i$th bit of $v(j)$ is 1. Hence neither $x_i$ nor $\neg x_i$ can appear in
$T_R$, so the $i$th bit does not affect whether an assignment satisfies $T_R$.
Since $v(j)$ and $e(i,j)$ differ only in the $i$th bit, if $v(j)$ satisfies
$T_R$, then so does $e(i,j)$. This implies that $e(i,j) \not\in S_G^{-}$ and
hence $(i,j) \not\in E$.
We therefore observe that 3-term DNF formulae are not efficiently PAC learnable
under the following assumptions:
\begin{itemize}
\item $NP$-complete problems cannot be solved with high probability by a
probabilistic polynomial time algorithm
\item the PAC learning algorithm is restricted to outputting a hypothesis
from the same representation class as the target concept.
\end{itemize}
\end{proof}
\section{Using 3-CNF to Avoid Intractability}
If we relax the constraint on learning 3-term DNF formulae by expanding the
output hypothesis concept class to one that is more expressive, then the class
of 3-term DNF formula \emph{is} in fact efficiently PAC-learnable.
We first observe that any 3-term DNF formula over the variables
$x_1, \dots, x_n$ may be expanded into an equivalent 3-CNF formula (a
conjunction of clauses, each a disjunction of three literals) via
distributivity of $\vee$ over $\wedge$:
$$T_1 \vee T_2 \vee T_3 \equiv \bigwedge_{u \in T_1, v \in T_2, w \in T_3} (u \vee v \vee w)$$
We can reduce the problem of PAC learning 3-CNF formulae to the problem of PAC
learning conjunctions, which we saw in section $3$. Given an oracle for random
examples for some unknown target $3$-CNF formula, we can apply a transformation
to render each positive/negative example into a corresponding positive/negative
example over a larger set of variables. This transformation will render the
$3$-CNF formula as a simple boolean conjunction which we may efficiently
learn.
The transformation is as follows: for every triple of literals $u, v, w$ over
the original variable set $x_1, \dots, x_n$, the new variable set contains a
variable $y_{u,v,w} = u \vee v \vee w$. When $u = v = w$, $y_{u,v,w} = u$.
Hence, this expanded variable set fully contains all of the original variables.
The total number of new variables is thus $(2n)^3 = O(n^3)$.
Thus, for any assignment $a \in \{0, 1\}^n$ to the original variables
$x_1, \dots, x_n$, we can compute the corresponding assignment $a'$ to the new
variables $y_{u,v,w}$, where $u, v, w$ range over the literals of
$x_1, \dots, x_n$, in $O(n^3)$ time. Observe that any 3-CNF formula over
$x_1, \dots, x_n$ is then a simple boolean conjunction over the $y_{u, v, w}$.
We can run the algorithm for learning
boolean conjunctions by expanding each assignment to $x_1, \dots, x_n$ that is a
positive example for the target formula to an assignment for the corresponding
$y_{u,v,w}$ and giving this expanded assignment to the algorithm as a positive
example for the unknown conjunction over $y_{u, v, w}$. When finished, we can
convert the hypothesis conjunction $h'$ back to a 3-CNF $h$ by replacing every
variable $y_{u,v,w}$ with its corresponding clause $(u \vee v \vee w)$.
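To make the transformation concrete, here is a minimal sketch in Python. The
encoding of a literal as an (index, sign) pair is our own illustrative choice,
not from the text:

```python
from itertools import product

def literal_value(a, literal):
    """Evaluate a literal, encoded as (variable index, is_positive),
    under a 0/1 assignment a."""
    index, positive = literal
    return a[index] if positive else 1 - a[index]

def expand_assignment(a):
    """Map an assignment over x_1..x_n to the corresponding assignment a'
    over the y_{u,v,w}: each new variable takes the value of the
    disjunction u OR v OR w of its triple of literals."""
    n = len(a)
    literals = [(i, s) for i in range(n) for s in (True, False)]
    return {
        (u, v, w): max(literal_value(a, u),
                       literal_value(a, v),
                       literal_value(a, w))
        for u, v, w in product(literals, repeat=3)
    }

a_prime = expand_assignment([1, 0])   # n = 2, so (2n)^3 = 64 new variables
```

Note that the diagonal variable $y_{u,u,u}$ recovers the literal $u$ itself,
matching the observation that the expanded variable set contains the original
variables.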
To conclude, we must argue that if $c$ and $\mathcal{D}$ are the target 3-CNF
formula and distribution over $\{0, 1\}^n$, and $c'$ and $\mathcal{D}'$ are the
corresponding conjunction over the $y_{u,v,w}$ and induced distribution over
assignments $a'$ to the $y_{u,v,w}$, then if $h'$ has error less than $\epsilon$
with respect to $c'$ and $\mathcal{D}'$, $h$ has error less than $\epsilon$ with
respect to $c$ and $\mathcal{D}$ as well. This is simple to see when noting that
the
applied transformation is injective. If $a_1$ is mapped to $a_1'$ and $a_2$ is
mapped to $a_2'$, then $a_1 \neq a_2$ implies $a_1' \neq a_2'$. Thus, each
assignment $a'$ on which $h'$ differs from $c'$ has a unique pre-image $a$ on
which $h$ differs from $c$, and the weight of $a$ under $\mathcal{D}$ is exactly
that of $a'$ under $\mathcal{D}'$. (Do note that the reduction relies on the
fact that the algorithm for learning conjunctions works for \emph{any}
distribution, as the transformation alters the original distribution
$\mathcal{D}$.)
Therefore, we have proven:\\
\begin{framed}
\begin{thm}
The representation class of $3$-CNF formulae is efficiently PAC learnable.
\end{thm}
\end{framed}
Therefore, we can efficiently PAC-learn 3-term DNF formulae by rendering them as
3-CNF formulae and allowing the hypothesis to be a 3-CNF formula. This does not
hold if we restrict the hypothesis class to 3-term DNF! As it turns out, the
same situation holds for any constant $k \geq 2$: $k$-term DNF formulae are
efficiently PAC learnable with $k$-CNF hypotheses, but not with $k$-term DNF
hypotheses.
It is critical to recognize that even with a fixed concept class from which
targets are chosen, the choice of \emph{hypothesis representation} can have a
huge impact on efficiency and tractability.
\section{Boosting}
In the problem of PAC-learnability, we were trying to find learning algorithms
that learned the target concept very well - to within any desired error rate
$\epsilon$. This is a strong guarantee, so what if we have a problem where we
can only find an algorithm that does slightly better than random guessing? To
formalize this a little bit, let us define \emph{weak PAC-learning}.
Weak PAC-learning is largely the same as PAC-learning, except we alter the
learning condition to be (for a binary classification problem)
\begin{equation*}
err_D(h) \leq \frac{1}{2} - \alpha
\end{equation*}
for some $0 < \alpha < \frac{1}{2}$. Clearly, this is a much weaker guarantee than
the standard PAC-learning model, so we can call learning algorithms that are
successful under this model \emph{weak learners} and learning algorithms
successful under standard PAC-learning \emph{strong learners}. It is also easy
to see that a strong learner is also a weak learner. However, a somewhat
counterintuitive claim is that any weak learner can be converted into a strong
learner. How?
In knowledge-based game shows, a common ``lifeline'' to help out a contestant
is to let them ask the audience what they think the answer is and then that
contestant can use the most popular answer from the crowd as their answer. This
crowdsourcing idea can be quite powerful and we can do something similar with
weak learners. If we had a weak learner $\LL$, we could keep an
\emph{ensemble} of separate instances of $\LL$ and take the majority opinion
to be our classification. This would almost surely be better than a single weak
learner, but since each ensemble member is still only a weak learner,
there is still a significant chance the
majority could be wrong. A natural improvement on this algorithm would be to
take a weighted majority by weighting the opinion of each member of the ensemble
based on its error rate, so that ``better'' members are weighted more highly.
We can do even better than this though. Using a technique called \emph{adaptive
boosting}, or AdaBoost for short, we have a guaranteed method of turning a weak
learner into a strong learner. You can read more about the AdaBoost algorithm
in Appendix D.
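The weighted-majority idea can be sketched in a few lines of Python. The
weighting rule $\frac{1}{2}\ln\frac{1 - err}{err}$ is the one AdaBoost uses;
the function names here are hypothetical:

```python
import math

def member_weight(err):
    """AdaBoost-style weight for an ensemble member with error rate err:
    near 0 when err is close to 1/2, large when err is close to 0."""
    return 0.5 * math.log((1 - err) / err)

def weighted_majority(predictions, errors):
    """Combine +1/-1 predictions from ensemble members, weighting each
    member's vote by its error rate."""
    score = sum(member_weight(e) * p for p, e in zip(predictions, errors))
    return 1 if score >= 0 else -1

# One accurate member outvotes two barely-better-than-random members.
label = weighted_majority([1, -1, -1], [0.05, 0.45, 0.48])
```

Here the single member with 5\% error carries more weight than the two
near-random members combined, so the ensemble predicts $+1$.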
\appendix
\section{Appendix A: Agnostic Learning}
\textbf{Unrealizable case.} In the sample-bound analysis presented
earlier, we consider the case
where the target concept $c(x)$ always exists and $c \in \mathcal{C}$.
We call this the realizable case. In the \emph{unrealizable} situation,
such a $c$ may not exist. Hence, we may not be able to produce a
hypothesis that is perfectly consistent with a training set.
\emph{Agnostic learnability.} A concept class $\mathcal{C}$ is
agnostically learnable if $\exists$ a learning algorithm $\mathcal{L}$
such that $\forall$ target concepts $c_t$ (not necessarily in
$\mathcal{C}$), $\forall$ distributions $\mathcal{D}$ over $\mathcal{X}$,
and $\forall \epsilon, \delta > 0$, given $\epsilon$, $\delta$, and an
oracle $EX(c_t, \mathcal{D})$, $\mathcal{L}$ outputs an $h \in \mathcal{H}$
such that with probability at least $1 - \delta$ it satisfies, along with
the efficiency conditions for PAC-learning:
$$err_D(h) \leq \min_{h'\in \mathcal{C}} err_D(h') + \epsilon$$
As a note, in various contexts you may see $\min_{h' \in \mathcal{C}}
err_D(h')$ written as $Opt(\mathcal{C})$ - the minimal true
error over all the possible hypotheses in $\mathcal{C}$.
The idea is that we want an algorithm that minimizes true error even
though the true errors of the functions in $\mathcal{C}$ are unknown.
As before, we want to identify a sufficient sample size $m$ that will
let a learning algorithm produce a minimal error hypothesis. First,
we'll show that for any particular $h$, the empirical and true errors
converge. Then, we'll show that the errors converge for all $h \in
\mathcal{C}$ - thus, minimizing $\hat{err}_D(h)$ over all $h \in
\mathcal{C}$ will approximately minimize $err_D(h)$.
\emph{Claim.} Let $h$ be a function $h\;:\;\mathcal{X}\rightarrow
\mathcal{Y} = \{0,1\}$. Then, $$P(|\hat{err}_D(h) - err_D(h)| \geq
\epsilon) \leq 2e^{-2\epsilon^2 m}$$ for any probability distribution
$\mathcal{D}$, any $\epsilon >0$, and any positive integer $m$.
\begin{proof}
Consider the indicator function
$$f(x_i, y_i) = \begin{cases}
1&\text{ if $h(x_i) \neq y_i$}\\
0&\text{ otherwise}
\end{cases}$$
Then, we can define the empirical error - the probability that $h$
misclassifies a point over the $m$ examples:
$$\hat{err}_D(h) = \frac{1}{m}\sum_{i=1}^m f(x_i, y_i)$$
Finally, we define the true error as: $$err_D(h) = E_{(x,y)\sim \mathcal{D}}[f(x, y)]$$
Since the $f(x_i, y_i)$ are independent random variables taking values
in $\{0, 1\}$, the Hoeffding inequality (see Appendix B) gives:
$$P(|\hat{err}_D(h) - err_D(h)| \geq \epsilon) \leq 2 e^{-2\epsilon^2 m}$$
This concentration inequality tells us that with larger sample size
$m$, the empirical error exponentially rapidly approaches the true
error.
\end{proof}
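We can sanity-check this bound with a small hypothetical simulation, in which
each example is misclassified independently with probability equal to the true
error (all parameter values below are arbitrary illustrative choices):

```python
import math
import random

def deviation_frequency(true_err, m, eps, trials=2000, seed=0):
    """Estimate P(|empirical error - true error| >= eps) over repeated
    samples of size m, where each example is misclassified with
    probability true_err."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        emp_err = sum(rng.random() < true_err for _ in range(m)) / m
        hits += abs(emp_err - true_err) >= eps
    return hits / trials

m, eps = 200, 0.1
observed = deviation_frequency(true_err=0.3, m=m, eps=eps)
bound = 2 * math.exp(-2 * eps**2 * m)   # Hoeffding bound, about 0.037 here
```

The observed deviation frequency stays below the Hoeffding bound, and both
shrink exponentially as $m$ grows.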
The above claim gives a bound on the difference between empirical and
true error for a particular hypothesis $h$. We now bound the difference
between the errors for all $h \in \mathcal{C}$.
\emph{Claim.} $P(\exists h \in \mathcal{C}\;:\; |err_D(h) - \hat{err}_D(h)| \geq \epsilon) \leq 2 |\mathcal{C}|e^{-2\epsilon^2 m}$
\begin{proof}
The claim immediately follows from the union bound:
\begin{align*}
P(\exists h \in \mathcal{C}\;:\; |err_D(h) - \hat{err}_D(h)|
\geq \epsilon) &= P(\bigcup_{h\in\mathcal{C}} \bigg\{ |err_D(h)
- \hat{err}_D(h)| \geq \epsilon \bigg\})\\
&\leq \sum_{h\in \mathcal{C}} P(|err_D(h) - \hat{err}_D(h)| \geq \epsilon)\\
&\leq 2|\mathcal{C}|e^{-2\epsilon^2 m}
\end{align*}
\end{proof}
We set $\delta \geq 2|\mathcal{C}|e^{-2\epsilon^2 m}$, so that with
probability at least $1 - \delta$, every $h$'s true and empirical errors
are within $\epsilon$ of each other. Solving for $m$,
\begin{align*}
\delta &\geq 2|\mathcal{C}|e^{-2\epsilon^2 m}\\
\ln \delta &\geq \ln(2|\mathcal{C}|) - 2\epsilon^2 m\\
2\epsilon^2 m &\geq \ln(2|\mathcal{C}|) - \ln \delta\\
m &\geq \frac{\ln(2|\mathcal{C}|) + \ln \frac{1}{\delta}}{2\epsilon^2}
\end{align*}
Looking at the expression obtained for a sufficient $m$, we
observe that the sample size scales as $1/\epsilon^2$ as opposed to
$1/\epsilon$ in the realizable PAC-learning case. Thus, in practice when
we cannot find a consistent hypothesis, we need more data to obtain a
hypothesis with low true error.
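The bound just derived can be turned into a simple sample-size calculator; the
class size $2^{20}$ in the example call is an arbitrary illustrative choice:

```python
import math

def agnostic_sample_size(class_size, eps, delta):
    """Smallest integer m satisfying
    m >= (ln(2|C|) + ln(1/delta)) / (2 eps^2)."""
    return math.ceil(
        (math.log(2 * class_size) + math.log(1 / delta)) / (2 * eps**2)
    )

m = agnostic_sample_size(class_size=2**20, eps=0.1, delta=0.05)  # 878
```

Halving $\epsilon$ quadruples the required sample size, reflecting the
$1/\epsilon^2$ dependence.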
Equivalently, given some number $m$ of examples, the error we expect is,
with probability at least $1 - \delta$,
$$err_D(h) \leq \hat{err}_D(h) + O\bigg(\sqrt{\frac{\ln(2|\mathcal{C}|)
+ \ln(\frac{1}{\delta})}{m}}\bigg)$$
From this, we can derive three core conditions for learning:
\begin{itemize}
\item \emph{large amount of training data} - increasing $m$ makes the
second term smaller, decreasing the bound on the true error
\item \emph{low training error} - decreasing the first term, the
training error, consequently decreases the bound on the true error
\item \emph{simple hypothesis space} - the measure of simplicity
here is the size of the hypothesis class; the smaller the class,
the smaller the second term, and hence the lower the bound on the
true error
\end{itemize}
Additionally, there is a trade-off between decreasing training error and
maintaining a simple hypothesis. As the complexity of the hypotheses
increases, the probability of finding a consistent hypothesis increases
and the training error tends to zero. However, the big-Oh term will
eventually begin to dominate, and after $err_D(h)$ reaches a minimum, it
will begin to rise again. At this point, we have \emph{overfitting}, and
while the hypothesis chosen is consistent on the
training data, the variance introduced is too high! Complex hypotheses
assume more about the data ``from thin air'' without actual training data
to support the complexity.
\begin{center}
\includegraphics[width=\textwidth*6/8]{overfitting.png}
\emph{empirical error vs. true error}
\end{center}
Overfitting is one of the most difficult and most important issues to
deal with in practical machine learning. Oftentimes, only the training
error $\hat{err}_D(h)$ can be observed directly, making it difficult to
minimize the true error. Here are a couple of main approaches towards
solving the overfitting problem:
\begin{itemize}
\item \emph{Structural Risk Minimization (SRM)}: an inductive
principle for model selection. Consider a set of models in order
of increasing complexity, and pick the one for which the sum of
empirical risk and a complexity penalty (e.g. based on
VC-dimension) is minimal. The goal is to minimize the theoretical
bound on the true error directly.
\item \emph{Cross-Validation}: Separate the available data into two
parts: a training set and a validation set. This lets us estimate
the true error and identify the stopping point for the algorithm
(i.e. when the estimated true error reaches a minimum). Variations
on this include splitting the data into $k$ parts ($k$-fold CV), or
training on all data points except one and repeating for each point
(LOOCV).
\end{itemize}
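The splitting step for $k$-fold cross-validation can be sketched as follows (a
minimal illustration of the index bookkeeping only; the helper name is our
own):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over n examples; fold i holds out every
    k-th index starting at i."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(n=10, k=5))   # 5 train/test partitions
```

Setting $k = n$ recovers LOOCV: each fold holds out exactly one point.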
As a side note (this is outside the scope of this lecture),
this theorem can actually be generalized to
infinite classes. But then wouldn't the number of samples
needed just be infinite, you ask? Actually, no! We can use
something called the Vapnik-Chervonenkis (VC) dimension, which
is the size of the largest set of points whose every possible
labeling can be induced by functions in $\mathcal{C}$. Appendix C
further explores the notion of VC dimension. Using this concept gives
us the following theorem:\\
\begin{framed}
\begin{thm}
If you find an algorithm that is consistent given
$m \geq \frac{VCD(\mathcal{C})}{\epsilon}\ln(\frac{1}{\delta})$ samples,
then with probability $1-\delta$ it satisfies the
learning condition.
\end{thm}
\end{framed}
\section{Appendix B: Jensen's and Hoeffding's Inequalities}
\textbf{Jensen's Inequality.} Jensen's inequality is arguably one of the most
significant inequalities in all of statistics, and we will in fact use it later
to prove Hoeffding's lemma. The inequality relates the convex function of an
expectation to the expectation of the convex function. We first briefly review
important properties of convex functions.
\emph{Convex function.} Let $I$ be an open subset of $\mathbb{R}$.
Let $f\;:\;I\rightarrow \mathbb{R}$. We say that $f$ is \textbf{convex} if
$\forall x_1, x_2 \in I$ and $\forall t \in [0, 1]$ we have
$$f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t)f(x_2)$$
In layman's terms, this means that on any interval $[x_1, x_2]$, the graph
of a convex function lies on or below the secant line through
$(x_1, f(x_1))$ and $(x_2, f(x_2))$.
\begin{center}
\includegraphics[width=\textwidth*6/8]{convexfunction.png}
\end{center}
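The defining inequality can be checked numerically on a grid of $t$ values;
this helper is purely illustrative, with a small tolerance to guard against
floating-point noise:

```python
def is_convex_on_segment(f, x1, x2, steps=100, tol=1e-12):
    """Check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)
    for t on an evenly spaced grid in [0, 1]."""
    for i in range(steps + 1):
        t = i / steps
        if f(t * x1 + (1 - t) * x2) > t * f(x1) + (1 - t) * f(x2) + tol:
            return False
    return True
```

For example, $f(x) = x^2$ passes the check on $[-3, 5]$ while $f(x) = -x^2$
fails it.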
We can also define convex functions in more familiar terms: