<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
<meta charset="UTF-8" />
<title>Canopy | Framework for monitoring CDN logs in real-time</title>
<link rel="stylesheet" href="stylesheets/reset.css" />
<link rel="stylesheet" href="stylesheets/prism.css" />
<link rel="stylesheet" href="stylesheets/main.css" />
<link rel="apple-touch-icon" sizes="180x180" href="images/icons/favicons/graphic-red.png" />
<link rel="icon" type="image/png" sizes="32x32" href="images/logos/logo.png" />
<link rel="icon" type="image/png" sizes="16x16" href="images/logos/logo.png" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="preconnect" href="https://fonts.gstatic.com" />
<link href="https://fonts.googleapis.com/css2?family=Red+Hat+Display:ital,wght@0,400;0,700;0,900;1,400&display=swap"
rel="stylesheet" />
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<script type="text/javascript" src="javascripts/prismCustom.js"></script>
<script src="javascripts/prism.js"></script>
<script type="text/javascript" src="javascripts/sidebar.js"></script>
<meta property="og:image" content="/images/logos/logo.png" />
<meta property="og:title" content="Canopy" />
<meta property="og:description" content="Canopy: A CDN monitoring solution." />
</head>
<body>
<header class="header-short">
<nav>
<ul>
<li>
<a href="/">
<img src="images/logos/logo.png" />
</a>
</li>
<li>
<a href="/">Home</a>
</li>
<li><a href="/case-study" class="active">Case Study</a></li>
<li><a href="/team">Our Team</a></li>
<li class="flex-float-right">
<a href="https://github.com/canopy-framework/canopy-cli" target="_blank">
<img src="images/logos/github-mark-light.png" alt="Canopy GitHub" class="github" />
</a>
</li>
</ul>
</nav>
</header>
<div class="study-wrapper">
<aside class="sidebar">
<ul>
<li>
<a href="#Introduction"> 1. Introduction</a>
</li>
<li>
<a href="#what-is-a-cdn"> 2. What is a CDN?</a>
</li>
<li>
<a href="#building-a-logging-pipeline"> 3. Building a Logging Pipeline</a>
</li>
<li>
<a href="#challenges-with-building-a-logging-pipeline">4. Challenges with Building a CDN Logging Pipeline</a>
</li>
<li>
<a href="#existing-solutions"> 5. Existing Solutions</a>
</li>
<li>
<a href="#introducing-canopy"> 6. Introducing Canopy</a>
</li>
<li>
<a href="#using-canopy"> 7. Using Canopy</a>
</li>
<li>
<a href="#architecture-overview">8. Architecture Overview</a>
</li>
<li>
<a href="#fundamental-challenges">9. Fundamental Challenges</a>
</li>
<li>
<a href="#automating-cloud-deployment">10. Automating Cloud Deployment</a>
</li>
<li>
<a href="#improving-the-core-pipeline">11. Technical Challenges: Improving the Core Pipeline</a>
</li>
<li>
<a href="#beyond-the-core-pipeline">12. Beyond the Core Pipeline</a>
</li>
<li>
<a href="#final-architecture">13. Final Architecture</a>
</li>
<li>
<a href="#conclusion">14. Conclusion</a>
</li>
<li>
<a href="#future-work">15. Future work</a>
</li>
</ul>
</aside>
<main>
<section id="case-study">
<h1>Case Study</h1>
<h2 id="Introduction">1. Introduction</h2>
<p>
Canopy is an open-source real-time monitoring framework designed specifically for use with the Amazon CloudFront CDN. We automate the deployment of an end-to-end pipeline for collecting, transforming, and storing Amazon CloudFront CDN logs, and process those logs to generate a critical suite of metrics for analysis.
</p>
<p>
In this case study, we introduce CDNs and discuss why you would want to monitor one. Next, we cover the challenges of working with CDN data and survey existing real-world solutions. Finally, we outline the evolution of Canopy’s architecture from an initial prototype to its current form.
</p>
<h2 id="what-is-a-cdn">2. What is a CDN and Why Do We Need to Monitor It?</h2>
<p>
Before we discuss how Canopy works, we need to review some basic concepts, including CDNs, monitoring, and what specifically can be gained from monitoring the CDN.
</p>
<h3 id="content-delivery-network">2.1 Content Delivery Network (CDN)</h3>
<p>
A Content Delivery Network (CDN) is a geographically distributed network of servers that stores cached versions of web content (HTML pages, images, video, and other media) at locations closer to end users. Using CDNs can improve the performance of web applications in two primary ways:
</p>
<p>
<strong>Reducing latency for end users.</strong> When a user visits a website, the data they are trying to retrieve must travel across the network from the website owner’s server (or origin) all the way to their computer. If a user is located very far from the origin server, they will experience higher latency than a user located nearby. In the image below, User B is located farther from the origin server than User A, and so experiences 400ms of additional latency.
</p>
<div class="img-wrapper">
<img src="images/diagrams/reduceLatency.png" alt="latency" />
</div>
<p>
CDNs reduce latency by hosting data across many different regions, ensuring that users in different areas receive similar response times. In the image below, users on the US East Coast need to access a website whose origin servers are based in London. Instead of every request being served by the UK origin server, the majority are served by the CDN edge location closest to the user, reducing the distance the data travels and providing a faster experience.
</p>
<div class="img-wrapper">
<img src="images/diagrams/reduceLatency2.png" alt="latency2" />
</div>
<p>
<strong>Reducing the request load on the origin server.</strong> The second major benefit that comes from using CDNs is they can lower the load on a company’s web servers, otherwise known as <strong>origin servers</strong>. A request fulfilled by the CDN is a request the origin server doesn’t have to address, reducing both bandwidth at the origin, and the system resources required to process requests.
</p>
<div class="img-wrapper">
<img src="images/diagrams/reduceLoadOnServer.png" alt="reduced load on origin server" />
</div>
<p>
CDNs can improve the performance of web applications. But in order to gain insights into how the CDN is functioning, developers need to <strong>monitor</strong> the CDN.
</p>
<h3 id="monitoring">2.2 Monitoring</h3>
<p>
In software engineering, monitoring is a process developers use to understand how systems and code are functioning. Google describes monitoring as: “Collecting, processing, aggregating, and displaying real-time quantitative data about a system.”<sup class="footnote-ref"><a id="fnref1" href="#fn1">[1]</a></sup>
</p>
<p>
By visualizing aggregated data, it’s much easier for developers and system administrators to recognize trends and gain insight into the health of a system. In order to monitor a system, we need to collect data, specifically telemetry data.
</p>
<h4>Telemetry</h4>
<p>
Telemetry data is “data that production systems emit to provide feedback about what’s happening inside the system”.<sup class="footnote-ref"><a id="fnref2" href="#fn2">[2]</a></sup> The three main telemetry data types used in software engineering are logs, metrics, and traces.
</p>
<div class="img-wrapper">
<img src="images/diagrams/logsMetricsTraces.png" alt="logsMetricsTraces" />
</div>
<p>
CDNs primarily emit log data, which can be used to generate metrics. Traces are important when monitoring distributed systems, but are less relevant to our use case.
</p>
<h4>Logs and Metrics</h4>
<p>
<strong>Logs</strong> are timestamped records of events that occur within a software system. They provide detailed and precise information, are often human-readable, and tend to be verbose compared to other types of data. Logs come in many formats and vary greatly depending on context.
</p>
<p>
The primary way to monitor the CDN is through analysis of log data, or CDN logs. The image below shows a <strong>CDN log</strong>. It contains the date the event occurred and the client’s IP address, as well as other information, such as whether the CDN was able to serve the client directly (a cache hit) and details about the requested content.
</p>
<div class="img-wrapper">
<img src="images/diagrams/unstructuredLog.png" alt="unstructuredLog" />
</div>
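<p>To make a raw line like this usable for analysis, it first has to be split into named fields. The sketch below shows one way to do this in Python; the field names and tab-separated layout are illustrative assumptions, since real CDN log formats vary.</p>

```python
# Illustrative sketch: parse a tab-separated CDN log line into named fields.
# The field list and order here are assumptions, not an actual CDN schema.
FIELDS = ["timestamp", "client_ip", "status", "cache_result", "uri", "time_taken"]

def parse_log_line(line: str) -> dict:
    """Split one raw log line into a dict keyed by field name."""
    record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    # Convert numeric fields so they can be aggregated later.
    record["status"] = int(record["status"])
    record["time_taken"] = float(record["time_taken"])
    return record

log = parse_log_line("1699574400.123\t203.0.113.7\t200\tHit\t/images/logo.png\t0.042")
print(log["cache_result"])  # prints "Hit"
```

<p>Structured records like this are what the aggregate queries discussed later in this study operate on.</p>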
<p>
When we want to understand a particular aspect of a system in more detail and how it changes over time, we turn to metrics.
</p>
<p>
<strong>Metrics</strong> are numerical representations of data. They are usually smaller in size than logs, have fewer fields, and measure a single aspect of the system being monitored. Metric data can be presented in a dashboard, offering a comprehensive view of a system’s overall health.
</p>
<p>
Monitoring log and metric data provides developers with insights they need to debug, optimize and analyze their applications. But why would developers need to monitor the CDN specifically?
</p>
<h3 id="why-monitor-the-cdn">2.3 Why Monitor the CDN?</h3>
<h4>CDNs are a "black box"</h4>
<p>
Due to their performance advantages, CDNs have become an indispensable piece of infrastructure for public-facing web applications. CDNs currently handle an estimated 72% of all internet traffic,<sup class="footnote-ref"><a id="fnref3" href="#fn3">[3]</a></sup> including dynamic content and streaming video and audio.
</p>
<div class="img-wrapper">
<img src="images/diagrams/CDNsControlledByThirdParties.png" alt="CDNs Controlled by Third Parties" />
</div>
<p>
However, CDNs are in many ways a “black box”. The physical infrastructure that makes up the CDN is operated by third parties and largely outside of our control: we can’t just “ssh” into a CDN to see what is going on. Therefore, monitoring the logs generated from CDN traffic is one of the only ways to gain some level of observability into this system.
</p>
<h4>CDN Logs Contain Valuable Data</h4>
<p>
Additionally, because so much user activity is served directly by the CDN, CDN logs are full of valuable information which can be used to answer diverse questions about our web applications, including questions related to user behavior, latency, security and beyond.<sup class="footnote-ref"><a id="fnref4" href="#fn4">[4]</a></sup>
</p>
<p>
The image below shows examples of information found in CDN logs. Client fields relate to the client (or user) that sent the request, resource fields to the information they are trying to access, and response fields to the success or failure of the request to the CDN.
</p>
<div class="img-wrapper">
<img src="images/diagrams/CDNLogFieldsWhyMonitor.png" alt="CDN Log Fields" />
</div>
<p>
The information in these logs can provide insights into fundamental metrics of system health corresponding to the <strong>four golden signals</strong> of monitoring.
</p>
<h4>The Four Golden Signals</h4>
<p>
Google identified the <strong>four golden signals</strong> of monitoring: Latency, Traffic, Errors and Saturation. Taken together, these four signals serve as a guidepost for monitoring teams and provide developers with a well-rounded understanding of what is happening in production systems.
</p>
<div class="img-wrapper">
<img src="images/diagrams/fourGoldenSignals.png" alt="Four Golden Signals" />
</div>
<p>
<strong>Latency</strong> refers to the time it takes to service a request. Latency directly impacts user experience and is a primary indicator for the performance of a system.
</p>
<p>
<strong>Traffic</strong> refers to how much demand is being placed on a system. Traffic can vary according to time of day or by region and can be difficult to manage without sufficient data.
</p>
<p>
<strong>Errors</strong> refer to the rate of requests that fail, i.e. 4xx and 5xx status codes. Errors may be isolated to a specific region, time of day, or resource path, and identifying these trends is key.
</p>
<p>
<strong>Saturation</strong> refers to the load on the network and servers relative to capacity (i.e. “system fraction”). Measuring saturation can help identify bottlenecks in a system.
</p>
<p>
CDN log data can generate metrics that correspond to the golden signals, for example, “90th percentile time to first byte (TTFB)” for latency.<sup class="footnote-ref"><a id="fnref5" href="#fn5">[5]</a></sup> However, in order to monitor data from the CDN, it is necessary to transport and process that data to a central location where it can be used. Let’s take a look at what goes into building a data pipeline for taking CDN logs from the source all the way to visualizing those metrics in a dashboard UI.
</p>
<h2 id="building-a-logging-pipeline">3. Building a Logging Pipeline</h2>
<p>
Data pipelines are systems that process and move data from a source to a destination, where it can then be analyzed. Data pipelines for telemetry data are essential to software monitoring.<sup class="footnote-ref"><a id="fnref6" href="#fn6">[6]</a></sup>
</p>
<p>
A logging pipeline is a kind of data pipeline for log-based telemetry data. Logging pipelines allow us to collect, visualize and analyze log data for software monitoring and data analysis.
</p>
<p>
To better define what makes up a typical logging pipeline, we use the model for telemetry pipelines outlined by Jamie Riedesel in her book Software Telemetry. Riedesel identifies three key “stages” for telemetry pipelines: emitting, shipping, and presentation.<sup class="footnote-ref"><a id="fnref7" href="#fn7">[7]</a></sup> Taken together, these three stages describe the flow of telemetry data as it moves through a pipeline.
</p>
<p>
Let’s look at each stage.
</p>
<div class="img-wrapper">
<img src="images/diagrams/CDNLoggingPipeline.png" alt="CDN Logging Pipeline" />
</div>
<h3 id="emitting">3.1 Emitting</h3>
<p>
First is the emitting stage. In the emitting stage, data is accepted from production systems and prepared for shipment in the logging pipeline.
</p>
<div class="img-wrapper">
<img src="images/diagrams/emitting.png" alt="Emitting" />
</div>
<h3 id="shipping">3.2 Shipping</h3>
<p>
The shipping stage takes raw log data from the data source and moves it to storage. This involves three necessary steps: collection, transformation, and storage.
</p>
<div class="img-wrapper">
<img src="images/diagrams/shipping.png" alt="Shipping" />
</div>
<h3 id="presentation">3.3 Presentation</h3>
<p>
The final stage in a logging pipeline is presentation. This is where log data is queried and visualized through a user interface. Here, users make sense of the data and derive insights for various purposes.
</p>
<div class="img-wrapper">
<img src="images/diagrams/presentation.png" alt="Presentation" />
</div>
<p>
Now that we have a better understanding of what a logging pipeline is, let’s take a look at the challenges associated with building a CDN logging pipeline.
</p>
<h2 id="challenges-with-building-a-logging-pipeline">4. Challenges with Building a CDN Logging Pipeline </h2>
<p>
Working with CDN logs comes with a unique set of challenges. The root of these challenges lies with the fact that <strong>CDNs emit massive amounts of log data</strong>. The deluge of logs makes data ingestion, querying, visualization, and storage tricky to manage.
</p>
<h3 id="the-scale-of-CDN-log-data">4.1 The Scale of CDN Log Data </h3>
<p>
When a user opens a web page, the browser might issue many requests to the CDN. A single web page can consist of a variety of different assets, such as images, videos, and JavaScript files, and the web browser must issue requests for each of these resources.
</p>
<div class="img-wrapper">
<img src="images/diagrams/scaleOfCDNLogData1Person.png" alt="The Scale of CDN Log Data One Person" />
</div>
<p>
Each of these requests will then hit the CDN layer. This means that traffic from a relatively small number of viewers can add up to a large number of requests at the CDN.
</p>
<div class="img-wrapper">
<img src="images/diagrams/scaleOfCDNLogData2Computers.png" alt="The Scale of CDN Log Data Two Computers" />
</div>
<p>
Even for small- and medium-sized companies, then, web traffic can result in millions or even billions of requests to the CDN, and an equivalent number of logs. This presents a challenge to the engineering teams who need to handle all this data.
</p>
<p>
LoveHolidays, an online travel agency, states in an article about upgrading its CDN monitoring solution that it processes more than <strong>30 gigabytes per day</strong> of CDN logs.<sup class="footnote-ref"><a id="fnref8" href="#fn8">[8]</a></sup> That is quite a lot for a medium-sized company to deal with for just one type of telemetry data from one component of its cloud architecture! Altinity, an enterprise database provider, used CDN telemetry as an example of a typical trillion-row dataset.<sup class="footnote-ref"><a id="fnref9" href="#fn9">[9]</a></sup>
</p>
<p>
The effects of this data flow are felt at every stage of a logging pipeline, from data ingestion to storage and visualization. Let’s take a look at data ingestion, where the effects are felt first.
</p>
<h3 id="ingesting-data-from-the-cdn">4.2 Ingesting Data from the CDN</h3>
<p>
Ingesting CDN log data into a pipeline can be challenging. Internet traffic tends to be bursty, similar to traffic at a major train station, which might be busy during rush hour, but nearly empty in the late evening.
</p>
<div class="img-wrapper">
<img src="images/diagrams/burstyIngestion.png" alt="Bursty Ingestion" />
</div>
<p>
For the CDN, this means that user activity can fluctuate according to special events, such as flash sales, or by time of day, resulting in varying log output. The pipeline needs to handle this varying flow without slowing down or backing up.
</p>
<h3 id="querying-and-visualizing-cdn-log-data">4.3 Querying and Visualizing CDN Log Data</h3>
<p>
An equally important challenge is how to efficiently query and visualize CDN log data. Monitoring the CDN requires running analytic queries, like data aggregates, where we perform mathematical operations such as “sum,” “count,” “min/max,” and “average” over a large portion of the dataset. These queries can be slow and expensive to run, especially with such large datasets.
</p>
<div class="img-wrapper">
<img src="images/diagrams/aggregateQueriesSlow.png" alt="aggregate queries are slow" />
</div>
<p>
However, data aggregation is essential to deriving useful information from text-based log data. For example, examining the “time taken” field (a measure of latency) for one log is rarely useful by itself. With many logs, it is much more useful to ask “what was the average time taken?” This data can then be visualized in a chart or graph.
</p>
<div class="img-wrapper">
<img src="images/diagrams/aggregateQueriesTableAvgTimeTaken.png" alt="aggregate queries table avg time taken" />
</div>
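<p>As a concrete illustration of the aggregate in the table above, the sketch below computes average time taken per resource path over a batch of parsed records. In a real pipeline this would be a SQL aggregate executed by the database (an AVG with a GROUP BY), not application code; the field names are assumptions.</p>

```python
from collections import defaultdict

def avg_time_taken_by_path(records):
    """Average the "time taken" field per requested path."""
    totals = defaultdict(lambda: [0.0, 0])  # path -> [running sum, count]
    for r in records:
        totals[r["uri"]][0] += r["time_taken"]
        totals[r["uri"]][1] += 1
    return {path: total / count for path, (total, count) in totals.items()}

records = [
    {"uri": "/a.png", "time_taken": 0.10},
    {"uri": "/a.png", "time_taken": 0.30},
    {"uri": "/b.css", "time_taken": 0.50},
]
print(avg_time_taken_by_path(records))  # one average per path
```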
<p>
While many databases support aggregate queries, not all perform well with such large datasets: the queries can be slow and expensive to run, and difficult to process in real time. This poses a significant challenge for a real-time monitoring solution.
</p>
<h3 id="storage-requirements">4.4 Storage Requirements</h3>
<p>
Aside from efficiently running analytic queries, a storage solution for CDN logs has three major requirements:
</p>
<div class="img-wrapper">
<img src="images/diagrams/storageRequirements.png" alt="storage requirements" />
</div>
<h4>Retaining Individual Log Lines</h4>
<p>
First, it should ideally be able to store and retrieve <strong>every</strong> log line the CDN emits for debugging and compliance. If we only wanted to use CDN log data for data analysis, we might be able to reduce storage requirements by sampling the data set (or only storing a fraction of the log data emitted by the CDN).
</p>
<p>
However, if something goes wrong in a production system’s cloud architecture, the answer to what went wrong might lie in a single log line. At ApacheCon 2019, Geoff Genz gave a presentation describing how Comcast stores CDN log data. He addressed this problem directly, saying, “We can't do just aggregates... we have to have the actual events. Because if somebody calls up and asks what happened with this client at this time, we need every single thing that happened”.<sup class="footnote-ref"><a id="fnref10" href="#fn10">[10]</a></sup>
</p>
<h4>Providing Efficient Compression</h4>
<p>
Because we need to store all this data, doing so efficiently becomes very important. The storage solution should therefore be able to efficiently compress and store log data. Reducing the required storage size for large CDN datasets can result in substantial benefits in terms of cost and maintainability.
</p>
<p>
A corollary is that databases storing CDN log data should not add large indexes to it. Indexes are data structures that improve the speed of queries, but large indexes increase storage requirements, and generating them consumes system resources.<sup class="footnote-ref"><a id="fnref11" href="#fn11">[11]</a></sup>
</p>
<h4>Supporting Quick Batch Insertions</h4>
<p>
Finally, because CDNs produce so much log data, our storage solution needs to ingest large quantities of it quickly. Batching has advantages over streaming individual log lines to the database: log data can be shipped using fewer requests, saving network bandwidth and making retries easier to manage in the event the database is unavailable.
</p>
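<p>The batching idea can be sketched as a small buffer that flushes either when it reaches a maximum size or when a maximum wait time has elapsed. This is a simplified illustration, not Canopy’s actual shipper; <code>insert_batch</code> stands in for a real database client.</p>

```python
import time

class BatchShipper:
    """Buffer log records and insert them into storage in batches."""

    def __init__(self, insert_batch, max_batch=500, max_wait_seconds=5.0):
        self.insert_batch = insert_batch  # callable taking a list of records
        self.max_batch = max_batch
        self.max_wait = max_wait_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, record):
        self.buffer.append(record)
        too_full = len(self.buffer) >= self.max_batch
        too_old = time.monotonic() - self.last_flush >= self.max_wait
        if too_full or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            self.insert_batch(self.buffer)  # one request for many log lines
            self.buffer = []
        self.last_flush = time.monotonic()
```

<p>Batching this way trades a small amount of latency for far fewer database requests, and a failed flush can be retried as a unit.</p>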
<p>
In summary, the large volume of logs emitted by CDNs presents challenges at every stage of a logging pipeline, from data ingestion to storage and visualization. However, these technical challenges are not the only considerations teams need to keep in mind when choosing a monitoring solution. Different solutions offer different tradeoffs that we need to consider.
</p>
<h2 id="existing-solutions">5. Existing Solutions</h2>
<p>
Companies interested in monitoring their AWS CloudFront CDN distributions have three main choices: they can use AWS’ ‘native’ monitoring tools, a third-party (SaaS) solution, or they can build their own DIY solution using open source tools. These choices have different advantages and tradeoffs.
</p>
<div class="img-wrapper">
<img src="images/diagrams/tableWithouCanopy.png" alt="table without canopy" />
</div>
<h3 id="aws-native-monitoring-tools">5.1 AWS ‘Native’ Monitoring Tools</h3>
<p>
The first and easiest choice would be to use the CDN’s “native solution”. For CloudFront, this would be the included <strong>Reports and Analytics</strong> page. CDN native solutions are easy to use and don’t require teams to send data to a third party, but don’t easily integrate with other observability data, and in Amazon’s case, don’t offer customizable dashboards.
</p>
<p>
Note: AWS also offers Amazon CloudWatch, a fully featured monitoring solution that can be used to visualize CloudFront logs and metrics. However, its cost is on par with that of the third-party SaaS providers discussed below, and it is no easier to use with CloudFront real-time logs than other SaaS solutions, as the logs must be shipped to CloudWatch manually.
</p>
<h3 id="third-party-saas">5.2 Third-Party (SaaS)</h3>
<p>
An appealing option for many teams would be to use a third-party SaaS (“Software as a Service”) solution, such as Datadog or New Relic.
</p>
<p>
SaaS solutions have several advantages. They are easy to use, integrate with other observability data, and feature customizable dashboards. They also manage the logging pipeline for you, relieving the developer from concerns about deploying, scaling, or maintaining pipeline infrastructure.
</p>
<p>
However, SaaS solutions are not ideal for teams that have strict data ownership requirements. Teams handling sensitive data or operating in regulated industries must consider data privacy and compliance requirements. They would have to give up control over their log data and infrastructure to a third-party, including the ability to decide how and where the logs are stored and processed. Third-party vendors can also be expensive.
</p>
<h3 id="diy">5.3 DIY</h3>
<p>
Finally, teams looking to monitor the CDN can choose to build a custom DIY solution.
</p>
<p>
A main advantage of DIY solutions is data ownership. DIY solutions allow development teams to retain complete control over their data: who accesses it, where it’s stored, and how long to store it. It also means the flexibility to customize the pipeline according to their specific requirements.
</p>
<p>
The downside to this approach is the labor required to build it. Building a solution could take weeks or months depending on the complexity of the project and available developer time.
</p>
<h3 id="another-option">5.4 Another Option?</h3>
<p>
For some teams, neither SaaS solutions nor the AWS native tools fit their specific use case, yet they may not want to devote substantial developer time to building a DIY solution. <strong>This is where Canopy fits in.</strong>
</p>
<h2 id="introducing-canopy">6. Introducing Canopy</h2>
<p>
Canopy’s design combines the ease of use of a third-party SaaS solution with the data ownership and control associated with a DIY approach.
</p>
<div class="img-wrapper">
<img src="images/diagrams/comparisonTableWCanopy.png" alt="comparison table with canopy" />
</div>
<p>
Canopy’s architecture is built using open-source components that are configured within the team’s own AWS account, allowing them full control of their data. Canopy also features customizable, real-time dashboards and fully-automated deployment.
</p>
<p>
However, Canopy lacks certain features offered by platforms like Datadog or a fully customized DIY solution. For example, Canopy does not support integrating CDN log data with other observability data.
</p>
<p>
Let’s explore how to use Canopy for your team’s monitoring needs.
</p>
<h2 id="using-canopy">7. Using Canopy</h2>
<h3 id="installing-and-deploying-the-canopy-pipeline">7.1 Installing and Deploying the Canopy Pipeline</h3>
<p>
Canopy is designed to be easy to use and to require minimal configuration. The Canopy logging pipeline can be deployed to AWS in one command using “canopy deploy”. Detailed installation and configuration information can be found on our GitHub page.
</p>
<h3 id="monitoring-cdn-log-data">7.2 Monitoring CDN Log Data</h3>
<p>
Canopy provides a custom set of Grafana-powered real-time dashboards divided into three tabs: CDN Logs Overview, Client Information, and Performance. They include metrics and visualizations corresponding to the four golden signals.
</p>
<div class="img-wrapper">
<img src="images/dashboardPics/graf_cdn_logs_overview.png" alt="Grafana Overview" />
</div>
<p>
This image shows the “CDN Logs Overview” dashboard, which is the main landing page for our users to monitor their CloudFront distributions. Here, we present traffic and error metrics, allowing users to quickly assess the health of their CDN traffic. The top row shows the overall cache hit ratio as well as information related to errors and total requests.
</p>
<p>
There are two other Grafana dashboards.
</p>
<p>
The Client Information tab presents traffic metrics specifically related to the client.
</p>
<div class="img-wrapper">
<img src="images/dashboardPics/graf_client_info.png" alt="Grafana Client Info" />
</div>
<p>
The Performance tab presents latency and saturation metrics.
</p>
<div class="img-wrapper">
<img src="images/dashboardPics/graf_performance.png" alt="Grafana Performance" />
</div>
<h4>Admin Dashboard</h4>
<p>
Canopy also has an Admin Dashboard, displayed here. From it, users can conveniently deploy and configure pipeline infrastructure, as well as monitor its status after deployment.
</p>
<div class="img-wrapper">
<img src="images/dashboardPics/admin_configure.png" alt="Admin Dash Configure" />
</div>
<p>
From the Admin Dashboard, teams can also configure “quick alerts” with the click of a button. Quick alerts send email notifications to teams when certain thresholds are met, corresponding to the golden signals.
</p>
<div class="img-wrapper">
<img src="images/dashboardPics/admin_alerts.png" alt="Admin Dash Alerts" />
</div>
<p>
In the upcoming sections, we will discuss the challenges and engineering considerations that went into building Canopy’s pipeline architecture and examine the evolution of that architecture from an initial prototype to its current form.
</p>
<h2 id="architecture-overview">8. Architecture Overview</h2>
<p>
Building Canopy was a multi-step process. Through several iterations, we took Canopy from an idea to a fully automated real-time logging pipeline. We began by identifying the core requirements for our project and then proceeded to build an initial prototype.
</p>
<h3 id="core-pipeline-architecture">8.1 Core Pipeline Architecture</h3>
<p>
From the model for Telemetry Pipelines described earlier, we knew that we needed an emitting stage, a shipping stage and a presentation stage. When we set out to build Canopy, we started by mapping out the core components we needed for each stage of its pipeline.
</p>
<div class="img-wrapper">
<img src="images/diagrams/keyComponentsArchitecture.png" alt="prototype architecture" />
</div>
<p>
During the emitting stage, the <strong>CDN</strong> emits a continuous flow of logs as users make requests to the CDN.
</p>
<p>
The shipping stage consists of three steps: collection, transformation, and storage. For collection, a <strong>stream</strong> collects and stores logs in real-time as they flow from the CDN. For transformation, the <strong>log transformer</strong> converts those logs into a format appropriate for storage. For storage, the <strong>log shipper</strong> buffers and batches the transformed logs and inserts them into the <strong>database</strong>.
</p>
<p>
During the presentation stage, the <strong>visualizer</strong> queries the data stored in the <strong>database</strong> and visualizes the results in charts, graphs and tables in real-time.
</p>
<p>
After pinpointing the core components required for Canopy, our next step was to build a working prototype based on this architecture. In the following sections, we discuss the challenges we encountered during development and how we resolved them.
</p>
<h2 id="fundamental-challenges">9. Fundamental Challenges</h2>
<h3 id="data-storage">9.1 Data Storage</h3>
<p>
Selecting a storage solution for CDN logs proved to be one of the most difficult decisions we had to make. We explored various possibilities, including Elasticsearch, time-series databases and columnar databases.
</p>
<div class="img-wrapper">
<img src="images/diagrams/elasticTimescaleClickhouse.png" alt="elastic vs timescale vs clickhouse" />
</div>
<p>
Our use case had unique requirements due to the nature of CDN log data. Our database needed to handle large volumes of CDN logs at scale without sampling, and to compute aggregates efficiently. Therefore, we had to consider how heavily each database indexes data, since indexing overhead inflates storage costs, and storage pressure is precisely what drives teams to sample. We also needed to consider the types of queries each database is optimized for.
</p>
<div class="img-wrapper">
<img src="images/diagrams/databaseTableComp.png" alt="Database table comparison" />
</div>
<p>
Elasticsearch, a popular search engine and database for log analysis, indexes the full contents of stored documents, making it excellent for full-text search. However, that full indexing carries significant storage overhead, which works against our goal of retaining logs without sampling.
</p>
<p>
We also considered time-series databases, since several of our metrics analyze changes over time. Time-series databases index data over a single dimension - time - making them efficient for running simple queries, such as aggregating a single metric over time. However, many of our queries aggregate across several non-time dimensions, where a single-dimension index offers little help.
</p>
<h4>Final Verdict: Columnar Database </h4>
<p>
Ultimately, we decided on a columnar database - specifically, ClickHouse. ClickHouse offered sparse indexing and efficient compression for reducing storage, as well as a column-oriented approach to processing data. This approach optimizes for our use case’s more complex queries, which aggregate several metrics over multiple dimensions - for example, URIs grouped by HTTP status code, or requests grouped by IP address.
</p>
<p>
The vast majority of the queries we planned to run against the database follow a pattern: they touch a fairly large number of rows, but only a small subset of columns.
</p>
<p>
A columnar database stores data in columns, as opposed to rows. Its power lies in the ability to access a column of data and collapse it into a calculated result. This facilitates faster aggregation of data because the database only reads the necessary columns and omits the ones not needed for the query. CDN logs tend to form large datasets, and a columnar database can consolidate a high volume of data from a small subset of columns without the need to search entire rows of a table.
</p>
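<p>
The difference can be sketched in a few lines of JavaScript (a toy model of columnar storage, not ClickHouse internals): with column-oriented storage, computing the cache hit ratio reads only the one column it needs, while row-oriented storage walks entire records.
</p>

```javascript
// Row-oriented: every record is a full object; computing one metric
// still walks entire rows.
const rows = [
  { timestamp: 1, status: 200, cacheResult: "Hit",  uri: "/a.js" },
  { timestamp: 2, status: 200, cacheResult: "Miss", uri: "/b.css" },
  { timestamp: 3, status: 404, cacheResult: "Miss", uri: "/c.png" },
  { timestamp: 4, status: 200, cacheResult: "Hit",  uri: "/a.js" },
];

// Column-oriented: the same data stored as one array per field.
const columns = {
  timestamp:   rows.map((r) => r.timestamp),
  status:      rows.map((r) => r.status),
  cacheResult: rows.map((r) => r.cacheResult),
  uri:         rows.map((r) => r.uri),
};

// The cache hit ratio needs only the cacheResult column; the other
// columns are never read.
function cacheHitRatio(cacheResultColumn) {
  const hits = cacheResultColumn.filter((v) => v === "Hit").length;
  return hits / cacheResultColumn.length;
}

console.log(cacheHitRatio(columns.cacheResult)); // 0.5
```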
<div class="img-wrapper">
<img src="images/diagrams/columnarDatabaseCacheHits.png" alt="Calculating cache hits with a columnar database" />
</div>
<p>
ClickHouse also uses sparse indexes. The values for each column are stored in separate files, sorted by a primary index; in Canopy’s case, timestamp serves as the primary index. By default, ClickHouse does not create an index entry for every row; instead, it creates one entry per granule - a batch of 8,192 consecutive rows - which can then be located and read efficiently. Sparse indexes minimize storage needs and pair well with batch insertions, which suits our use case.
</p>
<p>
While sparse indexes are less suitable for queries that fetch individual rows, our use case focuses primarily on aggregates. Furthermore, using timestamp as a primary key and sparse index still allows for acceptable query performance when searching for a subset of logs within a particular time range.
</p>
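<p>
The idea behind a sparse index can be sketched as follows (a toy model with a granule size of 4 rather than ClickHouse’s much larger default): the index holds one entry per granule, and a time-range query only locates and scans the granules whose boundaries overlap the range.
</p>

```javascript
const GRANULE_SIZE = 4; // toy value; ClickHouse defaults to 8,192 rows

// A timestamp column, sorted by the primary index (timestamp).
const timestamps = [10, 11, 13, 16, 20, 22, 25, 29, 31, 34, 38, 40];

// Sparse index: one entry per granule - the first value in each granule.
function buildSparseIndex(sortedColumn) {
  const index = [];
  for (let i = 0; i < sortedColumn.length; i += GRANULE_SIZE) {
    index.push({ firstValue: sortedColumn[i], offset: i });
  }
  return index;
}

// A range query scans only granules that may contain the range,
// instead of checking every row.
function granulesForRange(index, from, to) {
  return index.filter((entry, i) => {
    const next = index[i + 1];
    const granuleEnd = next ? next.firstValue : Infinity;
    return entry.firstValue <= to && granuleEnd > from;
  });
}

const index = buildSparseIndex(timestamps);
console.log(index.length);                    // 3 entries cover 12 rows
console.log(granulesForRange(index, 21, 26)); // only the middle granule
```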
<p>
Sorted columnar data also adds another side benefit: efficient compression. Since sorted data in a column file tends to be similar and contains repeated adjacent values, it compresses well, as compared to compressing a series of rows.
</p>
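<p>
The compression benefit of sorted columns can be illustrated with simple run-length encoding (a stand-in for the real codecs ClickHouse uses): sorting clusters identical neighbors into long runs, so the encoded form shrinks dramatically.
</p>

```javascript
// Run-length encode an array into [value, runLength] pairs.
function runLengthEncode(values) {
  const runs = [];
  for (const v of values) {
    const last = runs[runs.length - 1];
    if (last && last[0] === v) last[1] += 1;
    else runs.push([v, 1]);
  }
  return runs;
}

// An edge-location column: as it arrives (row order) vs sorted.
const unsorted = ["IAD", "SFO", "IAD", "SFO", "IAD", "SFO", "IAD", "SFO"];
const sorted = [...unsorted].sort();

console.log(runLengthEncode(unsorted).length); // 8 runs - no savings
console.log(runLengthEncode(sorted).length);   // 2 runs - compresses well
```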
<p>
Companies such as Uber and Cloudflare have transitioned from Elasticsearch to ClickHouse due to Elasticsearch’s limitations in handling high data volumes. The combination of sparse indexes and efficient compression allowed Cloudflare to remove sampling completely.<sup class="footnote-ref"><a id="fnref12" href="#fn12">[12]</a></sup>
</p>
<h3 id="moving-data-in-near-real-time">9.2 Moving Data in (near) Real-Time </h3>
<p>
Now that we had chosen ClickHouse as our database, the next challenge we faced was how to move log data from the CDN to ClickHouse in real-time. This represents the shipping stage, and therefore, we needed stream storage, a log transformer and a log shipper. For the prototype, we prioritized development speed and reliability, relying on well-established and mature industry tools for these components.
</p>
<div class="img-wrapper">
<img src="images/diagrams/streamsFirehoseVector.png" alt="Kinesis Data Streams to FireHose to Vector" />
</div>
<p>
By default, CloudFront real-time logs are delivered to <strong>AWS Kinesis Data Streams</strong>, a fully managed service for collecting and storing data in a stream. As a result, this is the first stop in our pipeline. Each stream consists of one or more shards - units of capacity - where log records are grouped and stored with their order preserved.
</p>
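<p>
Shard sizing follows directly from the per-shard write limits - 1 MB per second or 1,000 records per second, whichever is hit first. A rough sizing sketch (real provisioning should leave headroom for traffic spikes):
</p>

```javascript
// Per-shard write limits for Kinesis Data Streams.
const SHARD_MB_PER_SEC = 1;
const SHARD_RECORDS_PER_SEC = 1000;

// Estimate the shards needed for a given log volume; whichever limit
// (bytes or record count) binds first determines the answer.
function estimateShards(recordsPerSec, avgRecordKB) {
  const byThroughput = (recordsPerSec * avgRecordKB) / 1024 / SHARD_MB_PER_SEC;
  const byCount = recordsPerSec / SHARD_RECORDS_PER_SEC;
  return Math.max(1, Math.ceil(Math.max(byThroughput, byCount)));
}

// e.g. 5,000 log records/sec averaging 0.5 KB each:
console.log(estimateShards(5000, 0.5)); // 5 (record count is the bottleneck)
```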
<p>
After logs are stored in a stream, we needed a way to deliver them to our log transformer. One option was to build an application that would read data from Kinesis Data Streams, process the data and deliver it.
</p>
<p>
Ultimately, we chose <strong>AWS Kinesis Data Firehose</strong>, a fully managed service for data delivery. Because Firehose has a minimum buffer interval of 60 seconds, it makes our pipeline a “near real-time” solution for delivering logs.
</p>
<p>
Finally, we needed a log transformer and a log shipper. One option was to build both components in tandem, which would streamline our overall architecture.
</p>
<p>
Ultimately we opted to use <strong>Vector</strong>, an open-source tool for aggregating, transforming and routing observability data. Since Vector was out-of-the-box compatible with both Firehose and ClickHouse, it was a convenient choice to use as a data pipe between the two components, in addition to performing log transformation.
</p>
<h3 id="data-transformation">9.3 Data Transformation
</h3>
<p>
The next challenge we faced was how to transform logs before loading the data into ClickHouse. Each CloudFront log is emitted in the form of a plain-text string with no predefined fields included.
</p>
<div class="img-wrapper">
<img src="images/diagrams/parseLogData.png" alt="parse log data" />
</div>
<p>
With Vector, our log transformer, we used its built-in parsing function together with a custom regex pattern. The pattern converts each log into a structured JSON object, mapping field names to their corresponding values within the log. Finally, we convert specific values to the appropriate data types for storage in ClickHouse; for example, timestamps are stored in the “DateTime” format, while CDN edge locations are stored as strings.
</p>
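<p>
The transformation looks roughly like this, sketched in JavaScript rather than Vector’s remap language (the log line and field list here are simplified illustrations; the actual fields depend on the real-time log configuration):
</p>

```javascript
// A simplified raw log line (real CloudFront real-time logs are
// tab-delimited; the field list depends on the log configuration).
const rawLog = "1699999999.123\t203.0.113.7\t200\tIAD89-C1\tHit";

// Named capture groups map each field to a key, mirroring the role of
// our custom regex pattern in Vector.
const pattern =
  /^(?<timestamp>[\d.]+)\t(?<clientIp>[^\t]+)\t(?<status>\d{3})\t(?<edgeLocation>[^\t]+)\t(?<cacheResult>[^\t]+)$/;

function parseLog(line) {
  const match = line.match(pattern);
  if (!match) return null;
  const g = match.groups;
  return {
    // Convert values to types appropriate for ClickHouse storage:
    timestamp: new Date(parseFloat(g.timestamp) * 1000), // DateTime
    clientIp: g.clientIp,                                // String
    status: parseInt(g.status, 10),                      // numeric code
    edgeLocation: g.edgeLocation,                        // String
    cacheResult: g.cacheResult,                          // String
  };
}

console.log(parseLog(rawLog));
```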
<p>
The appeal of using Vector in this context stems from its dual functionality as a data pipe linking Firehose to ClickHouse. Using Vector allowed us to rapidly build our prototype architecture.
</p>
<h3 id="deploying-the-canopy-backend">9.4 Deploying the Canopy Backend
</h3>
<p>
After successfully addressing data storage, moving data in real-time and data transformation, we had developed a functional prototype on our local computers. However, one final challenge remained: ensuring that anyone could set up and start using Canopy as easily as possible. This goal necessitated deploying Canopy to the cloud.
</p>
<div class="img-wrapper">
<img src="images/diagrams/howToDeployBackendQuestion.png" alt="How to deploy" />
</div>
<p>
While Data Streams and Firehose are directly created within AWS, for the other components comprising the Canopy backend (Vector, ClickHouse and our visualizer, Grafana), our first priority was deploying them locally.
</p>
<p>
Ultimately, the solution we chose was Docker. We leveraged Docker to containerize our backend components and used Docker Compose to deploy and run them as containers on a host. With built-in service discovery, our containers automatically communicate with each other over a private network. Moreover, built-in data persistence ensured preservation of both the data in our database and the dashboards in Grafana.
</p>
<div class="img-wrapper">
<img src="images/diagrams/devAWStoLocalComputerWDocker.png" alt="AWS to Local Computer with Docker" />
</div>
<p>
Docker also integrates well with AWS. Once we had the Docker setup working locally, it simplified the process of moving our backend to the cloud.
</p>
<p>
We considered two options for cloud deployment: Amazon EC2 and Amazon Elastic Container Service (ECS) with Fargate.
</p>
<div class="img-wrapper">
<img src="images/diagrams/ec2VSecs.png" alt="EC2 vs ECS" />
</div>
<p>
EC2 is a virtual private server service. In this scenario, we rent a virtual private server - or an instance of EC2 - and run our backend as Docker containers within that instance. ECS and Fargate are fully managed services, eliminating the need for manual management and scaling of containers and the infrastructure they’re hosted on.
</p>
<p>
Ultimately, for simplicity, we opted for an EC2 instance. Deploying Docker containers on EC2 mimicked the process of deploying on a local computer, whereas configuring our containers to work with ECS and Fargate for data persistence and service discovery involved substantial complexity. We decided to forgo that complexity while building a working prototype.
</p>
<div class="img-wrapper">
<img src="images/diagrams/packageDockerLocalToEC2.png" alt="package docker local to EC2" />
</div>
<h3 id="prototype-architecture">9.5 Prototype Architecture</h3>
<p>
The diagram below shows our prototype architecture, which we built to address the fundamental challenges discussed previously.
</p>
<div class="img-wrapper">
<img src="images/diagrams/prototypeArchitecture.png" alt="prototype Architecture" />
</div>
<p>
With this prototype, we successfully implemented a working solution. Users could deploy our backend infrastructure to an EC2 instance, and visualize logs and metrics in near real-time on Grafana dashboards. However, to improve ease of use, we still needed to automate the deployment of our cloud infrastructure.
</p>
<h2 id="automating-cloud-deployment">10. Automating Cloud Deployment</h2>
<p>
Setting up and properly configuring AWS resources can be complex, and automating this process would relieve developers from that burden. To accomplish this, we used AWS CDK to automate cloud deployment, and we built a command line interface to make configuration and deployment more intuitive.
</p>
<h3 id="amazon-cdk">10.1 Amazon CDK</h3>
<p>
AWS CDK (Cloud Development Kit) is an infrastructure-as-code tool that deploys all our AWS resources with code written in JavaScript. We use it to automatically deploy Kinesis Data Streams, Kinesis Firehose and Canopy’s backend - Vector, ClickHouse and Grafana - on Amazon EC2.
</p>
<p>
AWS CDK serves as a wrapper for Amazon CloudFormation. CloudFormation uses a declarative language - YAML or JSON - as a template to provision cloud resources. With CDK, we combine the capabilities of CloudFormation with the convenience of JavaScript. We selected CDK over other tools like Terraform, primarily for simplicity. We did not want to introduce another third-party tool into our architecture when a native AWS solution was readily available.
</p>
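<p>
To make the comparison concrete, provisioning just the Kinesis stream directly in CloudFormation looks something like the following fragment (resource and stream names are illustrative); with CDK, the equivalent is a short JavaScript constructor call, and CDK synthesizes a template like this on our behalf.
</p>

```yaml
# Minimal CloudFormation fragment for a Kinesis stream
# (resource name and property values are illustrative).
Resources:
  CanopyLogStream:
    Type: AWS::Kinesis::Stream
    Properties:
      Name: canopy-cdn-logs
      ShardCount: 1
      RetentionPeriodHours: 24
```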
<h3 id="building-the-canopy-cli">10.2 Building the Canopy CLI</h3>
<p>
With Docker containers and CDK code in place, our users could now deploy all of Canopy’s architectural components with greater ease. To make this process more intuitive, we decided to build a command-line interface (CLI).
</p>
<p>
Deploying cloud architecture requires correctly configuring AWS account details. Building a CLI enabled us to prompt users for the required information, such as their AWS account ID and CloudFront distribution ID, instead of requiring them to navigate through a detailed set of configuration steps. Additionally, by offering intuitive CLI commands, such as “canopy deploy” and “canopy destroy”, we removed the need for our users to worry about the underlying file structure of our deployment code.
</p>
<p>
Upon building and automating Canopy’s prototype architecture, we achieved a working solution accompanied by a user-friendly deployment and configuration process. However, during the course of prototyping, we identified areas where Canopy could better suit its use case. With its core functionality in place, we set out to optimize and refine the working version of Canopy.
</p>
<h2 id="improving-the-core-pipeline">11. Technical Challenges: Improving the Core Pipeline</h2>
<p>
When we set out to make Canopy, two of our key goals were:
</p>
<ul>
<li>
<p>
<strong>Ease of Use:</strong> We wanted to create a solution that would be as easy to set up and use as possible.
</p>
</li>
<li>
<p>
<strong>Real-Time Dashboard Updates:</strong> We wanted to deliver a true "real-time" experience. This meant dashboards that updated instantaneously as log data streamed in, enabling users to monitor events in real-time.
</p>
</li>
</ul>
<h3 id="limitations-to-ease-of-use">11.1 Limitations to Ease of Use</h3>
<p>
Our initial prototype relied on Kinesis Data Firehose to deliver data from Kinesis Data Streams to the log transformer, Vector, over an encrypted HTTPS connection. This approach exposed several limitations.
</p>
<p>
First, Firehose’s HTTPS requirement limited ease of use by making configuration more complex. Specifically, it forced our users to create a new domain or subdomain, generate their own TLS certificate file, upload the certificate, maintain certificate validity, and update DNS records with the IP address of our dynamically generated EC2 instance. These tasks placed a substantial burden on the user and made our solution more difficult to set up and use.
</p>
<h3 id="limitations-to-real-time-dashboards">11.2 Limitations to Real-Time Dashboards</h3>
<p>
Using AWS Firehose revealed another limitation. Firehose buffers data for at least 60 seconds before streaming it to its destination, resulting in unwanted latency. Additional latency is also introduced by:
</p>
<ol>
<li>
<p>
The HTTPS connection between Firehose and Vector: A 3-way handshake plus a TLS handshake means 3 additional round trips across the wire before log data is routed to ClickHouse.
</p>
</li>
<li>
<p>
Routing parsed log data from Vector to ClickHouse: another 3-way handshake.
</p>
</li>
</ol>
<div class="img-wrapper">
<img src="images/diagrams/handshakes.png" alt="tls and tcp handshakes" />
</div>
<p>
This introduced a delay of over a minute before logs could be stored in ClickHouse and visualized in Grafana, thus preventing Canopy from delivering a true real-time experience.
</p>
<h3 id="building-a-custom-log-transformer-shipper">11.3 Solution: Building a Custom Log Transformer/Shipper Using AWS Lambda</h3>
<p>
To overcome these limitations, we made the strategic choice of building a custom log transformer and shipper using AWS Lambda. This decision enabled us to simplify the architecture, eliminating the need for both Vector and Firehose. The Lambda function fulfills two critical roles:
</p>
<ul>
<li>
<p>
<strong>Log Transformer:</strong> By incorporating the decoding, parsing, and transformation logic within the Lambda function, we reduced the number of hops logs needed to traverse to reach ClickHouse, resulting in reduced latency.
</p>
</li>
<li>
<p>
<strong>Log Shipper:</strong> The Lambda function routes logs directly to ClickHouse immediately after they are read from Kinesis Data Streams and transformed by the Lambda. This results in a true “real-time” solution, in contrast to the previous “near real-time” setup associated with Firehose.
</p>
</li>
</ul>
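<p>
The shape of the Lambda function can be sketched as follows (a simplified illustration: the field layout, endpoint, and table name are assumptions, not our exact implementation). Kinesis invokes the function with a batch of base64-encoded records, which are decoded, transformed, and shipped to ClickHouse as a single batch insert.
</p>

```javascript
// Decode and transform one Kinesis record into a structured log entry.
// (Simplified: the real field layout depends on the CloudFront
// real-time log configuration.)
function transformRecord(record) {
  const raw = Buffer.from(record.kinesis.data, "base64").toString("utf8");
  const [timestamp, clientIp, status] = raw.trim().split("\t");
  return {
    timestamp: parseFloat(timestamp),
    clientIp,
    status: parseInt(status, 10),
  };
}

// Ship a batch to ClickHouse over its HTTP interface as JSONEachRow
// (endpoint and table name are illustrative).
async function shipBatch(entries) {
  const body = entries.map((e) => JSON.stringify(e)).join("\n");
  const res = await fetch(
    "http://clickhouse:8123/?query=INSERT%20INTO%20cdn_logs%20FORMAT%20JSONEachRow",
    { method: "POST", body }
  );
  // Throwing on failure signals Kinesis to re-deliver the batch.
  if (!res.ok) throw new Error(`ClickHouse insert failed: ${res.status}`);
}

// Lambda entry point (wired up as the function's handler in deployment):
// one invocation receives a batch of up to 10,000 records.
async function handler(event) {
  const entries = event.Records.map(transformRecord);
  await shipBatch(entries);
}
```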
<div class="img-wrapper">
<img src="images/diagrams/beforeLambdaNonIsoflowArchitecture.png" alt="before lambda architecture" />
</div>
<div class="img-wrapper">
<img src="images/diagrams/afterLambdaNonIsoflowArchitecture.png" alt="after lambda architecture" />
</div>
<h3 id="using-lambda-with-sqs-for-failover">11.4 Using AWS Lambda with SQS & S3 For Failover</h3>
<p>
In addition to simplifying Canopy’s architecture, our Lambda-based solution offered the following benefits:
</p>
<ul>
<li>
<p>
<strong>Flexible Buffering:</strong> Using Lambda gave us complete control over log buffering. We could adjust the buffer to replicate Firehose's 60-second buffer or achieve an even more real-time buffer (with intervals as low as 0 seconds), depending on the needs of our users. While opting for a shorter buffer interval could lead to more frequent database writes, ClickHouse can handle batch insertions as frequently as once per second.
</p>
</li>
<li>
<p>
<strong>Versatile Data Transport:</strong> Using a Lambda function allowed us to ship logs over HTTP, removing the manual configuration steps users needed to set up an HTTPS endpoint for Firehose. Since log data is transported entirely within our own architecture, we felt the benefits of HTTPS were outweighed by its costs in user-friendly deployment and maintainability. In the future, we aim to reintroduce support for users who prefer or need to deploy with HTTPS; Lambda can support both use cases.
</p>
</li>
<li>
<p>
<strong>Improved Latency:</strong> Logs can now be parsed and routed directly to ClickHouse from the Lambda function, reducing network hops from 2 to 1 (by removing Vector) and network round trips from 3 to 1 (by removing the need for TLS handshakes), resulting in reduced latency.
</p>
</li>
<li>
<p>
<strong>Improved Debugging: </strong> When our Lambda code throws an error, Lambda records a log that details exactly what happened, which is convenient for debugging network and infrastructure faults. Firehose lacks this feature, making debugging a potentially frustrating process.
</p>
</li>
<li>
<p>
<strong>Scalability: </strong> The Lambda function's ability to process up to 10,000 records per invocation aligned well with Canopy's requirement to handle a massive number of logs, surpassing Firehose's maximum batch size of 500 records.
</p>
</li>
</ul>
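<p>
The flexible buffering described in the first point can be sketched as a small size-or-time buffer (a simplified model; the class name and thresholds are illustrative, not our exact implementation):
</p>

```javascript
// A buffer that flushes when it reaches maxSize entries or when
// maxIntervalMs has elapsed since the last flush, whichever comes first.
class LogBuffer {
  constructor(maxSize, maxIntervalMs, flush) {
    this.maxSize = maxSize;
    this.maxIntervalMs = maxIntervalMs;
    this.flush = flush;
    this.entries = [];
    this.lastFlush = Date.now();
  }

  add(entry, now = Date.now()) {
    this.entries.push(entry);
    const timedOut = now - this.lastFlush >= this.maxIntervalMs;
    if (this.entries.length >= this.maxSize || timedOut) {
      this.flush(this.entries);
      this.entries = [];
      this.lastFlush = now;
    }
  }
}

// With maxIntervalMs = 0, every add flushes immediately - the
// "true real-time" end of the spectrum:
const flushed = [];
const realtime = new LogBuffer(1000, 0, (batch) => flushed.push(batch));
realtime.add("log1");
realtime.add("log2");
console.log(flushed.length); // 2 - each log shipped as it arrives
```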
<h3 id="handing-failed-logs">11.5 Handling Failed Logs</h3>
<p>
One of the major challenges in dealing with a distributed system is accounting for network failures.
</p>
<p>
<strong>
What happens if logs cannot be successfully delivered to the database due to ephemeral network failures or if they fail to be inserted into the database due to database server errors?
</strong>
</p>
<div class="img-wrapper">
<img src="images/diagrams/networkFailureIncompleteData.png" alt="network Failure Incomplete Data" />
</div>
<p>
In our prototype, Firehose handled failed logs for us. It routed batches of logs that couldn't be successfully delivered to the database towards S3 for storage.
</p>
<div class="img-wrapper">
<img src="images/diagrams/prototypeFailedLogsPipelineNonIso.png" alt="prototype Failed Logs Pipeline" />
</div>
<p>
We could have implemented a similar solution with our Lambda function, routing failed logs to S3 based on error response codes from requests sent to ClickHouse, or by utilizing the failed request callback handler. However, this approach posed a problem: logs that failed and were stored in S3 cannot be visualized on our Grafana dashboards. As a result, the data and metrics presented to our users could become inaccurate.
</p>
<p>
To address this issue, we leveraged Kinesis Data Streams as a buffer for failed logs. We configured our Lambda function to throw an error if a batch of logs could not be inserted into ClickHouse or if the request failed. This triggers Kinesis Data Streams to re-stream the failed logs, according to a configurable setting that tracks the age of the logs in the stream.
</p>
<div class="img-wrapper">
<img src="images/diagrams/dataStreamsBuffersFailedLogs.png" alt="data streams buffers failed logs" />
</div>
<p>
We also configured the Lambda function to retry shipping failed logs based on the recorded number of retries. Only after reaching a max-retry limit would the data be sent to S3. This approach ensured that ephemeral network outages or database server errors would not result in inconsistent data in our Grafana dashboards.
</p>
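<p>
This retry behavior can be simulated in a few lines (a simplified model of the delivery semantics, not actual AWS code): the stream re-delivers a failed batch until the insert succeeds or the max-retry limit is exhausted, at which point the batch is diverted to failover storage.
</p>

```javascript
// Simulate Kinesis re-delivering a batch to the Lambda until the
// insert succeeds or maxRetries is exhausted (simplified model).
function deliverWithRetries(batch, insert, maxRetries) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      insert(batch);
      return { delivered: true, attempts: attempt + 1 };
    } catch (err) {
      // The Lambda threw: Kinesis keeps the batch and re-invokes.
    }
  }
  // Retries exhausted: the batch is diverted to failover storage (S3).
  return { delivered: false, attempts: maxRetries + 1 };
}

// An insert that fails twice (e.g. an ephemeral network outage) and
// then succeeds:
let failures = 2;
const flakyInsert = () => {
  if (failures-- > 0) throw new Error("connection reset");
};

console.log(deliverWithRetries(["log1", "log2"], flakyInsert, 3));
// → { delivered: true, attempts: 3 }
```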
<p>
It’s important to note that this solution has a tradeoff: the additional monetary cost incurred for each Lambda invocation. However, we deemed these additional invocations necessary to prioritize data integrity and maintain the accuracy of the visualizations on the Grafana dashboards.
</p>
<h3 id="archiving-failed-logs">11.6 Archiving Failed Logs for Debugging & Compliance</h3>
<p>
Although this solution was effective, it raised another question: <strong>What happens when records in Kinesis Data Streams reach their max age or when the max-retry limit is met?</strong> In either scenario, failed logs would be ejected from the stream, and we would lose any record of them.
</p>
<div class="img-wrapper">
<img src="images/diagrams/dataStreamsCarryingLogs.png" alt="data streams carrying failed logs" />
</div>
<p>
To address this issue, we created a separate pipeline for handling failed log data. This pipeline consists of a dead-letter queue, managed by Amazon Simple Queue Service (SQS), along with a Lambda function that pushes failed logs to S3 for persistent storage.
</p>
<div class="img-wrapper">
<img src="images/diagrams/lambdaFailedLogsPipelineNonIso.png" alt="lambda failed logs pipeline" />
</div>
<p>
A dead-letter queue is a type of message queue designed to temporarily store messages that a system fails to process due to errors.<sup class="footnote-ref"><a id="fnref13" href="#fn13">[13]</a></sup> Kinesis Data Streams pushes failed logs to the queue once a batch has failed after multiple retries. The Lambda function then reads from the dead-letter queue and stores the failed logs in S3 before clearing them from the queue asynchronously. While this approach introduced added complexity, it ensured that failed logs are archived for debugging and compliance needs, supporting our core use case.
</p>
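<p>
The shape of this archival Lambda is simple (a sketch under assumptions: the bucket name, key scheme, and S3 stub here are illustrative, not our exact implementation): it reads the batch of failure messages SQS delivers and archives each one as a date-keyed object in S3.
</p>

```javascript
// Stub for the AWS SDK's S3 PutObject call (a real client is used in
// deployment); here it just records what would be written.
const archived = [];
async function s3PutObject(params) {
  archived.push(params);
}

// Build a date-partitioned S3 key for an archived batch of failed logs
// (key scheme is illustrative).
function archiveKey(date) {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `failed-logs/${day}/${date.getTime()}.json`;
}

// SQS-triggered Lambda: each record body describes a failed batch.
// SQS deletes the messages once the handler returns successfully.
async function handler(event) {
  for (const record of event.Records) {
    await s3PutObject({
      Bucket: "canopy-failed-logs", // illustrative bucket name
      Key: archiveKey(new Date()),
      Body: record.body,
    });
  }
}
```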
<h2 id="beyond-the-core-pipeline">12. Beyond the Core Pipeline</h2>
<p>
The improved architecture provided our users a true real-time monitoring experience and simplified deployment, while also accounting for potential network failures. With the core pipeline elements in place, we turned our attention to "quality of life" improvements. This included adding support for monitoring multiple CloudFront distributions in parallel, as well as creating an Admin Dashboard for pipeline management.
</p>
<h3 id="adding-support-for-multiple-distributions">12.1 Adding Support for Parallel CloudFront Distributions</h3>
<p>
At this juncture, Canopy only supported working with a single CloudFront distribution. However, it is not uncommon for development teams to have multiple distributions within a single AWS account. Distributions can be configured with different cache-control policies, geographic restrictions, and real-time log configurations, which can be useful when working with multiple domains that require different CDN settings.
</p>
<p>
To accommodate monitoring multiple distributions, teams would need to either duplicate Canopy’s pipeline infrastructure, or manually attach another distribution to the existing pipeline. To better serve these users, we decided to add native support for multiple parallel CloudFront distributions.
</p>
<h3 id="weighing-options-for-parallelization">12.2 Weighing Options for Parallelization</h3>
<p>