Prerequisites and co-requisites |
- Sound knowledge of at least one object-oriented programming language (preferably Java).
- Basic knowledge of algorithms and data structures.
|
Course content |
In addition to databases and batch processing systems, stream processing systems form a further class of tools for storing and processing data. They are used in particular for processing unbounded and/or unordered data, and when low latencies are required. Knowledge of the underlying concepts, models and algorithms is the basis for the efficient use of the tools employed for such processing (e.g. Apache Spark, Apache Flink, Apache Apex, Apache Kafka).
- Introduction to general concepts of stream processing: differences and similarities between services (online systems), batch processing systems (offline systems) and stream processing systems (near-real-time systems); event streams, messaging systems, unbounded and/or unordered data, latency
- The concept of “time”: event time vs. processing time, stateful operations, handling of “late data” and watermarking, “windowed operations” based on event time (illustrated in the first sketch after this list)
- Relationships between state, streams and immutability
- Fault tolerance: micro-batching and checkpointing, idempotency, recovery
- Introduction to Apache Spark and Apache Kafka as stream processing platforms
- Typical use cases for stream processing systems: complex event processing, stream analytics
- Architectures, programming models, challenges in processing data streams: parallel processing (e.g. load balancing, acknowledgment and redelivery), partitioned logs
- Algorithms and patterns for the efficient processing of data streams: stream joins (stream-table joins, table-table joins, time dependence of joins), probabilistic data structures (see the two further sketches after this list)
- Best practices when using state-of-the-art stream processing platforms
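
The following is a minimal sketch of event-time windowing with watermarking in Spark Structured Streaming, one of the platforms introduced above. It is illustrative only: the socket source, host, port and checkpoint directory are placeholder assumptions, and the timestamp attached by the socket source is an ingestion timestamp, so a realistic event-time pipeline would parse the timestamp out of the event payload instead.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class WindowedWordCounts {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("WindowedWordCounts")
                .master("local[*]")          // assumption: local demo execution
                .getOrCreate();

        // Placeholder source: one event per line from a local socket;
        // "includeTimestamp" adds a timestamp column to each row.
        Dataset<Row> events = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .option("includeTimestamp", true)
                .load();

        // Count events per value in 10-minute event-time windows and
        // tolerate data arriving up to 5 minutes late (watermark).
        Dataset<Row> counts = events
                .withWatermark("timestamp", "5 minutes")
                .groupBy(window(col("timestamp"), "10 minutes"), col("value"))
                .count();

        // Checkpointing makes the query recoverable after a failure.
        counts.writeStream()
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/windowed-word-counts") // placeholder path
                .start()
                .awaitTermination();
    }
}
```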
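
As a sketch of the stream-table join pattern mentioned above, the following Kafka Streams topology enriches a stream of click events with the latest profile of the clicking user. The topic names, bootstrap server and string payloads are placeholder assumptions chosen for brevity; a realistic application would use typed records and appropriate serdes.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamTableJoinSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-join-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stream of click events, keyed by user id (placeholder topic).
        KStream<String, String> clicks = builder.stream("clicks");
        // Changelog-backed table holding the latest profile per user id.
        KTable<String, String> profiles = builder.table("profiles");

        // Each click is joined with the current state of the user's profile.
        KStream<String, String> enriched =
                clicks.join(profiles, (click, profile) -> click + " by " + profile);

        enriched.to("enriched-clicks");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The result also hints at the time dependence of joins: each click sees the profile as it is known to the table at the moment the click is processed.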
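
Finally, as an example of the probabilistic data structures listed above, the following self-contained count-min sketch approximates per-item frequencies of an unbounded stream in fixed memory. The depth, width and hashing scheme are simple illustrative choices, not tuned parameters.

```java
import java.util.Random;

/** Minimal count-min sketch: approximate frequency counts in fixed memory. */
public class CountMinSketch {
    private final int depth;        // number of hash rows
    private final int width;        // counters per row
    private final long[][] counts;
    private final int[] seeds;      // one hash seed per row

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new Random(42).ints(depth).toArray();
    }

    private int bucket(Object item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);                 // mix high bits into the low bits
        return Math.floorMod(h, width);
    }

    /** Record one occurrence of the item. */
    public void add(Object item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    /** Estimated count; may overestimate (collisions) but never underestimates. */
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 1 << 12);
        for (int i = 0; i < 1_000; i++) sketch.add("error");
        sketch.add("warning");
        System.out.println(sketch.estimate("error"));   // about 1000
        System.out.println(sketch.estimate("warning")); // at least 1
    }
}
```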
|
Learning outcomes |
Students can:
- name essential concepts which form the basis for stream processing systems and their tasks.
- explain the similarities and differences between stream processing and batch processing systems as well as the relationships between the two paradigms.
- recognize typical use cases for stream processing systems and implement them with the help of state-of-the-art frameworks (e.g. Spark Structured Streaming, Apache Kafka).
- explain the concept of "time" in the context of stream processing and take into account the associated challenges when implementing algorithms.
- name fundamental patterns for efficient processing of data streams and use them in a targeted manner.
- compare different stream processing frameworks with respect to their architecture and their properties in practical use.
- implement simple practical tasks with the help of state-of-the-art frameworks.
|
Planned learning activities and teaching methods |
- Lecture with in-class exercises (supplemented by flipped-classroom elements)
- Guided study of relevant literature
- Work on case studies in small groups: analysis, design, implementation, testing, documentation and presentation
- Discussions and dialogues in plenary to give and receive effective feedback
|
Assessment methods and criteria |
- Evaluation of the exercises (written elaboration/documentation, quality of the solution achieved, ...)
- Evaluation of the presentation of the exercises
- Evaluation of participation in the discourse (e.g. peer feedback)
|
Comment |
None |
Recommended or required reading |
- Akidau, Tyler; Chernyak, Slava; Lax, Reuven (2018): Streaming systems: the what, where, when, and how of large-scale data processing. First edition. Sebastopol, CA: O’Reilly.
- Bifet, Albert (2017): Machine learning for data streams: with practical examples in MOA. Cambridge, Massachusetts: MIT Press (= Adaptive computation and machine learning series).
- Gakhov, Andrii (2019): Probabilistic Data Structures and Algorithms for Big Data Applications. 1st Ed. Books on Demand.
- Kleppmann, Martin (2017): Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Revised. Boston: O’Reilly UK Ltd.
- Kleppmann, Martin (2016): Making sense of stream processing: the philosophy behind Apache Kafka and scalable stream data platforms. O’Reilly. Available at: https://assets.confluent.io/m/2a60fabedb2dfbb1/original/20190307-EB-Making_Sense_of_Stream_Processing_Confluent.pdf (Accessed on: 10 August 2021).
- Maas, Gerard; Garillot, François (2019): Stream processing with Apache Spark: mastering structured streaming and Spark streaming. First edition. Sebastopol, CA: O’Reilly Media, Inc.
- Marz, Nathan; Warren, James (2015): Big Data: Principles and best practices of scalable realtime data systems. Shelter Island, NY: Manning.
- Seymour, Mitch (2021): Mastering Kafka Streams and ksqlDB: building real-time data systems by example. First edition. Sebastopol: O’Reilly.
- Shapira, Gwen et al. (2021): Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale. 2nd Edition. S.l.: O’Reilly UK Ltd.
- Stopford, Ben (2018): Designing Event-Driven Systems. O’Reilly Media. Available at: http://sd.blackball.lv/library/Designing_Event-Driven_Systems_(2018).pdf (Accessed on: 10 August 2021).
|
Mode of delivery (face-to-face, distance learning) |
Face-to-face teaching with selected online elements.