Information on individual educational components (ECTS-Course descriptions) per semester

  
Degree programme: Master Computer Science
Type of degree: FH Master's Degree Programme
 Full-time
 Winter Semester 2024
  

Course unit title: Batch Processing Systems
Course unit code: 024913110501
Language of instruction: German
Type of course unit (compulsory, optional): Elective
Semester when the course unit is delivered: Winter Semester 2024
Teaching hours per week: 2
Year of study: 2024
Level of course unit (e.g. first, second or third cycle): Second Cycle (Master)
Number of ECTS credits allocated: 4
Name of lecturer(s): Peter REITER


Prerequisites and co-requisites
  • Sound knowledge of at least one object-oriented programming language (preferably Java).
  • Basic knowledge of algorithms and data structures.
Course content

Concepts, models and algorithms that enable the efficient use of state-of-the-art data processing platforms for processing large amounts of data. Students apply this knowledge on widely used data processing platforms (e.g. Google Cloud Dataproc, Azure HDInsight, Amazon EMR, Databricks) to solve problems in business intelligence and data analytics efficiently.

  • Introduction to general concepts of batch processing: MapReduce, distributed file systems, differences between MapReduce and distributed database systems, MapReduce workflows, dataflow engines (e.g. Apache Spark), high-level APIs and languages.
  • Introduction to Apache Hadoop and Apache Spark as data processing platforms.
  • Getting to know typical use cases for batch processing systems
  • Introduction to (a) a process model (a project lifecycle) for handling big data projects (different phases, their tasks and goals), (b) the different roles and tasks of those involved in big data projects.
  • Architectures, programming models, challenges in parallel data processing (e.g. error handling)
  • Algorithms and patterns for efficient processing of large amounts of data (e.g. secondary sorting, parallel joins, one-pass algorithms)
  • Best practices when using state-of-the-art data processing platforms
  • Awareness of the challenges that arise when processing data with regard to data protection.
  • Classes of data: personal; sensitive
  • Best practices in using anonymization techniques: solutions and problems
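As a rough illustration of the MapReduce model named in the first topic, the following is a minimal single-process sketch in plain Python. It is not taken from the course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real platform such as Hadoop or Spark would distribute each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (word, 1) pair per word in an input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count` on a list of text lines returns a dictionary that maps each word to its number of occurrences. The sketch only shows the data flow between the phases; partitioning, fault tolerance and distributed storage are deliberately left out.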
Learning outcomes

Students can:

  • name essential concepts which form the basis for batch processing systems and name their tasks.
  • recognize typical use cases for batch processing systems and implement them with the help of state-of-the-art frameworks.
  • name and explain programming paradigms and architectures of distributed and parallel data processing as well as their properties and advantages / disadvantages.
  • assess state-of-the-art frameworks with regard to their properties in practical use.
  • explain the differences / similarities between business intelligence and data analytics.
  • use standardized process models and the appropriate tools in data processing applications.
  • describe the limits of batch processing systems and frameworks and implement their own extensions / solutions (parallel or one-pass algorithms) based on the standard functionalities of the frameworks.
  • assess the importance of data protection in data analysis and aggregation.
  • differentiate between the various classes of data from a data protection perspective, and describe and evaluate techniques for anonymization (data de-identification, differential privacy and cryptographic techniques).
  • point out the problems of the individual techniques, e.g. that anonymized data can become personal data when it is aggregated.
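The re-identification risk mentioned in the last outcome can be illustrated with a small check in the spirit of k-anonymity. This is an assumption made for illustration: k-anonymity is not named in the course description, and the function name `risky_groups` and the record fields in the usage note are hypothetical. The sketch counts how many records share each combination of quasi-identifiers and reports the combinations shared by fewer than `k` records.

```python
from collections import Counter
from typing import Dict, Hashable, List, Mapping, Sequence, Tuple


def risky_groups(
    records: Sequence[Mapping[str, Hashable]],
    quasi_identifiers: List[str],
    k: int,
) -> Dict[Tuple[Hashable, ...], int]:
    """Return quasi-identifier combinations that occur in fewer than k records.

    Records in such small groups may be attributable to individuals
    even though direct identifiers (e.g. names) have been removed.
    """
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}
```

For example, with records that contain hypothetical `zip` and `birth_decade` fields, `risky_groups(records, ["zip", "birth_decade"], k=2)` lists every combination of the two fields that appears in only one record. An empty result means every combination is shared by at least `k` records; it does not by itself guarantee that the data is anonymous.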
Planned learning activities and teaching methods
  • Lecture plus in-class exercises (supplemented with flipped classroom elements)
  • Guided processing of relevant literature
  • Processing (analysis, design, implementation, testing, documenting and presentation) of case studies in small groups
  • Discussions and dialogues in plenary to give and receive effective feedback.
Assessment methods and criteria
  • Evaluation of the exercises (written elaboration / documentation, quality of the solution achieved, ...) and evaluation of the presentation of the exercises: 90%
  • Evaluation of participation in the discourse (e.g. peer feedback): 10%

For a positive grade, a minimum of 50% of the possible points must be achieved across all parts of the examination.

Comment

 None

Recommended or required reading
  • Chambers, Bill; Zaharia, Matei (2018): Spark: The Definitive Guide: Big data processing made simple. O’Reilly UK Ltd.
  • Damji, Jules S. et al. (2020): Learning Spark: Lightning-fast Data Analytics. 2nd Edition. O’Reilly UK Ltd.
  • Karau, Holden; Warren, Rachel (2017): High Performance Spark: Best practices for scaling and optimizing Apache Spark. Sebastopol, CA: O’Reilly UK Ltd.
  • Kleppmann, Martin (2017): Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Revised. Boston: O’Reilly UK Ltd.
  • Lin, Jimmy; Dyer, Chris; Hirst, Graeme (2010): Data-Intensive Text Processing with MapReduce. San Rafael, Calif: Morgan and Claypool Publishers. Available at: https://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
  • Marz, Nathan; Warren, James (2015): Big Data: Principles and best practices of scalable realtime data systems. Shelter Island, NY: Manning.
  • Miner, Donald; Shook, Adam (2012): MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. 1st Ed. Sebastopol, CA: O’Reilly and Associates.
  • Parsian, Mahmoud (2015): Data Algorithms: Recipes for Scaling Up with Hadoop and Spark. 1st Ed. Boston, MA: O’Reilly and Associates.
  • Raghunathan, Balaji (2013): The complete book of data anonymization: from planning to implementation. Boca Raton: CRC Press, Taylor & Francis Group.
  • Ryza, Sandy et al. (2017): Advanced Analytics With Spark: Patterns for Learning from Data at Scale. 2nd Ed. Beijing: O’Reilly Media, Inc, USA.
  • Venkataramanan, Nataraj (2017): Data privacy: principles and practice. Boca Raton, FL: CRC Press.
  • Weber, Rolf H.; Heinrich, Ulrike I. (2012): Anonymization. London: Springer (= SpringerBriefs in cybersecurity).
  • White, Tom (2015): Hadoop: The Definitive Guide. 4th Ed. Beijing: O’Reilly and Associates.
Mode of delivery (face-to-face, distance learning)

Face-to-face event with selected online elements.
