Map Reduce

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

MapReduce

Single Node Architecture

Cluster Architecture

Google operates an estimated 1M servers: http://bit.ly/Shh0RO


Large Scale Computing

  • Challenges:
    • How do you distribute computations?
    • How can we make it easy to write distributed programs?

Machines fail:

  • A single server may stay up for 3 years (~1000 days)
  • With 1000 servers, expect to lose one per day
  • Google estimate: 1M servers \(\rightarrow\) 1000 machines fail EVERY DAY

Issue and Solution

  • Issue: Copying data over a network takes time
  • Idea
    • Bring computation close to the data
    • Store files multiple times for reliability
  • MapReduce addresses these problems
    • Elegant way to work with big data
    • Storage Infrastructure - File System (Google: GFS, Hadoop: HDFS)
    • Programming Model: MapReduce
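The programming model can be sketched in a few lines. The following is a minimal in-process word-count example, illustrating only the model; a real framework runs the map, shuffle, and reduce phases across a cluster:

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle phase: group all intermediate values by key
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(docs))  # {'the': 3, 'quick': 1, ..., 'fox': 2, ...}
```

The user supplies only `map_fn` and `reduce_fn`; the framework handles distribution, grouping, and fault tolerance.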

(D)istributed (F)ile (S)ystem

  • Chunk servers
    • File is split into contiguous chunks
    • Typically each chunk is 16–64 MB
    • Each chunk replicated (usually 2x or 3x)
    • Try to keep replicas in different racks
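The chunking and placement rules above can be sketched as follows. The sizes, server names, and placement policy here are illustrative assumptions, not GFS/HDFS internals:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, upper end of the typical range
REPLICATION = 3

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

def place_replicas(chunk_id, servers, replication=REPLICATION):
    """Pick `replication` servers, one per rack where possible.

    Assumes at least `replication` distinct racks are available.
    """
    by_rack = {}
    for server, rack in servers:
        by_rack.setdefault(rack, []).append(server)
    placement = []
    for rack_servers in by_rack.values():
        # Rotate within a rack so chunks spread across its servers
        placement.append(rack_servers[chunk_id % len(rack_servers)])
        if len(placement) == replication:
            break
    return placement

chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
print(len(chunks))  # 4 chunks: 64 + 64 + 64 + 8 MB
```

Keeping replicas on different racks means a single rack (or its switch) can fail without losing any chunk.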

Chunk Servers

Theory

Classroom

We will simulate MapReduce in the classroom

  1. 4 students each count the words in the raw data. Time is ticking.
  2. 2 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  3. 4 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  4. 5 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  5. 10 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  6. 20 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
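A rough cost model of the exercise: counting gets faster with more groups, but the single reducer must collect one partial tally per group, so collection overhead grows. All timing constants below are made-up assumptions for illustration:

```python
def estimated_time(total_words, groups, seconds_per_word=1.0,
                   seconds_per_collect=5.0):
    """Estimate total wall-clock time for the classroom exercise."""
    map_time = (total_words / groups) * seconds_per_word   # counting in parallel
    reduce_time = groups * seconds_per_collect             # reducer merges each tally
    return map_time + reduce_time

# 400 words of raw data, varying the number of groups
for g in [1, 2, 4, 5, 10, 20]:
    print(g, estimated_time(400, g))
```

With these constants, time falls from 405 s (1 group) to 90 s (10 groups), then rises again at 20 groups: adding "servers" helps only until coordination overhead dominates.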

Classroom Evaluation

  1. Which method took the longest?
  2. What was the most balanced method?
  3. Would more “servers” help?

Using Software

Dealing with Failures

  • Map worker failure
    • Map tasks completed or in progress at the worker are reset to idle
    • Reduce workers are notified when a task is rescheduled on another worker
  • Reduce worker failure
    • Only in-progress tasks are reset to idle
    • The reduce task is restarted
  • Master failure
    • The MapReduce task is aborted and the client is notified
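The asymmetry between map and reduce failures can be made concrete in a small sketch of the master's bookkeeping (task and worker names are illustrative). Completed map output lives on the failed worker's local disk and is lost, whereas completed reduce output is already in the DFS:

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"

class Master:
    def __init__(self, map_tasks, reduce_tasks):
        self.map_state = {t: IDLE for t in map_tasks}
        self.reduce_state = {t: IDLE for t in reduce_tasks}
        self.assignment = {}  # task -> worker

    def handle_map_worker_failure(self, worker):
        # Completed AND in-progress map tasks are reset: their output
        # was on the failed worker's local disk.
        for task, w in self.assignment.items():
            if w == worker and task in self.map_state:
                self.map_state[task] = IDLE

    def handle_reduce_worker_failure(self, worker):
        # Reduce output goes to the DFS, so completed tasks survive;
        # only in-progress reduce tasks are reset.
        for task, w in self.assignment.items():
            if w == worker and self.reduce_state.get(task) == IN_PROGRESS:
                self.reduce_state[task] = IDLE

m = Master(["m1", "m2"], ["r1"])
m.assignment = {"m1": "w1", "m2": "w1", "r1": "w2"}
m.map_state["m1"] = COMPLETED
m.map_state["m2"] = IN_PROGRESS
m.handle_map_worker_failure("w1")
print(m.map_state)  # {'m1': 'idle', 'm2': 'idle'} - both rescheduled
```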

How many Map and Reduce tasks?

\(M\) map tasks, \(R\) reduce tasks

Rule of thumb:

  • Make \(M\) much larger than the number of nodes in the cluster
  • One DFS chunk per map task is common
  • This improves dynamic load balancing and speeds up recovery from worker failures

Usually \(R\) is smaller than \(M\)

  • Because output is spread across \(R\) files
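A back-of-the-envelope sizing following the rule of thumb above; the chunk size, node count, and the choice of \(R\) as half the nodes are illustrative assumptions:

```python
def suggest_tasks(input_bytes, chunk_bytes=64 * 2**20, num_nodes=100):
    """Suggest (M, R) for a job, per the rules of thumb."""
    # One map task per DFS chunk (ceiling division)
    M = -(-input_bytes // chunk_bytes)
    # Keep R smaller than M so output is spread over fewer files
    R = max(1, num_nodes // 2)
    return M, R

M, R = suggest_tasks(input_bytes=1 * 2**40)  # 1 TB of input
print(M, R)  # 16384 map tasks, 50 reduce tasks
```

With 1 TB of input and 64 MB chunks, \(M = 16384\), far more tasks than the 100 nodes, so fast nodes pick up extra work and a failed worker's tasks re-run quickly.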

MapReduce summary

  • MapReduce is significant because it enables efficient processing of massive datasets across distributed computing clusters.
  • It changed big data processing by providing a scalable and fault-tolerant framework for large-scale computations.
  • Its simplicity and scalability made it accessible to a wide range of industries and applications, from web search engines to scientific research.
  • Its concepts and principles influenced the design of subsequent big data frameworks and distributed computing paradigms.
