Map Reduce

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

MapReduce

Single Node Architecture

Cluster Architecture

Google operates an estimated 1M servers: http://bit.ly/Shh0RO


Large Scale Computing

  • Challenges:
    • How do you distribute computations?
    • How can we make it easy to write distributed programs?

Machines fail:

  • A single server may stay up for 3 years (~1000 days)
  • With 1000 servers, expect to lose one per day
  • Google estimate: 1M servers \(\rightarrow\) 1000 machines fail EVERY DAY

Issue and Solution

  • Issue: Copying data over a network takes time
  • Idea
    • Bring computation close to the data
    • Store files multiple times for reliability
  • MapReduce addresses these problems
    • Elegant way to work with big data
    • Storage Infrastructure - File System (Google: GFS, Hadoop: HDFS)
    • Programming Model: MapReduce
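The programming model can be sketched in a few lines. The following is a minimal in-process word-count example, illustrating only the model; a real framework runs the map, shuffle, and reduce phases across a cluster:

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle phase: group all intermediate values by key
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(docs))  # {'the': 3, 'quick': 1, ..., 'fox': 2, ...}
```

The user supplies only `map_fn` and `reduce_fn`; the framework handles distribution, grouping, and fault tolerance.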

(D)istributed (F)ile (S)ystem

  • Chunk servers
    • File is split into contiguous chunks
    • Typically each chunk is 16–64 MB
    • Each chunk replicated (usually 2x or 3x)
    • Try to keep replicas in different racks
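The chunking and placement rules above can be sketched as follows. The sizes, server names, and placement policy here are illustrative assumptions, not GFS/HDFS internals:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, upper end of the typical range
REPLICATION = 3

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

def place_replicas(chunk_id, servers, replication=REPLICATION):
    """Pick `replication` servers, one per rack where possible.

    Assumes at least `replication` distinct racks are available.
    """
    by_rack = {}
    for server, rack in servers:
        by_rack.setdefault(rack, []).append(server)
    placement = []
    for rack_servers in by_rack.values():
        # Rotate within a rack so chunks spread across its servers
        placement.append(rack_servers[chunk_id % len(rack_servers)])
        if len(placement) == replication:
            break
    return placement

chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
print(len(chunks))  # 4 chunks: 64 + 64 + 64 + 8 MB
```

Keeping replicas on different racks means a single rack (or its switch) can fail without losing any chunk.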

Chunk Servers

Theory

Classroom

We will simulate MapReduce in the classroom

  1. 4 students each count the words in the raw data. Time is ticking.
  2. 2 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  3. 4 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  4. 5 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  5. 10 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
  6. 20 groups are formed and the words are counted. One student acts as the reducer, collecting the counts at the end. Time is ticking.
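A rough cost model of the exercise: counting gets faster with more groups, but the single reducer must collect one partial tally per group, so collection overhead grows. All timing constants below are made-up assumptions for illustration:

```python
def estimated_time(total_words, groups, seconds_per_word=1.0,
                   seconds_per_collect=5.0):
    """Estimate total wall-clock time for the classroom exercise."""
    map_time = (total_words / groups) * seconds_per_word   # counting in parallel
    reduce_time = groups * seconds_per_collect             # reducer merges each tally
    return map_time + reduce_time

# 400 words of raw data, varying the number of groups
for g in [1, 2, 4, 5, 10, 20]:
    print(g, estimated_time(400, g))
```

With these constants, time falls from 405 s (1 group) to 90 s (10 groups), then rises again at 20 groups: adding "servers" helps only until coordination overhead dominates.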

Classroom Evaluation

  1. Which method took the longest?
  2. What was the most balanced method?
  3. Would more “servers” help?

Using Software

Dealing with Failures

  • Map worker failure
    • Map tasks completed or in progress at the worker are reset to idle
    • Reduce workers are notified when a task is rescheduled on another worker
  • Reduce worker failure
    • Only in-progress tasks are reset to idle
    • The reduce task is restarted
  • Master failure
    • The MapReduce task is aborted and the client is notified
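The asymmetry between map and reduce failures can be made concrete in a small sketch of the master's bookkeeping (task and worker names are illustrative). Completed map output lives on the failed worker's local disk and is lost, whereas completed reduce output is already in the DFS:

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"

class Master:
    def __init__(self, map_tasks, reduce_tasks):
        self.map_state = {t: IDLE for t in map_tasks}
        self.reduce_state = {t: IDLE for t in reduce_tasks}
        self.assignment = {}  # task -> worker

    def handle_map_worker_failure(self, worker):
        # Completed AND in-progress map tasks are reset: their output
        # was on the failed worker's local disk.
        for task, w in self.assignment.items():
            if w == worker and task in self.map_state:
                self.map_state[task] = IDLE

    def handle_reduce_worker_failure(self, worker):
        # Reduce output goes to the DFS, so completed tasks survive;
        # only in-progress reduce tasks are reset.
        for task, w in self.assignment.items():
            if w == worker and self.reduce_state.get(task) == IN_PROGRESS:
                self.reduce_state[task] = IDLE

m = Master(["m1", "m2"], ["r1"])
m.assignment = {"m1": "w1", "m2": "w1", "r1": "w2"}
m.map_state["m1"] = COMPLETED
m.map_state["m2"] = IN_PROGRESS
m.handle_map_worker_failure("w1")
print(m.map_state)  # {'m1': 'idle', 'm2': 'idle'} - both rescheduled
```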

How many Map and Reduce tasks?

\(M\) map tasks, \(R\) reduce tasks

Rule of thumb:

  • Make \(M\) much larger than the number of nodes in the cluster
  • One DFS chunk per map task is common
  • This improves dynamic load balancing and speeds up recovery from worker failures

Usually \(R\) is smaller than \(M\)

  • Because output is spread across \(R\) files
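A back-of-the-envelope sizing following the rule of thumb above; the chunk size, node count, and the choice of \(R\) as half the nodes are illustrative assumptions:

```python
def suggest_tasks(input_bytes, chunk_bytes=64 * 2**20, num_nodes=100):
    """Suggest (M, R) for a job, per the rules of thumb."""
    # One map task per DFS chunk (ceiling division)
    M = -(-input_bytes // chunk_bytes)
    # Keep R smaller than M so output is spread over fewer files
    R = max(1, num_nodes // 2)
    return M, R

M, R = suggest_tasks(input_bytes=1 * 2**40)  # 1 TB of input
print(M, R)  # 16384 map tasks, 50 reduce tasks
```

With 1 TB of input and 64 MB chunks, \(M = 16384\), far more tasks than the 100 nodes, so fast nodes pick up extra work and a failed worker's tasks re-run quickly.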

MapReduce summary

  • MapReduce is significant because it enables efficient processing of massive datasets across distributed computing clusters.
  • It changed big data processing by providing a scalable and fault-tolerant framework for large-scale computations.
  • Its simplicity and scalability made it accessible to a wide range of industries and applications, from web search engines to scientific research.
  • Its concepts and principles influenced the design of subsequent big data frameworks and distributed computing paradigms.
