step: 4 students count the words in the raw data. Time is ticking.
step: 2 groups are formed, the words are counted. One student acts as the reducer (collecting the counts at the end). Time is ticking.
step: 4 groups are formed, the words are counted. One student acts as the reducer (collecting the counts at the end). Time is ticking.
step: 5 groups are formed, the words are counted. One student acts as the reducer (collecting the counts at the end). Time is ticking.
step: 10 groups are formed, the words are counted. One student acts as the reducer (collecting the counts at the end). Time is ticking.
step: 20 groups are formed, the words are counted. One student acts as the reducer (collecting the counts at the end). Time is ticking.
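The classroom steps above can be mimicked in a few lines of Python. This is only a minimal sketch: the function names (`split_into_groups`, `map_count`, `reduce_counts`, `word_count`) and the sample sentence are made up for illustration, and the groups are processed one after another rather than in parallel, so it shows the data flow and not the timing.

```python
from collections import Counter

def split_into_groups(words, n_groups):
    """Deal the words out into n_groups chunks, one chunk per group."""
    return [words[i::n_groups] for i in range(n_groups)]

def map_count(chunk):
    """Map step: one group counts the words in its own chunk."""
    return Counter(chunk)

def reduce_counts(partial_counts):
    """Reduce step: one student adds up the counts from every group."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

def word_count(text, n_groups):
    """Split the text, let each group count, then collect the counts."""
    words = text.lower().split()
    chunks = split_into_groups(words, n_groups)
    return reduce_counts(map_count(chunk) for chunk in chunks)

# Hypothetical raw data; the group sizes follow the steps above.
raw_data = "the cat sat on the mat the dog sat"
for n_groups in (2, 4, 5, 10, 20):
    print(n_groups, dict(word_count(raw_data, n_groups)))
```

Every group size yields the same final counts; what changes in the classroom is how long the counting and the collecting take.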
Classroom Evaluation
Which method took the longest?
What was the most balanced method?
Would more “servers” help?
Using Software
Dealing with Failures
Map worker failure
Map tasks completed or in-progress at the failed worker are reset to idle
Reduce workers are notified when task is rescheduled on another worker
Reduce worker failure
Only in-progress tasks are reset to idle
Reduce task is restarted
Master failure
MapReduce task is aborted and client is notified
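The worker-failure rules above can be sketched as a small bookkeeping routine on the master. This is an illustrative sketch, not the code of any real framework: the `Task` class, the state names and `handle_worker_failure` are assumptions made for this example, and master failure (abort and notify the client) is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None  # worker the task is assigned to

def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks of a failed worker according to the rules above.

    Returns the map tasks that were reset, so that reduce workers can be
    notified once those tasks are rescheduled on another worker.
    """
    reset_maps = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.kind == "map" and task.state in (IN_PROGRESS, COMPLETED):
            # Completed map tasks are reset too, not only in-progress ones.
            task.state, task.worker = IDLE, None
            reset_maps.append(task)
        elif task.kind == "reduce" and task.state == IN_PROGRESS:
            # Only in-progress reduce tasks are reset; completed ones stay.
            task.state, task.worker = IDLE, None
    return reset_maps

# Hypothetical cluster state: worker "w1" fails.
tasks = [
    Task("map", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w1"),
    Task("reduce", COMPLETED, "w1"),
    Task("reduce", IN_PROGRESS, "w1"),
    Task("map", COMPLETED, "w2"),
]
to_notify = handle_worker_failure(tasks, "w1")
for task in tasks:
    print(task.kind, task.state, task.worker)
```

The asymmetry between map and reduce tasks is the point of the sketch: a completed map task is rerun after a failure, while a completed reduce task is left alone.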
How many Map and Reduce tasks?
\(M\) map tasks, \(R\) reduce tasks
Rule of thumb:
Make \(M\) much larger than the number of nodes in the cluster
One DFS chunk per map task is common
Improves dynamic load balancing and speeds up recovery from worker failures
Usually \(R\) is smaller than \(M\)
Because output is spread across \(R\) files
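As a rough illustration of the rule of thumb, \(M\) can be estimated from the input size when each map task reads one DFS chunk. The numbers below (1 TiB of input, 64 MiB chunks, 100 nodes) and the helper name `map_tasks_for` are assumptions chosen for the example, not values from any particular cluster.

```python
def map_tasks_for(input_bytes, chunk_bytes):
    """Number of map tasks when each task reads one DFS chunk (rounded up)."""
    return -(-input_bytes // chunk_bytes)  # integer ceiling division

MIB = 2**20
TIB = 2**40

# Hypothetical job: 1 TiB of input, 64 MiB chunks, 100 nodes.
input_bytes = 1 * TIB
chunk_bytes = 64 * MIB
nodes = 100

M = map_tasks_for(input_bytes, chunk_bytes)
print("M =", M)                      # 16384 map tasks
print("map tasks per node:", M / nodes)
```

With these assumed numbers \(M\) is far larger than the number of nodes, so a slow or failed node only delays a small fraction of the work.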
MapReduce summary
MapReduce is significant for its role in enabling the processing of massive datasets efficiently across distributed computing clusters.
It revolutionized big data processing by providing a scalable and fault-tolerant framework for handling large-scale computations.
Its simplicity and scalability made it accessible to a wide range of industries and applications, from web search engines to scientific research.
MapReduce paved the way for later big data processing frameworks: its concepts and principles have influenced the design of subsequent distributed systems and architectures well beyond its original implementation.