Dean and Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters
Fri Feb 05 2021
tags: draft programming computer science self study notes public 6.824 MIT distributed systems
This paper introduces MapReduce.
I read this paper as part of my self-study of 6.824 with NUS SoC folks. It may or may not be part of my goal to finish a Stanford CS degree in a year.
Overview of the paper
Interesting things about the paper
How Map workers report intermediate file locations
For some reason, the paper has Map workers send the list of generated intermediate file locations back to the master, but doesn't do the same for Reduce workers. I get it: there's no real need for Reduce workers to report back, since Reduce output files are saved on a shared file system while Map intermediate files live on the individual workers' local storage. The drawback is that the master has to do extra work to repeatedly check for the presence of output files. Comparing my implementation to the paper's design, I think this choice could be about saving unnecessary network I/O.
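A minimal sketch of what this reporting might look like. The paper doesn't specify the RPC types; `TaskDoneArgs`, `Master`, and the `mr-X-Y` file naming are assumptions borrowed from the usual 6.824 lab conventions, not the paper's actual API.

```go
package main

import "fmt"

// TaskDoneArgs is what a Map worker might send to the master when it
// finishes: its task ID plus the location of each of the R intermediate
// files it wrote to its local disk. (Hypothetical names, for illustration.)
type TaskDoneArgs struct {
	TaskID    int
	Locations []string // Locations[r] = intermediate file for reduce partition r
}

// Master records the locations so it can later hand them to the Reduce
// workers, which fetch the files directly from the Map workers' disks.
type Master struct {
	intermediate map[int][]string // reduce partition -> intermediate file locations
}

// MapTaskDone is the handler the worker's completion RPC would invoke.
func (m *Master) MapTaskDone(args TaskDoneArgs) {
	for r, loc := range args.Locations {
		m.intermediate[r] = append(m.intermediate[r], loc)
	}
}

func main() {
	m := &Master{intermediate: make(map[int][]string)}
	// A worker finished map task 3 with R = 2 reduce partitions.
	m.MapTaskDone(TaskDoneArgs{TaskID: 3, Locations: []string{"mr-3-0", "mr-3-1"}})
	fmt.Println(m.intermediate[0]) // locations for reduce partition 0
}
```

Reduce workers need no equivalent RPC: their output lands on the shared file system under a well-known name, so the master can discover it by checking for the file instead.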
How MapReduce avoids inconsistency
The problem is that writes to the shared file system are not atomic. Suppose you have a machine that crashes partway through writing a Reduce output file, or backup executions of the same task writing concurrently: anyone checking for the output could observe a partially written file.
The paper prevents the master from observing partially written files by having workers write to temporary output files and then atomically renaming them to their final names on completion.