Sunday, November 9, 2014

Let's start with Map-Reduce

As this is my first post on Map-Reduce programming, I will try to focus on the important points of the Map-Reduce framework.


Map-Reduce works by breaking the processing into two phases. 
1) The Map phase
2) The Reduce phase
Each phase (i.e. the map phase and the reduce phase) has key-value pairs as input and output.
The types of these key-value pairs are chosen by the programmer as per the requirement.
The programmer also specifies two functions, one for each phase: the map function and the reduce function.
The map function is essentially the data preparation phase, setting up the data in such a way that the reduce function can do its work on it.
Note : The map function is also a good place to drop bad records.
The output from the map function is processed by the Map-Reduce framework before being sent to the reduce function.
Note : This processing sorts and groups the key-value pairs by key.
The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
The map() method is passed an instance of Context, which it uses to write its output.
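As a concrete illustration, here is a minimal sketch of such a Mapper, assuming a hypothetical word-count style job (the class and field names are my own, purely for illustration). It also shows the map function dropping bad records, as mentioned above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Four type parameters: input key, input value, output key, output value.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Drop bad records: here, simply skip empty lines.
        if (line.trim().isEmpty()) {
            return;
        }
        for (String token : line.split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit (word, 1) via the Context
        }
    }
}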
The Reducer class likewise has four formal type parameters that specify the input and output types of the reduce function.
Note : The input types of the reduce function must match the output types of the map function.
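A matching Reducer sketch for the same hypothetical word-count job might look like this. Note that its first two type parameters (Text, IntWritable) match the mapper's output types, and that the framework has already sorted and grouped the values by key before the reduce function sees them:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input types (Text, IntWritable) match the mapper's output types.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All values for this key arrive together, already grouped by the framework.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}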
Job - A Job object forms the specification of the job and gives you control over how the job is run. When we run the job on a Hadoop cluster, we package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specifying the name of the JAR file, we can pass a class to the job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.
In the Job object, we also specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory, or a file pattern.
Note : addInputPath() can be called more than once to use input from multiple paths.
Note : MultipleInputs class supports Map-Reduce jobs that have multiple input paths with a different InputFormat and Mapper for each path.
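For example, a hypothetical job reading from two sources with different formats and mappers might be wired up like this (the paths and the OtherFormatMapper class are assumptions for illustration only):

// inside the driver, after the Job has been created
// imports: org.apache.hadoop.mapreduce.lib.input.MultipleInputs,
//          org.apache.hadoop.mapreduce.lib.input.TextInputFormat,
//          org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
MultipleInputs.addInputPath(job, new Path("/data/source-text"),
        TextInputFormat.class, WordCountMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/source-seq"),
        SequenceFileInputFormat.class, OtherFormatMapper.class);  // hypothetical second mapper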
The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, because Hadoop will complain and not run the job.
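Putting it all together, a minimal driver sketch for the hypothetical word-count job (again, class names and argument order are illustrative, not from any particular example) could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);   // Hadoop finds the JAR containing this class

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // addInputPath() can be called more than once to add multiple input paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // The output directory must not already exist, or the job will not run.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}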
