This module allows you to write the components of your MapReduce application.
The basic MapReduce components (Mapper, Reducer, RecordReader, etc.) are provided as abstract classes. Application developers must subclass them, providing implementations for all methods called by the framework.
Works exactly like a Reducer, but value aggregation is performed locally on the machine hosting each map task.
Called after the combiner has finished its job.
Overriding this method is not required.
Creates MapReduce application components.
The classes to use for each component must be specified as arguments to the constructor.
Represents the data to be processed by an individual Mapper.
Typically, it presents a byte-oriented view of the input, and it is the responsibility of the RecordReader to convert this to a record-oriented view.
The InputSplit is a logical representation of the actual dataset chunk, expressed through the filename, offset and length attributes.
Parameters: | data (string) – the byte string returned by MapContext.getInputSplit() |
---|
Maps input key/value pairs to a set of intermediate key/value pairs.
Called after the mapper has finished its job.
Overriding this method is not required.
Called once for each key/value pair in the input split. Applications must override this, emitting an output key/value pair through the context.
Parameters: | context – the MapContext object passed by the framework, used to get the input key/value pair and emit the output key/value pair. |
---|
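
As a minimal sketch of the map contract described above, the following word-count mapper reads the input value from the context and emits one pair per word. `FakeMapContext` is a hypothetical stand-in written only for this example, and the method names `getInputKey`, `getInputValue` and `emit` are assumptions modeled on the context operations described in this reference. In a real application the mapper would subclass the module's Mapper class and the framework would supply the context.

```python
class FakeMapContext:
    """Hypothetical stand-in for the framework's MapContext (illustration only)."""

    def __init__(self, key, value):
        self._key = key
        self._value = value
        self.emitted = []  # collects output pairs instead of handing them to a framework

    def getInputKey(self):
        return self._key

    def getInputValue(self):
        return self._value

    def emit(self, key, value):
        self.emitted.append((key, value))


class WordCountMapper:
    """Sketch of a mapper; a real one would subclass the framework's Mapper."""

    def map(self, context):
        # Called once per input record: split the value into words and
        # emit a (word, "1") pair for each of them.
        for word in context.getInputValue().split():
            context.emit(word, "1")

    def close(self):
        # Optional hook; there is nothing to clean up in this sketch.
        pass


ctx = FakeMapContext("0", "the quick the")
WordCountMapper().map(ctx)
print(ctx.emitted)
```

Keeping the context as the only channel for input and output means the same `map` body can be exercised against a stand-in like this one before it is run inside a job.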
Controls the partitioning of intermediate keys output by the Mapper. The key (or a subset of it) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
Get the partition number for key given the total number of partitions, i.e., the number of reduce tasks for the job. Applications must override this.
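Parameters: | key – the key to be partitioned; the total number of partitions, i.e., the number of reduce tasks for the job |
---|---|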
Return type: | int |
Returns: | the partition number for key. |
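
The hash-based partitioning mentioned above can be sketched as follows. This is an illustrative stand-alone class, not the module's own Partitioner; the parameter name `num_partitions` and the choice of CRC32 as the hash function are assumptions made for this example. A real partitioner would subclass the module's Partitioner class.

```python
import zlib


class HashPartitioner:
    """Sketch of a partitioner; a real one would subclass the framework's Partitioner."""

    def partition(self, key, num_partitions):
        # zlib.crc32 gives the same result in every process, unlike the
        # built-in hash() on strings, which is randomized per interpreter run.
        # The modulo keeps the result in the range [0, num_partitions).
        return zlib.crc32(key.encode("utf-8")) % num_partitions


partitioner = HashPartitioner()
for word in ["apple", "banana", "cherry"]:
    print(word, partitioner.partition(word, 4))
```

A deterministic hash matters here because every map task must send a given key to the same reduce task.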
Breaks the data into key/value pairs for input to the Mapper.
Called after the record reader has finished its job.
Overriding this method is not required.
The current progress of the record reader through its data. Applications must override this.
Return type: | float |
---|---|
Returns: | the fraction of data read up to now, as a float between 0 and 1. |
Called by the framework to provide a key/value pair to the Mapper. Applications must override this.
Return type: | tuple |
---|---|
Returns: | a tuple of three elements. The first is a bool that is True if a record has been read and False otherwise (signaling the end of the input split). The second and third elements are, respectively, the key and the value (as strings). |
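
To illustrate the three-element return value and the progress fraction described above, here is a minimal line-oriented reader over an in-memory byte string. The method names `next`, `getProgress` and `close` are assumptions, as is the choice of the byte offset as the key; a real reader would subclass the module's RecordReader class and read from the actual input split.

```python
import io


class LineRecordReader:
    """Sketch of a record reader; a real one would subclass the framework's RecordReader."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)
        self._length = len(data)
        self._bytes_read = 0

    def next(self):
        # The key is the byte offset of the line, the value is its text.
        offset = self._bytes_read
        line = self._stream.readline()
        if not line:
            # End of input: the first element is False, key and value are empty.
            return (False, "", "")
        self._bytes_read += len(line)
        return (True, str(offset), line.rstrip(b"\n").decode("utf-8"))

    def getProgress(self):
        # Fraction of the data consumed so far, between 0 and 1.
        if self._length == 0:
            return 1.0
        return min(float(self._bytes_read) / self._length, 1.0)

    def close(self):
        self._stream.close()


reader = LineRecordReader(b"a\nbb\n")
first = reader.next()
progress_after_first = reader.getProgress()
second = reader.next()
last = reader.next()
print(first, second, last, progress_after_first)
```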
Writes the output key/value pairs to an output file.
Called after the record writer has finished its job.
Overriding this method is not required.
Writes a key/value pair. Applications must override this.
Parameters: | key – the key to write; value – the value to write |
---|
Reduces a set of intermediate values which share a key to a (possibly) smaller set of values.
Called after the reducer has finished its job.
Overriding this method is not required.
Called once for each key. Applications must override this, emitting an output key/value pair through the context.
Parameters: | context – the ReduceContext object passed by the framework, used to get the input key and corresponding set of values and emit the output key/value pair. |
---|
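
The reduce contract can be sketched in the same way as the map one. In this example, `FakeReduceContext` is a hypothetical stand-in for the framework's ReduceContext, and the method names `getInputKey`, `getInputValue`, `nextValue` and `emit` are assumptions modeled on the context operations described in this reference. A real reducer would subclass the module's Reducer class.

```python
class FakeReduceContext:
    """Hypothetical stand-in for the framework's ReduceContext (illustration only)."""

    def __init__(self, key, values):
        self._key = key
        self._values = list(values)
        self._index = -1
        self.emitted = []

    def getInputKey(self):
        return self._key

    def nextValue(self):
        # Advance to the next value; False means the values are exhausted.
        self._index += 1
        return self._index < len(self._values)

    def getInputValue(self):
        return self._values[self._index]

    def emit(self, key, value):
        self.emitted.append((key, value))


class SumReducer:
    """Sketch of a reducer; a real one would subclass the framework's Reducer."""

    def reduce(self, context):
        # Called once per key: iterate over the values and emit their sum.
        total = 0
        while context.nextValue():
            total += int(context.getInputValue())
        context.emit(context.getInputKey(), str(total))


rctx = FakeReduceContext("the", ["1", "1", "1"])
SumReducer().reduce(rctx)
print(rctx.emitted)
```

Because summation is associative and commutative, a class with this body could also serve as a combiner, aggregating values on the map side before they are sent to the reducers.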
Run the assigned task in the framework.
Parameters: | factory (Factory) – a Factory instance. |
---|---|
Return type: | bool |
Returns: | True if the task succeeded. |
The following classes are not accessible to the MapReduce programmer. The framework instantiates context objects and passes them as parameters to some of the methods (e.g., map, reduce) you define. Through the context, you get access to several objects and methods related to the job’s current configuration and state.
Provides information about the task and the job. This is the base class for MapContext and ReduceContext.
Get the current job configuration.
Return type: | JobConf |
---|---|
Returns: | the job configuration object |
Get the JobConf for the current task.
Get the current input key.
Return type: | string |
---|---|
Returns: | the current input key |
Get the current input value.
Return type: | string |
---|---|
Returns: | the current input value |
Generate an output record.
Parameters: | key – the output key; value – the output value |
---|
Mark your task as having made progress without changing the status message.
Set the status message and call progress.
Parameters: | status (string) – the status message |
---|
Register a Counter with the given group and name.
Parameters: | group – the counter group; name – the counter name |
---|
Increment the value of the counter by the given amount.
Parameters: | counter – the counter to increment; amount – the amount to add |
---|
Provides information about the map task and the job. Inherits from TaskContext.
Access the (serialized) InputSplit of the Mapper.
This is a raw byte string that should not be used directly, but rather passed to InputSplit's constructor.
Return type: | string |
---|
Get the name of the key class of the input to this task.
Return type: | string |
---|
Get the name of the value class of the input to this task.
Return type: | string |
---|
Provides information about the reduce task and the job. Inherits from TaskContext.
Advance to the next value.
Return type: | bool |
---|
A JobConf defines the properties for a job.
Return True if key is a configuration parameter.
Parameters: | key (string) – the name of a configuration parameter |
---|---|
Return type: | bool |
Returns: | True if key is present in this JobConf object. |
Get the value of the configuration parameter key.
Parameters: | key (string) – the name of a configuration parameter |
---|---|
Return type: | string |
Returns: | the value of the configuration parameter key |
Get the value of the configuration parameter key as an integer.
Parameters: | key (string) – the name of a configuration parameter |
---|---|
Return type: | int |
Returns: | the value of the configuration parameter key as an integer. |
Get the value of the configuration parameter key as a float.
Parameters: | key (string) – the name of a configuration parameter |
---|---|
Return type: | float |
Returns: | the value of the configuration parameter key as a float. |
Get the value of the configuration parameter key as a boolean.
Parameters: | key (string) – the name of a configuration parameter |
---|---|
Return type: | bool |
Returns: | the value of the configuration parameter key as a boolean. |
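
The typed getters above can be illustrated with a dictionary-backed stand-in. `FakeJobConf` and its method names (`hasKey`, `get`, `getInt`, `getFloat`, `getBoolean`) are assumptions made for this sketch, as are the parameter names in the example and the rule that only the string "true" (case-insensitive) counts as a true boolean. The real JobConf is created by the framework and obtained through the task context.

```python
class FakeJobConf:
    """Hypothetical stand-in for JobConf; all values are stored as strings."""

    def __init__(self, params):
        self._params = dict(params)

    def hasKey(self, key):
        return key in self._params

    def get(self, key):
        return self._params[key]

    def getInt(self, key):
        return int(self.get(key))

    def getFloat(self, key):
        return float(self.get(key))

    def getBoolean(self, key):
        # Assumption: only "true" (in any letter case) is treated as True.
        return self.get(key).strip().lower() == "true"


conf = FakeJobConf({
    "example.num.buckets": "8",    # hypothetical parameter names
    "example.threshold": "0.25",
    "example.verbose": "true",
})

# Check for a parameter before reading it, falling back to a default.
buckets = conf.getInt("example.num.buckets") if conf.hasKey("example.num.buckets") else 1
threshold = conf.getFloat("example.threshold")
verbose = conf.getBoolean("example.verbose")
print(buckets, threshold, verbose)
```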