
pydoop.pipes — MapReduce API

This module allows you to write the components of your MapReduce application.

The basic MapReduce components (Mapper, Reducer, RecordReader, etc.) are provided as abstract classes. Application developers must subclass them, providing implementations for all methods called by the framework.
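As a minimal sketch, here is the classic word count application implemented with this module: the mapper emits a <word, "1"> pair for each word in its input value, and the reducer sums the counts emitted for each word (the individual components are documented in detail below):

    import pydoop.pipes as pp

    class WordCountMapper(pp.Mapper):

        def map(self, context):
            # emit a <word, "1"> pair for each word in the input value
            for word in context.getInputValue().split():
                context.emit(word, "1")

    class WordCountReducer(pp.Reducer):

        def reduce(self, context):
            # sum the counts emitted for the current key (word)
            occurrences = 0
            while context.nextValue():
                occurrences += int(context.getInputValue())
            context.emit(context.getInputKey(), str(occurrences))

    if __name__ == "__main__":
        pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))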

class pydoop.pipes.Combiner(context=None)

Works exactly like a Reducer, but aggregation of values is performed locally, on the machine hosting each map task.
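Since the interface is the same as Reducer's, a reducer class can often double as the combiner. A sketch, reusing the word count classes from the example above:

    pp.runTask(pp.Factory(WordCountMapper, WordCountReducer,
                          combiner_class=WordCountReducer))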

close()

Called after the combiner has finished its job.

Overriding this method is not required.

class pydoop.pipes.Factory(mapper_class, reducer_class, record_reader_class=None, record_writer_class=None, combiner_class=None, partitioner_class=None)

Creates MapReduce application components.

The classes to use for each component must be specified as arguments to the constructor.
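For instance, a fully wired word count job, assuming the reader, writer and partitioner classes sketched in the sections below, might be run as:

    pp.runTask(pp.Factory(
        WordCountMapper, WordCountReducer,
        record_reader_class=WordCountReader,
        record_writer_class=WordCountWriter,
        combiner_class=WordCountReducer,
        partitioner_class=WordCountPartitioner,
    ))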

class pydoop.pipes.InputSplit(data)

Represents the data to be processed by an individual Mapper.

Typically, it presents a byte-oriented view of the input, and it is the responsibility of the RecordReader to convert this to a record-oriented view.

The InputSplit is a logical representation of the actual dataset chunk, expressed through the filename, offset and length attributes.

Parameters: data (string) – the byte string returned by MapContext.getInputSplit()
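Typically, the byte string is obtained from the context inside a record reader's constructor. A minimal sketch (see RecordReader below for a complete reader):

    import pydoop.pipes as pp

    class Reader(pp.RecordReader):

        def __init__(self, context):
            super(Reader, self).__init__()
            # deserialize the raw split into its logical description
            self.isplit = pp.InputSplit(context.getInputSplit())
            # self.isplit.filename, self.isplit.offset and
            # self.isplit.length now describe the chunk to process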
class pydoop.pipes.Mapper(context=None)

Maps input key/value pairs to a set of intermediate key/value pairs.

close()

Called after the mapper has finished its job.

Overriding this method is not required.

map(context)

Called once for each key/value pair in the input split. Applications must override this, emitting an output key/value pair through the context.

Parameters: context – the MapContext object passed by the framework, used to get the input key/value pair and emit the output key/value pair.
class pydoop.pipes.Partitioner(context=None)

Controls the partitioning of intermediate keys output by the Mapper. The key (or a subset of it) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the reduce tasks the intermediate key (and hence the record) is sent to for reduction.

partition(key, numOfReduces)

Get the partition number for key given the total number of partitions, i.e., the number of reduce tasks for the job. Applications must override this.

Parameters:
  • key (string) – the key of the key/value pair being dispatched
  • numOfReduces (int) – the total number of reduce tasks
Return type: int
Returns: the partition number for key.
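As an example, a hash-based partitioner that mimics the default Hadoop behavior might be sketched as follows (WordCountPartitioner is a hypothetical name, used for consistency with the other examples):

    import pydoop.pipes as pp

    class WordCountPartitioner(pp.Partitioner):

        def partition(self, key, numOfReduces):
            # dispatch the key according to its hash; in Python,
            # % always yields a value in [0, numOfReduces)
            return hash(key) % numOfReduces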

class pydoop.pipes.RecordReader(context=None)

Breaks the data into key/value pairs for input to the Mapper.

close()

Called after the record reader has finished its job.

Overriding this method is not required.

getProgress()

The current progress of the record reader through its data. Applications must override this.

Return type: float
Returns: the fraction of data read up to now, as a float between 0 and 1.
next()

Called by the framework to provide a key/value pair to the Mapper. Applications must override this.

Return type: tuple
Returns: a tuple of three elements. The first one is a bool, which is True if a record is being read and False otherwise (signaling the end of the input split). The second and third elements are, respectively, the key and the value (as strings).
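Putting the pieces together, here is a sketch of a line-oriented reader that emits <byte offset, line> pairs; it assumes the input file lives in HDFS and is read through the pydoop.hdfs module:

    import pydoop.pipes as pp
    import pydoop.hdfs as hdfs

    class WordCountReader(pp.RecordReader):

        def __init__(self, context):
            super(WordCountReader, self).__init__()
            self.isplit = pp.InputSplit(context.getInputSplit())
            self.bytes_read = 0
            self.file = hdfs.open(self.isplit.filename)
            self.file.seek(self.isplit.offset)
            if self.isplit.offset > 0:
                # the partial line at the split boundary belongs to
                # the previous reader: skip it
                discarded = self.file.readline()
                self.bytes_read += len(discarded)

        def close(self):
            self.file.close()

        def next(self):
            if self.bytes_read > self.isplit.length:  # split exhausted
                return (False, "", "")
            key = str(self.isplit.offset + self.bytes_read)
            record = self.file.readline()
            if record == "":  # end of file
                return (False, "", "")
            self.bytes_read += len(record)
            return (True, key, record)

        def getProgress(self):
            return min(self.bytes_read / float(self.isplit.length), 1.0)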
class pydoop.pipes.RecordWriter(context=None)

Writes the output key/value pairs to an output file.

close()

Called after the record writer has finished its job.

Overriding this method is not required.

emit(key, value)

Writes a key/value pair. Applications must override this.

Parameters:
  • key (string) – a final output key
  • value (string) – a final output value
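A corresponding sketch of a writer that stores tab-separated records in HDFS; mapred.work.output.dir and mapred.task.partition are standard Hadoop job properties:

    import pydoop.pipes as pp
    import pydoop.hdfs as hdfs

    class WordCountWriter(pp.RecordWriter):

        def __init__(self, context):
            super(WordCountWriter, self).__init__(context)
            jc = context.getJobConf()
            out_dir = jc.get("mapred.work.output.dir")
            part = jc.getInt("mapred.task.partition")
            # one output file per reduce task
            self.file = hdfs.open("%s/part-%05d" % (out_dir, part), "w")

        def close(self):
            self.file.close()

        def emit(self, key, value):
            self.file.write("%s\t%s\n" % (key, value))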
class pydoop.pipes.Reducer(context=None)

Reduces a set of intermediate values which share a key to a (possibly) smaller set of values.

close()

Called after the reducer has finished its job.

Overriding this method is not required.

reduce(context)

Called once for each key. Applications must override this, emitting an output key/value pair through the context.

Parameters: context – the ReduceContext object passed by the framework, used to get the input key and corresponding set of values and emit the output key/value pair.
pydoop.pipes.runTask(factory)

Run the assigned task in the framework.

Parameters: factory (Factory) – a Factory instance.
Return type: bool
Returns: True if the task succeeded.

Classes Instantiated by the Framework

The following classes are not meant to be instantiated by the MapReduce programmer. The framework creates the context objects and passes them as parameters to some of the methods (e.g., map, reduce) you define. Through the context, you get access to several objects and methods related to the job’s current configuration and state.

class pydoop.pipes.TaskContext

Provides information about the task and the job. This is the base class for MapContext and ReduceContext.

getJobConf()

Get the current job configuration.

Return type: JobConf
Returns: the job configuration object for the current task

getInputKey()

Get the current input key.

Return type: string
Returns: the current input key
getInputValue()

Get the current input value.

Return type: string
Returns: the current input value
emit(key, value)

Generate an output record.

Parameters:
  • key (string) – the intermediate key
  • value (string) – the intermediate value
progress()

Mark your task as having made progress without changing the status message.

setStatus(status)

Set the status message and call progress.

Parameters: status (string) – the status message
getCounter(group, name)

Register a Counter with the given group and name and return it; the returned object can then be passed to incrementCounter().

Parameters:
  • group (string) – a counter group
  • name (string) – a counter name
incrementCounter(counter, amount)

Increment the value of the counter by the given amount.

Parameters:
  • counter (Counter) – an application counter
  • amount (int) – the increment value
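For example, the word count mapper from the module overview can be extended to report its status and keep a counter of the input words it processes (the group and counter names are arbitrary):

    import pydoop.pipes as pp

    class WordCountMapper(pp.Mapper):

        def __init__(self, context):
            super(WordCountMapper, self).__init__(context)
            context.setStatus("initializing mapper")
            self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

        def map(self, context):
            words = context.getInputValue().split()
            for word in words:
                context.emit(word, "1")
            context.incrementCounter(self.input_words, len(words))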
class pydoop.pipes.MapContext

Provides information about the map task and the job. Inherits from TaskContext.

getInputSplit()

Access the (serialized) InputSplit of the Mapper.

This is a raw byte string that should not be used directly, but rather passed to InputSplit's constructor.

Return type: string
getInputKeyClass()

Get the name of the key class of the input to this task.

Return type: string
getInputValueClass()

Get the name of the value class of the input to this task.

Return type: string
class pydoop.pipes.ReduceContext

Provides information about the reduce task and the job. Inherits from TaskContext.

nextValue()

Advance to the next value in the set of intermediate values for the current key.

Return type: bool
Returns: True if there is another value to process, False otherwise.
class pydoop.pipes.JobConf

A JobConf defines the properties for a job.

hasKey(key)

Return True if key is a configuration parameter.

Parameters: key (string) – the name of a configuration parameter
Return type: bool
Returns: True if key is present in this JobConf object.
get(key)

Get the value of the configuration parameter key.

Parameters: key (string) – the name of a configuration parameter
Return type: string
Returns: the value of the configuration parameter key
getInt(key)

Get the value of the configuration parameter key as an integer.

Parameters: key (string) – the name of a configuration parameter
Return type: int
Returns: the value of the configuration parameter key as an integer.
getFloat(key)

Get the value of the configuration parameter key as a float.

Parameters: key (string) – the name of a configuration parameter
Return type: float
Returns: the value of the configuration parameter key as a float.
getBoolean(key)

Get the value of the configuration parameter key as a boolean.

Parameters: key (string) – the name of a configuration parameter
Return type: bool
Returns: the value of the configuration parameter key as a boolean.
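Typical usage is to read configuration at construction time. In this sketch, mapred.reduce.tasks is a standard Hadoop property, while wordcount.case.sensitive is a hypothetical application-specific one:

    import pydoop.pipes as pp

    class CaseAwareMapper(pp.Mapper):

        def __init__(self, context):
            super(CaseAwareMapper, self).__init__(context)
            jc = context.getJobConf()
            self.n_reduces = jc.getInt("mapred.reduce.tasks")
            # fall back to a default when the property is not set
            if jc.hasKey("wordcount.case.sensitive"):
                self.case_sensitive = jc.getBoolean("wordcount.case.sensitive")
            else:
                self.case_sensitive = False

        def map(self, context):
            line = context.getInputValue()
            if not self.case_sensitive:
                line = line.lower()
            for word in line.split():
                context.emit(word, "1")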
class pydoop.pipes.Counter(id)

Keeps track of a property and its value.

getId()

Return type: int
Returns: the counter id