pydoop.hadut – Run Hadoop Shell Commands

The hadut module provides access to some of the Hadoop functionality that is available via the Hadoop command line shell.

class pydoop.hadut.PipesRunner(prefix=None, logger=None)

Sets up and runs pipes jobs, optionally automating a few common tasks.

Parameters:
  • prefix (string) – if specified, it must be a writable directory path that all nodes can see (visibility from all nodes can be an issue if the local file system is used rather than HDFS)
  • logger (logging.Logger) – optional logger

If prefix is set, the runner object will create a working directory with that prefix and use it to store the job’s input and output; the intended use is quick application testing. If it is not set, you must call set_output() with an HDFS path as its argument, and put will be ignored in your call to set_input(). In any event, the launcher script will be placed in the output directory’s parent, which must be writable for the job to succeed. A usage sketch follows the method descriptions below.

clean()

Remove the working directory, if any.

collect_output(out_file=None)

Run collect_output() on the job’s output directory.

run(**kwargs)

Run the pipes job. Keyword arguments are passed to run_pipes().

set_exe(pipes_code)

Dump launcher code to the distributed file system.

set_input(input_, put=False)

Set the input path for the job. If put is True, copy (local) input_ to the working directory.

set_output(output)

Set the output path for the job. Optional if the runner has been instantiated with a prefix.
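
Putting the above together, a minimal usage sketch (the prefix value, the wc_pipes executable and the input.txt file are illustrative, not part of the API):

>>> import logging
>>> from pydoop.hadut import PipesRunner
>>> runner = PipesRunner(prefix="pydoop_test_", logger=logging.getLogger("wc"))
>>> runner.set_exe(open("wc_pipes").read())   # dump the pipes launcher code to the DFS
>>> runner.set_input("input.txt", put=True)   # copy the local input to the working dir
>>> runner.run()
>>> res = runner.collect_output()             # job output as a single string
>>> runner.clean()                            # remove the working directory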

class pydoop.hadut.PydoopScriptRunner(prefix=None, logger=None)

Specialization of PipesRunner that supports the setup and running of Pydoop Script jobs.

exception pydoop.hadut.RunCmdError(returncode, cmd, output=None)

Raised by run_cmd() and by all functions that use it to indicate that the underlying call failed (returned a non-zero exit code).

pydoop.hadut.collect_output(mr_out_dir, out_file=None)

Return all MapReduce output found in mr_out_dir.

Append the output to out_file if provided. Otherwise, return the result as a single string (it is the caller’s responsibility to ensure that the amount of data retrieved fits into memory).
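
For instance, a minimal sketch (directory and file names are illustrative):

>>> from pydoop.hadut import collect_output
>>> text = collect_output("wc_output")                      # whole output as one string
>>> collect_output("wc_output", out_file="local_copy.txt")  # or append it to a local file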

pydoop.hadut.dfs(args=None, properties=None, hadoop_conf_dir=None)

Run Hadoop dfs/fs.

args and properties are passed to run_cmd().
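
For example, a couple of sketched calls (paths are illustrative):

>>> from pydoop.hadut import dfs
>>> dfs(["-mkdir", "test_in"])         # same as: hadoop fs -mkdir test_in
>>> listing = dfs(["-ls", "test_in"])  # the shell command's output, as returned by run_cmd()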

pydoop.hadut.find_jar(jar_name, root_path=None)

Look for the named jar in:

  1. root_path, if specified
  2. the current working directory (PWD)
  3. ${PWD}/build
  4. /usr/share/java

Return the full path of the jar if found; else return None.
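
For example (the jar name and root_path are illustrative; only the four locations listed above are searched):

>>> from pydoop.hadut import find_jar
>>> jar = find_jar("hadoop-examples.jar", root_path="/opt/hadoop")  # None if not found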

pydoop.hadut.get_num_nodes(properties=None, hadoop_conf_dir=None, offline=False)

Get the number of task trackers in the Hadoop cluster.

properties is passed to get_task_trackers().

pydoop.hadut.get_task_trackers(properties=None, hadoop_conf_dir=None, offline=False)

Get the list of task trackers in the Hadoop cluster.

Each element in the returned list is in the (host, port) format. properties is passed to run_cmd().

If offline is True, try getting the list of task trackers from the ‘slaves’ file in Hadoop’s configuration directory (no attempt is made to contact the Hadoop daemons). In this case, ports are set to 0.
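
A quick sketch of both calls (return values shown in the comments are illustrative):

>>> from pydoop.hadut import get_task_trackers, get_num_nodes
>>> trackers = get_task_trackers(offline=True)  # e.g. [('node1', 0), ('node2', 0)]
>>> n = get_num_nodes(offline=True)             # number of task trackers in the cluster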

pydoop.hadut.path_exists(path, properties=None, hadoop_conf_dir=None)

Return True if path exists in the default HDFS, else False.

properties is passed to dfs().

This function does the same thing as hdfs.path.exists, but it uses a wrapper for the Hadoop shell rather than the hdfs extension.
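
For example (the path is illustrative):

>>> from pydoop.hadut import path_exists
>>> path_exists("some/nonexistent/path")
False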

pydoop.hadut.run_class(class_name, args=None, properties=None, classpath=None, hadoop_conf_dir=None, logger=None)

Run a class that needs the Hadoop jars in its class path.

args and properties are passed to run_cmd().

>>> cls = 'org.apache.hadoop.hdfs.tools.DFSAdmin'
>>> print run_class(cls, args=['-help', 'report'])
-report: Reports basic filesystem information and statistics.

pydoop.hadut.run_cmd(cmd, args=None, properties=None, hadoop_home=None, hadoop_conf_dir=None, logger=None)

Run a Hadoop command.

If the command succeeds, return its output; if it fails, raise a RunCmdError with its error output as the message.

>>> import uuid
>>> properties = {'dfs.block.size': 32*2**20}
>>> args = ['-put', 'hadut.py', uuid.uuid4().hex]
>>> res = run_cmd('fs', args, properties)
>>> res
''
>>> print run_cmd('dfsadmin', ['-help', 'report'])
-report: Reports basic filesystem information and statistics.
>>> try:
...     run_cmd('foo')
... except RunCmdError as e:
...     print e
...
Exception in thread "main" java.lang.NoClassDefFoundError: foo
...

pydoop.hadut.run_jar(jar_name, more_args=None, properties=None, hadoop_conf_dir=None)

Run a jar on Hadoop (hadoop jar command).

more_args (after prepending jar_name) and properties are passed to run_cmd().

>>> import glob, pydoop
>>> hadoop_home = pydoop.hadoop_home()
>>> v = pydoop.hadoop_version_info()
>>> if v.cdh >= (4, 0, 0): hadoop_home += '-0.20-mapreduce'
>>> jar_name = glob.glob('%s/*examples*.jar' % hadoop_home)[0]
>>> more_args = ['wordcount']
>>> try:
...     run_jar(jar_name, more_args=more_args)
... except RunCmdError as e:
...     print e
...
Usage: wordcount <in> <out>

pydoop.hadut.run_pipes(executable, input_path, output_path, more_args=None, properties=None, force_pydoop_submitter=False, hadoop_conf_dir=None, logger=None)

Run a pipes command.

more_args (after setting input/output path) and properties are passed to run_cmd().

If not specified otherwise, this function sets the properties hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter to ‘true’.

This function works around a bug in Hadoop pipes that affects security-enabled Hadoop versions when the local file system is used as the default FS (i.e., no HDFS); see https://issues.apache.org/jira/browse/MAPREDUCE-4000. In those setups, the function uses Pydoop’s own pipes submitter application. You can force the use of Pydoop’s submitter by passing force_pydoop_submitter=True.
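
As a minimal sketch, a direct call could look like the following, assuming the pipes executable has already been uploaded to HDFS and the input directory exists (all names and property values are illustrative):

>>> from pydoop.hadut import run_pipes
>>> run_pipes(
...     "bin/wc_pipes",     # HDFS path of the pipes executable
...     "wc_input",         # input path
...     "wc_output",        # output path (must not already exist)
...     properties={"mapred.reduce.tasks": "2"},
... )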