Pydoop Submit User Guide

Pydoop applications are run via the pydoop submit command. To start, you will need a working Hadoop cluster. If you don’t have one available, you can bring up a single-node Hadoop cluster on your machine – see the Hadoop web site for instructions.

If your application is contained in a single (local) file named wc.py, with an entry point called __main__ (see Writing Full-Featured Applications), you can run it as follows:

pydoop submit --upload-file-to-cache wc.py wc input output

where input (file or directory) and output (directory) are HDFS paths. Note that the output directory will not be overwritten: instead, an error will be generated if it already exists when you launch the program.
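For reference, the following is a rough sketch of what such a wc.py might contain, modeled on the word count example from Writing Full-Featured Applications; exact module and class names depend on your Pydoop version, so treat it as illustrative rather than a drop-in implementation:

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes


class Mapper(api.Mapper):

    def map(self, context):
        # the input value is a line of text; emit (word, 1) for each word
        for word in context.value.split():
            context.emit(word, 1)


class Reducer(api.Reducer):

    def reduce(self, context):
        # sum the counts collected for each word
        context.emit(context.key, sum(context.values))


def __main__():
    # entry point invoked by the launcher script generated by pydoop submit
    pipes.run_task(pipes.Factory(Mapper, reducer_class=Reducer))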

If your entry point has a different name, specify it via --entry-point.
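For instance, if the entry point function were named main instead of __main__ (a hypothetical name used here only for illustration), the command would look like this:

pydoop submit --upload-file-to-cache wc.py --entry-point main wc input output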

The following command line options are available for pydoop submit:

--num-reducers
    Number of reduce tasks. Specify 0 to perform the map phase only.
--no-override-home
    Don't set the script's HOME directory to the $HOME in your environment. Hadoop will set it to the value of the mapreduce.admin.user.home.dir property.
--no-override-env
    Use the default Python executable and environment instead of overriding HOME, LD_LIBRARY_PATH and PYTHONPATH.
-D
    Set a Hadoop property, e.g., -D mapred.compress.map.output=true.
--python-zip
    Additional Python zip file.
--upload-file-to-cache
    Upload and add this file to the distributed cache.
--upload-archive-to-cache
    Upload and add this archive to the distributed cache.
--log-level
    Logging level.
--job-name
    Name of the job.
--python-program
    Python executable to be used by the wrapper.
--pretend
    Do not actually submit a job; print the generated configuration settings and the command line that would be invoked.
--hadoop-conf
    Hadoop configuration file.
--disable-property-name-conversion
    Do not adapt property names to the Hadoop version in use.
--mrv2
    Use the MapReduce v2 Hadoop Pipes framework. InputFormat and OutputFormat classes should be mrv2-compliant.
--local-fs
    Use a patched Pipes submitter to sidestep a Hadoop security bug triggered when using local file systems.
--do-not-use-java-record-reader
    Disable the Java RecordReader.
--do-not-use-java-record-writer
    Disable the Java RecordWriter.
--input-format
    Java class name of the InputFormat. The default value depends on the MapReduce version used.
--output-format
    Java class name of the OutputFormat. The default value depends on the MapReduce version used.
--job-conf
    Set a Hadoop property, e.g., mapreduce.compress.map.output=true.
--libjars
    Additional comma-separated list of jar files.
--cache-file
    Add this HDFS file to the distributed cache as a file.
--cache-archive
    Add this HDFS archive to the distributed cache as an archive.
--entry-point
    Explicitly execute MODULE.ENTRY_POINT() in the launcher script.
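Several of these options can be combined in a single invocation. The following sketch reuses the property name shown above; the remaining values (number of reducers, job name, log level) are arbitrary examples:

pydoop submit \
    --num-reducers 4 \
    -D mapred.compress.map.output=true \
    --log-level INFO \
    --job-name wordcount \
    --upload-file-to-cache wc.py \
    wc input output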

Setting the Environment for your Program

When working on a shared cluster where you don't have root access, you might have a lot of software installed in non-standard locations, such as your home directory. Since non-interactive ssh connections do not usually preserve your environment, you might lose essential settings such as LD_LIBRARY_PATH.

For this reason, by default pydoop submit copies some environment variables to the driver script that launches the job on Hadoop. If this behavior is not desired, you can disable it via the --no-override-env command line option.
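For instance, to keep the default environment and explicitly point the wrapper at a specific interpreter, you could do something along these lines (the interpreter path below is purely hypothetical):

pydoop submit \
    --no-override-env \
    --python-program /opt/myenv/bin/python \
    --upload-file-to-cache wc.py \
    wc input output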