Pydoop has been tested on Gentoo, Ubuntu and CentOS. Although we currently have no information regarding other Linux distributions, we expect Pydoop to work (possibly with some tweaking) on them as well.
We also have a walkthrough for compiling and installing on Apple OS X Mountain Lion.
Other platforms are not supported.
We recommend downloading the latest release from https://sourceforge.net/projects/pydoop/files.
You can also get the latest code from the Git repository:
git clone git://git.code.sf.net/p/pydoop/code pydoop
We also upload our releases to PyPI. After configuring your environment (see below), you should be able to automatically download and install Pydoop from PyPI using pip:
pip install pydoop
Download the latest .deb package from https://sourceforge.net/projects/pydoop/files.
In order to build and install Pydoop, you need the following software: Python 2.7 [1], a Java JDK, Apache Hadoop (or CDH), the Boost.Python library and OpenSSL.
These are also runtime requirements for all cluster nodes. Note that installing Pydoop and your MapReduce application to all cluster nodes (or to an NFS share) is not required: see Installation-free Usage for a complete HowTo.
On Ubuntu you should install the .deb package (see the Get Pydoop section). Currently, we support the following setup:
The Boost Python library is included in the main Ubuntu repository:
sudo apt-get install libboost-python1.46.1
To install CDH4 with mrv1, install hadoop-0.20-conf-pseudo and hadoop-client from the CDH4 repository, following Cloudera’s instructions.
To install Oracle JDK 6, you can follow these instructions. Another option is to create a local repository with oab. Whatever method you choose, make sure that your Java package provides the sun-java6-jdk virtual dependency.
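After installing your chosen Java package, you can verify this by inspecting its Provides field (the package name below is the one used in the walkthrough further down; substitute your own if it differs):

# check that the installed JDK package provides sun-java6-jdk
dpkg -s oracle-java6-installer | grep -i provides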
Finally, on Ubuntu Pydoop depends on the python-support package:
sudo apt-get install python-support
To install Pydoop, run:
sudo dpkg -i <PATH_TO_PYDOOP_DEB_PKG>
The following is a complete walkthrough that merges all of the above instructions (tested on an empty box):
# install canonical dependencies
sudo apt-get install libboost-python1.46.1 python-support
# remove openjdk if necessary
sudo apt-get purge openjdk*
# add repositories for CDH4 and Oracle Java
sudo sh -c "echo 'deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib' > /etc/apt/sources.list.d/cloudera.list"
sudo sh -c "echo 'deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib' >> /etc/apt/sources.list.d/cloudera.list"
sudo apt-get install curl
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:eugenesan/java
sudo apt-get update
# install Oracle Java and CDH4 with mrv1
sudo apt-get install oracle-java6-installer
cd /usr/lib/jvm && sudo ln -s java-6-oracle java-6-sun
sudo apt-get install hadoop-0.20-conf-pseudo hadoop-client
# install Pydoop
sudo dpkg -i <PATH_TO_PYDOOP_DEB_PKG>
Before compiling and installing Pydoop, install all missing dependencies.
On Ubuntu:
sudo apt-get install build-essential python-all-dev libboost-python-dev libssl-dev
On Gentoo:
echo 'dev-libs/boost python' >> /etc/portage/package.use
emerge boost openssl
If you’re using Boost version 1.48 or newer, you may need to specify the name of your Boost.Python library in order to build Pydoop. This is done via the BOOST_PYTHON environment variable. For instance:
export BOOST_PYTHON=boost_python-2.7
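If you are not sure how the library is named on your system, you can list the installed Boost.Python shared objects (library locations vary across distributions; drop the lib prefix and the .so suffix to get the value for BOOST_PYTHON):

# list installed Boost.Python libraries
ls /usr/lib/libboost_python* /usr/lib/*/libboost_python* 2>/dev/null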
Set the JAVA_HOME environment variable to your JDK installation directory, e.g.:
export JAVA_HOME=/usr/local/java/jdk
Note
If you don’t know where your Java home is, try finding the actual path of the java executable and stripping the trailing /jre/bin/java:
$ readlink -f $(which java)
/usr/lib/jvm/java-6-oracle/jre/bin/java
$ export JAVA_HOME=/usr/lib/jvm/java-6-oracle
If you have installed Hadoop from a tarball, set the HADOOP_HOME environment variable so that it points to where the tarball was extracted, e.g.:
export HADOOP_HOME=/opt/hadoop-1.0.4
The above step is not necessary if you installed CDH from dist-specific packages. Build Pydoop with the following commands:
tar xzf pydoop-*.tar.gz
cd pydoop-*
python setup.py build
For a system-wide installation, run the following:
sudo python setup.py install --skip-build
For a user-local installation:
python setup.py install --skip-build --user
The latter installs Pydoop in ~/.local/lib/python2.X/site-packages. This may be a particularly handy solution if your home directory is accessible on the entire cluster.
To install to an arbitrary path:
python setup.py install --skip-build --home <PATH>
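Whichever variant you choose, a quick way to check which copy of Pydoop the interpreter picks up is to print its location (a convenience check, not part of the original instructions):

python -c "import pydoop; print(pydoop.__file__)"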
To build Pydoop on OS X, install the following prerequisites:
Install Boost:
brew install boost --build-from-source
See the common issues section of the Homebrew docs for more info on why we need the --build-from-source switch.
Install Hadoop:
brew install hadoop
You may follow this guide for Hadoop installation and configuration.
Set JAVA_HOME according to your JDK installation, e.g.:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_17.jdk/Contents/Home
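Alternatively, you can query the default JDK location with the java_home utility instead of hard-coding the path:

export JAVA_HOME=$(/usr/libexec/java_home)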
To install Pydoop via Homebrew:
brew tap samueljohn/python
brew install pydoop
To compile and install from source, follow the instructions in the previous section, configuring the environment as follows:
export HADOOP_HOME=/usr/local/Cellar/hadoop/1.1.2/libexec
export BOOST_PYTHON=boost_python-mt
Note
The following instructions apply to installations from tarballs. Running a package-based Hadoop installation together with a “from-tarball” one is neither advised nor supported.
If you’d like to use your Pydoop installation with multiple versions of Hadoop, you will need to rebuild the modules for each of them.
After building Pydoop for the first time following the instructions above, modify your HADOOP-related environment variables to point to the other Hadoop version to be supported, then repeat the build and installation commands.
Example:
tar xzf pydoop-*.tar.gz
cd pydoop-*
export HADOOP_HOME=/opt/hadoop-0.20.2
python setup.py install --user
python setup.py clean --all
export HADOOP_HOME=/opt/hadoop-1.0.4
python setup.py install --user
At run time, the appropriate version of the Pydoop modules is loaded for the Hadoop version selected by your HADOOP_HOME variable. If Pydoop cannot determine your Hadoop home directory from the environment or from standard paths, it falls back to a default location hardwired at compile time: the setup script looks for a file named DEFAULT_HADOOP_HOME in the current working directory and, if the file does not exist, creates it and fills it with the path to the current Hadoop home.
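Based on the behavior described above, you should be able to pin the compile-time fallback by creating the file yourself before building (a sketch; the path is just an example):

# make /opt/hadoop-1.0.4 the hardwired fallback for this build
echo /opt/hadoop-1.0.4 > DEFAULT_HADOOP_HOME
python setup.py build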
Non-standard include/lib directories: the setup script looks for includes and libraries in standard places – read setup.py for details. If some of the requirements are installed in non-standard locations, you need to add them to the search path. Example:
python setup.py build_ext -L/my/lib/path -I/my/include/path -R/my/lib/path
python setup.py build
python setup.py install --skip-build
Alternatively, you can write a small setup.cfg file for distutils:
[build_ext]
include_dirs=/my/include/path
library_dirs=/my/lib/path
rpath=%(library_dirs)s
and then run python setup.py install.
Finally, you can achieve the same result by manipulating the environment. This is particularly useful in the case of automatic download and install with pip:
export CPATH="/my/include/path:${CPATH}"
export LD_LIBRARY_PATH="/my/lib/path:${LD_LIBRARY_PATH}"
pip install pydoop
Hadoop version issues. The Hadoop version selected at compile time is automatically detected based on the output of running hadoop version. If this fails for any reason, you can provide the correct version string through the HADOOP_VERSION environment variable, e.g.:
export HADOOP_VERSION="1.0.4"
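To see what the automatic detection is based on, you can run the same command yourself; the version appears on the first line of the output:

hadoop version | head -n 1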
After Pydoop has been successfully installed, you might want to run unit tests to verify that everything works fine.
IMPORTANT NOTICE: in order to run HDFS tests you must:
make sure that Pydoop is able to detect your Hadoop home and configuration directories. If auto-detection fails, try setting the HADOOP_HOME and HADOOP_CONF_DIR environment variables to the appropriate locations;
one of the test cases checks the connection to an HDFS instance with an explicitly set host and port; if these differ from “localhost” and 9000 (8020 for package-based CDH) respectively, you must set the HDFS_HOST and HDFS_PORT environment variables accordingly (see the example after this list);
start HDFS:
${HADOOP_HOME}/bin/start-dfs.sh
wait until HDFS exits from safe mode:
${HADOOP_HOME}/bin/hadoop dfsadmin -safemode wait
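For instance, on a package-based CDH pseudo-cluster running on the local machine, the host and port settings mentioned in the list above would be:

export HDFS_HOST=localhost
export HDFS_PORT=8020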
To run the unit tests, move to the test subdirectory and run as the cluster superuser (see below):
python all_tests.py
The following HDFS tests may fail if not run by the cluster superuser: capacity, chown and used. To get superuser privileges, you can either start the cluster with your own user account or set the HDFS supergroup to one of your groups by adding a property like the following to hdfs-site.xml:
<property>
  <name>dfs.permissions.supergroup</name>
  <value>admin</value>
</property>
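To pick a suitable value for the property above, list the groups your user belongs to. Alternatively, on package-based CDH installations the HDFS superuser is typically hdfs, so running the test suite as that user may also work (an assumption, not part of the original instructions):

# list the groups your user belongs to; any of them can be used as the supergroup value
groups
# on package-based CDH, the HDFS superuser is usually "hdfs" (assumption)
sudo -u hdfs python all_tests.py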
If you can’t acquire superuser privileges to run the tests, just keep in mind that the failures reported may be due to this reason.
Footnotes
[1] To make Pydoop work with Python 2.6 you need to install the following additional modules: importlib and argparse.