
Saturday, July 25, 2015

Installing RHadoop on Ubuntu

In this post, we will install and configure RHadoop on Ubuntu. RHadoop is a collection of R packages that interact with a Hadoop environment and can issue MapReduce jobs. The prerequisites are, of course, R and Hadoop.

1. Installing Hadoop - If you don't have Hadoop installed on your machine, you can follow the steps listed in my post to set up a pseudo-distributed cluster on a single machine. You will also need HBase set up on your machine on top of HDFS. This post walks through the steps to set up HBase in pseudo-distributed mode using HDFS.


2. Installing R - If you don't already have R set up on your machine, you can follow steps in this post to set R up.

or

3. Upgrading R - If you don't have the latest version of R on your machine, it may be a good idea to upgrade it by following these steps.

Next we set up RHadoop. The overall steps are as follows.

1. Configuring R shell environment to access Hadoop
2. Configuring R library paths
3. Installing R libraries

We will go through these steps in the rest of this post.

1. Configuring R shell environment to access Hadoop

The first step is to add some configuration to the bash environment.

Let's open the configuration file in an editor. The file belongs to your own user, so sudo is not needed.

$ gedit ~/.bashrc



Next, add the following lines to the end of the file. They assume that HADOOP_HOME is already set to your Hadoop installation folder.

export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar


The next step is to apply these environment settings and check that the environment variables have been set up

$ source ~/.bashrc
$ echo $HADOOP_STREAMING
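
Echoing a variable only shows that it is set, not that it points at a real file. Here is a minimal sketch of a fuller check, assuming the variables exported above. The function name check_hadoop_env is made up for this example and is not part of Hadoop or RHadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop or RHadoop): report whether the
# variables exported in ~/.bashrc point at files that actually exist.
check_hadoop_env() {
    status=0
    # HADOOP_CMD must be an executable file.
    if [ ! -x "${HADOOP_CMD:-}" ]; then
        echo "HADOOP_CMD is unset or not executable: ${HADOOP_CMD:-<unset>}"
        status=1
    fi
    # HADOOP_STREAMING must be a readable jar file.
    if [ ! -r "${HADOOP_STREAMING:-}" ]; then
        echo "HADOOP_STREAMING is unset or not readable: ${HADOOP_STREAMING:-<unset>}"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "Hadoop environment looks OK"
    fi
    return "$status"
}

# Run the check in the current shell; "|| true" keeps a failed check from
# aborting a calling script.
check_hadoop_env || true
```

If the streaming jar is reported as unreadable, the version number in the file name probably differs from the Hadoop version you installed.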





2. Configuring R library paths

The next step is to make sure that the library path is set up correctly, so that any installed packages will be available to all users. Start R and enter the following command

$ sudo R
> .libPaths()

Make sure the paths point to /usr/local or similar. If they do not, point the installs at such a folder by setting a destination path (the lib argument) in the install command. On my machine, I did not have to change anything.





3. Installing R libraries

Next, we need to download the needed packages.

I created a folder called rhadoop in my home folder and downloaded the following packages into it using the wget command: rJava from CRAN and the RHadoop packages from GitHub

wget http://cran.r-project.org/src/contrib/rJava_0.9-6.tar.gz
wget https://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz
wget https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.8.tar.gz?raw=true
wget https://github.com/RevolutionAnalytics/rhbase/blob/master/build/rhbase_1.2.1.tar.gz?raw=true
wget https://github.com/RevolutionAnalytics/ravro/blob/1.0.4/build/ravro_1.0.4.tar.gz?raw=true
wget https://github.com/RevolutionAnalytics/plyrmr/releases/download/0.6.0/plyrmr_0.6.0.tar.gz
mv rhdfs_1.0.8.tar.gz\?raw\=true rhdfs_1.0.8.tar.gz
mv rhbase_1.2.1.tar.gz\?raw\=true rhbase_1.2.1.tar.gz
mv ravro_1.0.4.tar.gz\?raw\=true ravro_1.0.4.tar.gz
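
The rename steps above are needed because wget keeps the "?raw=true" query string in the file name. A small sketch of one way around that, using wget's -O option with a name derived from the URL. The helper target_name is hypothetical, and the loop only prints the commands so that it can be tried without touching the network.

```shell
#!/bin/sh
# Hypothetical helper: turn a download URL into a local file name by
# dropping any query string (such as "?raw=true") and the leading path.
target_name() {
    url_without_query=${1%%\?*}
    basename "$url_without_query"
}

# Print the wget command for each GitHub "blob" URL used above.
# Remove the "echo" to actually download.
for url in \
    "https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.8.tar.gz?raw=true" \
    "https://github.com/RevolutionAnalytics/rhbase/blob/master/build/rhbase_1.2.1.tar.gz?raw=true" \
    "https://github.com/RevolutionAnalytics/ravro/blob/1.0.4/build/ravro_1.0.4.tar.gz?raw=true"
do
    echo wget -O "$(target_name "$url")" "$url"
done
```

Quoting the URLs also stops the shell from treating the "?" as a wildcard.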






Next we need to update R's Java configuration so that it picks up the installed JDK

$ sudo R CMD javareconf



Next, we will install the rJava package that we downloaded

$ sudo R

install.packages("~/rhadoop/rJava_0.9-6.tar.gz", repos=NULL, type="source")

Next step is to install some pre-requisite R packages

> install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"))





Next, we install the data.table package for handling large datasets.

> install.packages("data.table")






You need to have thrift installed on your machine, since rhbase is built against it. If you don't, follow the steps in my post to install thrift on your Ubuntu machine. The thrift installation posed the biggest challenge for me.

Next we will install the RHadoop packages in R. Enter the following commands on the R command prompt.

Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")

install.packages("~/rhadoop/rmr2_3.3.1.tar.gz", repos=NULL, type="source")
install.packages("~/rhadoop/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("~/rhadoop/plyrmr_0.6.0.tar.gz", repos=NULL, type="source")
install.packages("~/rhadoop/rhbase_1.2.1.tar.gz", repos=NULL, type="source")
install.packages("~/rhadoop/ravro_1.0.4.tar.gz", repos=NULL, type="source")



That's it, RHadoop is installed. To make sure everything is working, start R in normal mode (without sudo) and issue the following commands

> Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
> library(rmr2)
> library(rhdfs)
> hdfs.init()





Now the R environment is ready to use the HDFS environment.


Install Thrift on Ubuntu

Apache Thrift is a software project spanning a variety of programming languages and use cases. The goal is to make reliable, performant communication and data serialization across languages as efficient and seamless as possible.

Here are the steps to install thrift on your Ubuntu machine

sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev



Next, I created a folder to download thrift into, as follows

$ cd ~/Work/Servers
$ mkdir thrift
$ cd thrift
$ wget http://apache.mirror.vexxhost.com/thrift/0.9.2/thrift-0.9.2.tar.gz


$ tar -xvzf thrift-0.9.2.tar.gz




Next, we run the commands to build thrift. The configure script lives inside the extracted folder, so change into it first.

$ cd thrift-0.9.2
$ ./configure

After configure, we need to make a minor change to one of the generated files.

$ cd lib/cpp
$ gedit thrift.pc

Change the following line

includedir=${prefix}/include
to
includedir=${prefix}/include/thrift



Then go back to the top of the source tree, build and install.

$ cd ../..
$ make
$ sudo make install
$ thrift --help
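
The thrift.pc change described above can also be scripted instead of made in an editor. This is only a sketch: the helper name fix_thrift_includedir is made up, and the demonstration runs on a scratch file rather than the real lib/cpp/thrift.pc. On a real build it would have to run after configure and before make install.

```shell
#!/bin/sh
# Hypothetical helper: append "/thrift" to the includedir line of a
# pkg-config file, leaving the file untouched if it was already changed.
fix_thrift_includedir() {
    pc_file=$1
    # The single quotes keep ${prefix} literal; it is pkg-config syntax,
    # not a shell variable.
    sed 's|^includedir=${prefix}/include$|includedir=${prefix}/include/thrift|' \
        "$pc_file" > "$pc_file.tmp" &&
        cat "$pc_file.tmp" > "$pc_file" &&
        rm -f "$pc_file.tmp"
}

# Demonstration on a scratch copy.
demo=$(mktemp)
printf 'prefix=/usr/local\nincludedir=${prefix}/include\n' > "$demo"
fix_thrift_includedir "$demo"
cat "$demo"
rm -f "$demo"
```

Because the pattern only matches a line that ends in /include, running the helper a second time changes nothing.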


Now that thrift is compiled, we need to make it accessible to all programs by entering the following command

$ sudo cp /usr/local/lib/libthrift-0.9.2.so /usr/lib/



Now thrift is installed

Installing HBase on pseudo-distributed mode

In this post, we will install HBase in pseudo-distributed mode on Ubuntu. In a previous post, we set up Hadoop (HDFS) in pseudo-distributed mode.

To get started, first let's get the latest version of HBase from http://hbase.apache.org/


Next we download the HBase distribution to our local file system and extract the contents.

I downloaded hbase 1.1.0.1 from the Apache mirror and saved it to my local folder. To extract the tar file, use the following command.

$ tar -xzvf hbase-1.1.0.1-bin.tar.gz


After extracting the contents, we need to make some configuration changes. We start by opening the following file.

$ cd hbase-1.1.0.1
$ cd conf
$ sudo gedit hbase-site.xml

You will see a nearly empty file with an empty configuration XML node. Enter the following configuration for the HBase root directory and the ZooKeeper data directory.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/arthgallo/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/arthgallo/zookeeper</value>
  </property>
</configuration>




Next we need to make sure JAVA_HOME is set properly. Open the environment file to set it

$ sudo gedit /usr/local/hbase/conf/hbase-env.sh

Uncomment the JAVA_HOME line and point it to the path where JDK is installed. In case you have trouble determining the best way to set up JAVA_HOME, refer to this post for how to do so.

The entry to be made in the hbase-env.sh file is a single line of the form export JAVA_HOME=<path to your JDK>.
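
As a rough sketch, the JAVA_HOME entry can also be written without opening an editor. The helper name set_java_home_line is hypothetical, the commented-out line in the demonstration is only an assumed stand-in for the one shipped in hbase-env.sh, and /usr/lib/jvm/default-java is the symbolic link from my JAVA_HOME post.

```shell
#!/bin/sh
# Hypothetical helper: replace a commented-out (or existing) JAVA_HOME
# export in an env file with one pointing at the given JDK directory.
set_java_home_line() {
    env_file=$1
    jdk_dir=$2
    sed "s|^#* *export JAVA_HOME=.*|export JAVA_HOME=$jdk_dir|" \
        "$env_file" > "$env_file.tmp" &&
        cat "$env_file.tmp" > "$env_file" &&
        rm -f "$env_file.tmp"
}

# Demonstration on a scratch file; the commented line is an assumed
# stand-in for the one in conf/hbase-env.sh.
demo=$(mktemp)
printf '# export JAVA_HOME=/usr/java/jdk1.6.0/\n' > "$demo"
set_java_home_line "$demo" /usr/lib/jvm/default-java
cat "$demo"
rm -f "$demo"
```

On the real file the helper would need write access to /usr/local/hbase/conf, so check the result with an editor afterwards.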



Now in the terminal shell cd to the folder where HBase was extracted.

$ cd ~/Work/Servers/HBase/hbase-1.1.0.1
$ cd bin
$ ./start-hbase.sh



Now enter jps to make sure that HBase started correctly.

$ jps

We can see the HMaster process, which tells us that HBase started correctly.


Next we will connect to the HBase instance and run a few commands to make sure everything is working correctly.

In the bin folder, enter the following command

$ ./hbase shell



This will connect to the HBase shell client. Enter the following commands on the hbase shell to make sure everything is working correctly.

> create 'test', 'cf'
> list 'test'
> put 'test', 'row1', 'cf:a', 'value1'
> put 'test', 'row2', 'cf:b', 'value2'
> put 'test', 'row3', 'cf:c', 'value3'
> scan 'test'


Now that we have verified that everything is working, it is time to clean up and switch to the pseudo-distributed mode.

Enter the following commands on the hbase shell prompt

> disable 'test'
> drop 'test'
> quit



The next step is to stop HBase

$ ./stop-hbase.sh


Next we need to edit the HBase configuration file to run it in pseudo-distributed mode.

Edit the hbase-site.xml file again, and make the following entries

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/zookeeper</value>
  </property>
</configuration>


Save the file and close it.

Make sure the zookeeper folder is created and that the zookeeper and hbase directories are accessible to the user. Creating a folder under /usr/local needs root, hence the sudo.

$ sudo mkdir /usr/local/zookeeper
$ sudo chown -R hadoop_admin /usr/local/zookeeper
$ sudo chown -R hadoop_admin /usr/local/hbase



Make sure HDFS (Hadoop Distributed File System) is installed and running. To check if HDFS is running enter the following command on the terminal window

$ hdfs dfsadmin -report

It should print a report of the configured capacity and the live datanodes.



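Next, from the bin folder, run the same command as before to start HBase.

$ ./start-hbase.sh


Once it is started, the script should report that it is starting ZooKeeper, the master and a region server.


Run the jps command to see the running processes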

$ jps
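
Reading the jps listing by eye works, but the check can also be scripted. A minimal sketch follows; the helper name missing_processes is made up, and the process names HMaster and HRegionServer are what I would expect to see in pseudo-distributed mode, so adjust them to what your setup shows.

```shell
#!/bin/sh
# Hypothetical helper: read jps output on stdin and print each expected
# process name (given as arguments) that does not appear in it.
missing_processes() {
    jps_output=$(cat)
    for name in "$@"; do
        # jps prints "<pid> <name>", so compare against the last field.
        if ! printf '%s\n' "$jps_output" |
            awk -v n="$name" '$NF == n { found = 1 } END { exit !found }'
        then
            echo "$name"
        fi
    done
}

# Demonstration with canned output instead of a live jps call.
printf '4021 HMaster\n4150 Jps\n' | missing_processes HMaster HRegionServer
```

On a real machine the call would be jps | missing_processes HMaster HRegionServer, and empty output means nothing is missing.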



Next we can again open a new terminal to start hbase shell, and run the same commands again

> create 'test', 'cf'
> list 'test'
> put 'test', 'row1', 'cf:a', 'value1'
> put 'test', 'row2', 'cf:b', 'value2'
> put 'test', 'row3', 'cf:c', 'value3'
> scan 'test'

We can check that everything executed correctly.



That's it, folks. HBase is set up correctly.

Also, if you have set up thrift on your machine you can start the hbase thrift server with the following command

$ /usr/local/hbase/bin/hbase-daemon.sh start thrift


Tuesday, July 14, 2015

Upgrading R to latest version on Ubuntu

I am using an LTS version of Ubuntu on my machine, so the R I was using was not the latest. You can check the installed version of R on your machine simply by typing the following command.

$ R

The installed version of R was 3.0.2, whereas the latest version of R was 3.2.1




To update R, add your favourite CRAN mirror to your sources. Enter the following command to open the sources list

$ sudo gedit /etc/apt/sources.list



Add the following line to the end of the file. For me it was the Berkeley mirror.

deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu trusty/
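
If you would rather script this step, the thing to watch is not adding the same line twice. A minimal sketch, with a made-up helper name append_once, shown on a scratch file instead of /etc/apt/sources.list:

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if it is not there yet.
append_once() {
    line=$1
    file=$2
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file; the second call adds nothing.
demo=$(mktemp)
append_once "deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu trusty/" "$demo"
append_once "deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu trusty/" "$demo"
cat "$demo"
rm -f "$demo"
```

The real sources list is owned by root, so the helper would have to run with root privileges there.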


Next, run the following command to obtain the key

$ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9


Next we add the key to apt-key

$ gpg -a --export E084DAB9 | sudo apt-key add -


Next we run update, and install r-base

$ sudo apt-get update
$ sudo apt-get install r-base


Next, we can check the version of R again, the same way as before, to confirm that the upgrade worked.


Yup, we have the latest version.

Saturday, July 11, 2015

Setting up JAVA_HOME on Ubuntu

In the following post, we will follow steps to set up JAVA_HOME for our environment.

First, become root

$ sudo su

and enter password


Next check java install folder

$ which java


However this is just the symbolic link. To find the actual folder, we will use the following command.

$ readlink -f $(which java)
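
The readlink output is the path of the java binary, while JAVA_HOME needs the JDK folder above it. A small sketch of that step, with a hypothetical helper name jdk_dir_from_java and my Oracle JDK path as the sample input:

```shell
#!/bin/sh
# Hypothetical helper: strip the trailing /bin/java (and a /jre component,
# if present) from a resolved java path to get the JDK directory.
jdk_dir_from_java() {
    dir=${1%/bin/java}
    dir=${dir%/jre}
    printf '%s\n' "$dir"
}

# Sample input; on a real machine pass "$(readlink -f "$(which java)")".
jdk_dir_from_java /usr/lib/jvm/jdk1.8.0-oracle/bin/java
```

For the sample path this prints /usr/lib/jvm/jdk1.8.0-oracle, which is the folder the default-java link should point to.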



Now we will create a new symbolic link for the actual folder within /usr/lib/jvm called default-java pointing to my Oracle JDK in this case. This will allow me to re-point the symbolic link to any future versions if needed.

$ cd /usr/lib/jvm
$ sudo ln -s /usr/lib/jvm/jdk1.8.0-oracle default-java



Next, go back to home and open the .bashrc file in editor

$ cd 
$ gedit .bashrc


Add the following lines at the end of the file

JAVA_HOME=/usr/lib/jvm/default-java
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH



Now save and close this file, and close the terminal

Open a new terminal using Ctrl-Alt-T

Enter the command

$ echo $JAVA_HOME
$ java -version

You should see JAVA_HOME as well as the version of Java installed. Setting it up this way allows you to switch the current JVM using the update-alternatives command. To see how to configure it using update-alternatives, refer to my previous post.



That's it!

Thursday, July 9, 2015

Install Sun/ Oracle JDK on Ubuntu

For certain programs, I needed to install the Sun/Oracle JDK on my machine. My default JDKs were different versions of IcedTea, which would not suffice for what I needed to do (to set up IcedTea, check my previous post). In this post, we will see how to install the Sun/Oracle JDK on Ubuntu.

First, I checked which JDKs were already installed by listing the alternatives registered for java with the following command.

$ sudo update-alternatives --list java


I also verified the contents of the /usr/lib/jvm folder to make sure I had not missed configuring a previously installed JDK.

$ ls /usr/lib/jvm


Neither listing showed a Sun JDK installed, so I needed to start from the beginning.

First step was to go to Oracle site for downloading the latest JDK (http://www.oracle.com/technetwork/java/javase/overview/index.html). Follow the links to the latest version and download it to your local machine.


Next, I copied the downloaded JDK installation file from my Downloads folder to my local folder ~/Work/Servers/jdk , where I created the following script in a shell file called install-jdk1.8.sh.

#!/bin/sh

# Unpack the downloaded JDK archive found in the current folder
tar -xvf jdk-8*
# -p: do not fail if /usr/lib/jvm already exists
sudo mkdir -p /usr/lib/jvm
sudo mv ./jdk1.8* /usr/lib/jvm/jdk1.8.0-oracle
# Register the Oracle binaries as alternatives with priority 1
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0-oracle/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.8.0-oracle/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.8.0-oracle/bin/javaws" 1
# Make sure the registered commands are executable
sudo chmod a+x /usr/bin/java
sudo chmod a+x /usr/bin/javac
sudo chmod a+x /usr/bin/javaws



Next we make this script executable, and execute it.

$ chmod a+x *.sh
$ ./install-jdk1.8.sh


Once the script has run, we can test the java version to confirm everything got set up correctly.

$ java -version


If everything is set up correctly, great. If you still see an older version of the JDK, use the update-alternatives command to select the right one by following these steps. Also, many programs require an explicit JAVA_HOME to be set in order to run correctly. This post on setting JAVA_HOME shows how I have set it up on my machine.