Thursday, November 6, 2014

SonarQube : Using SonarQube with Maven and Mac OSX







SonarQube (formerly known simply as Sonar) is a server-based, open source platform for inspecting the code quality of a project. It's mainly designed for continuous code quality inspection in multi-developer environments, but it can be used to inspect code quality in individual projects as well.


In this short tutorial I'll explain how to use SonarQube in a single-developer environment to inspect the code quality of a Java project.

I have used the following setup to get SonarQube up and running.
Mac OS X (10.9.4)
MySQL
Maven 3.0
JDK 1.6 or above
SonarQube 4.4.1


1. Download, Install and run SonarQube

It's open source..!! You can download it for free from the SonarQube website: SonarQube download [1]. Unzip it somewhere convenient.
Set the SONAR_HOME environment variable to the SonarQube base directory in your .profile or .bashrc script.
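For example, assuming the archive was extracted to ~/tools/sonarqube-4.4.1 (an example path; use wherever you unzipped it), the entry in ~/.bashrc or ~/.profile would look something like this:
# SonarQube base directory (example path - adjust to your extract location)
export SONAR_HOME=~/tools/sonarqube-4.4.1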


2. Create a MySQL database for SonarQube

Source the following script on MySQL to create a new database for SonarQube: sonar database script [2]. This is essential, as SonarQube keeps all project-related data in this database.

Make sure the database was created properly by logging in to the MySQL server.
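If you don't have the script at hand, what it does boils down to roughly the following statements, run from the mysql client as root (a sketch assuming the default sonar/sonar credentials):
CREATE DATABASE sonar CHARACTER SET utf8 COLLATE utf8_general_ci;
GRANT ALL ON sonar.* TO 'sonar'@'localhost' IDENTIFIED BY 'sonar';
FLUSH PRIVILEGES;
Afterwards, logging in with "mysql -u sonar -p" and running "SHOW DATABASES;" should list the sonar database.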




3. Set up SonarQube

Go to the SonarQube conf directory using "cd $SONAR_HOME/conf" and open the sonar.properties file using vim.

# Permissions to create tables, indices and triggers must be granted to JDBC user.
# The schema must be created first.
sonar.jdbc.username=sonar
sonar.jdbc.password=sonar

#----- Embedded database H2
# Note: it does not accept connections from remote hosts, so the
# SonarQube server and the maven plugin must be executed on the same host.

# Comment the following line to deactivate the default embedded database.
sonar.jdbc.url=jdbc:h2:tcp://localhost:9092/sonar

# directory containing H2 database files. By default it's the /data directory in the SonarQube installation.
#sonar.embeddedDatabase.dataDir=
# H2 embedded database server listening port, defaults to 9092
#sonar.embeddedDatabase.port=9092


#----- MySQL 5.x
# Comment the embedded database and uncomment the following line to use MySQL
sonar.jdbc.url=jdbc:mysql://localhost:3306/sonar?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true

Note down sonar.jdbc.username and sonar.jdbc.password; these are the credentials SonarQube uses to connect to the database.
Comment out the H2 sonar.jdbc.url line and uncomment the MySQL sonar.jdbc.url line so that SonarQube uses the MySQL database.
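After editing, the relevant lines in sonar.properties should end up looking like this:
sonar.jdbc.username=sonar
sonar.jdbc.password=sonar

# embedded H2 disabled
#sonar.jdbc.url=jdbc:h2:tcp://localhost:9092/sonar

# MySQL enabled
sonar.jdbc.url=jdbc:mysql://localhost:3306/sonar?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true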


4. Set up Maven to run SonarQube on every build.

Move to the .m2 directory using the following command.
cd ~/.m2/

Create a new file settings.xml.
vim settings.xml 

Save the following XML as settings.xml using vim.
<settings>
    <profiles>
        <profile>
            <id>sonar</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <properties>
                <!-- Example for MySQL-->
                <sonar.jdbc.url>
                  jdbc:mysql://localhost:3306/sonar?useUnicode=true&amp;characterEncoding=utf8
                </sonar.jdbc.url>
                <sonar.jdbc.username>root</sonar.jdbc.username>
                <sonar.jdbc.password>{your_mysql_root_password}</sonar.jdbc.password>
 
                <!-- Optional URL to server. Default value is http://localhost:9000 -->
                <sonar.host.url>
                  http://localhost:9000
                </sonar.host.url>
            </properties>
        </profile>
     </profiles>
</settings>
Change the MySQL root username and password accordingly, and point sonar.host.url at your local SonarQube server (http://localhost:9000).

5. Run the SonarQube server.

Move to the SonarQube bin directory.
cd $SONAR_HOME/bin/

Run the SonarQube server as follows.
./macosx-universal-64/sonar.sh start

Now, using a web browser, go to http://localhost:9000/. You should see the SonarQube dashboard.
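If you prefer the command line, a quick way to check that the server is up (assuming the default port 9000):
curl -I http://localhost:9000/
An HTTP 200 response means the dashboard is being served.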


6. Inspect a project using SonarQube.

From the CLI, move to the root directory of your project (where the main pom.xml is) and run the following command.
 mvn clean install sonar:sonar

After a successful build, open the SonarQube dashboard in a web browser and you will see your newly built project under the "Projects" tab.
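If you'd rather not keep database credentials in settings.xml, the same analysis properties can also be passed on the command line; a sketch, reusing the values from the profile above:
mvn clean install sonar:sonar \
  -Dsonar.host.url=http://localhost:9000 \
  -Dsonar.jdbc.url="jdbc:mysql://localhost:3306/sonar?useUnicode=true&characterEncoding=utf8" \
  -Dsonar.jdbc.username=root \
  -Dsonar.jdbc.password={your_mysql_root_password}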




Congratulations! Now you can use SonarQube to inspect the quality of your code every time you build it.



References :




Friday, October 24, 2014

Hadoop : Single Node Cluster On Mac OS X




Setting up a single node cluster the proper way can be a little challenging if you are new to Hadoop and Mac OS. Hopefully the following guide will be helpful for creating a Hadoop single node cluster on Mac OS X.

Prerequisites:

  • Mac OS X 10.9.4 or higher
  • Java 1.6
  • Hadoop 2.2.0 or higher
Basic knowledge of terminal commands will definitely be useful as well. :)

Create a separate user for hadoop


Even though it's not required, creating a dedicated user and user group is recommended to separate the Hadoop installation from other applications and user accounts on the same machine.

  1. Go to System Preferences -> Users & Groups.
  2. Unlock the window and create a group named 'hadoop'.
  3. Create a user named 'hduser'.
  4. Click on the newly created hadoop group and tick hduser to add it to the hadoop group.
This can also be done from the terminal, but for some reason the usual Linux commands didn't work for me on Mac OS X; a rough dscl-based sketch is included below in case you want to try it.
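For reference, the OS X terminal equivalent uses dscl rather than the Linux useradd/groupadd tools. Something along these lines should create the group and user, though I haven't verified it, and the numeric IDs (600) are arbitrary examples - pick unused ones:
sudo dscl . -create /Groups/hadoop
sudo dscl . -create /Groups/hadoop PrimaryGroupID 600
sudo dscl . -create /Users/hduser
sudo dscl . -create /Users/hduser UserShell /bin/bash
sudo dscl . -create /Users/hduser UniqueID 600
sudo dscl . -create /Users/hduser PrimaryGroupID 600
sudo dscl . -append /Groups/hadoop GroupMembership hduser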

SSH Configuration for hadoop



Hadoop needs SSH to connect to its nodes, so we need to configure an SSH connection to localhost while logged in as hduser.

SSH should be installed on your machine before proceeding. There are plenty of tutorials out there on how to enable SSH on a Mac.


1. Generate ssh key for hduser

Log in as hduser.
su - hduser

Generate an SSH key for hduser.
ssh-keygen -t rsa -P ""
Creating an RSA key pair with an empty passphrase is not generally recommended, but this way you don't have to enter a password every time Hadoop communicates with its nodes.

Now we are ready to enable SSH connections to the local machine with the generated key. Move into the .ssh directory:
cd .ssh/

Then append the public key from id_rsa.pub to authorized_keys using the following command.
cat id_rsa.pub >> authorized_keys

Finally, we are ready to connect through SSH.
ssh -vvv localhost

If you get a connection refused error, your Mac probably has Remote Login turned off. Go to System Preferences -> Sharing and check Remote Login.

Now ssh should work fine with localhost.
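Remote Login can also be checked and enabled from the terminal (requires an administrator account):
sudo systemsetup -getremotelogin
sudo systemsetup -setremotelogin on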

Optional :
If you face any conflicts with local resources, it's a good idea to force Hadoop to use IPv4.
Add the following line to {hadoop_2.2.0_base}/etc/hadoop/hadoop-env.sh
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true



Hadoop configurations




Extract the downloaded Hadoop 2.2.0 archive and move the hadoop directory to any globally accessible location. I have moved it to the /hadoop/ directory in the machine's file system.

Make sure to give hduser ownership of the Hadoop directory using the following command.
sudo chown -R hduser:hadoop hadoop-2.2.0

Execute the following command to open .bashrc for hduser.
vim ~/.bashrc

Add the Hadoop binary paths to the system PATH variable so you can access them system-wide.
#set Hadoop-related environment variables
export HADOOP_HOME=/hadoop/hadoop-2.2.0/

# set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/

# set hadoop executables system wide
export PATH=$PATH:/hadoop/hadoop-2.2.0/bin:/hadoop/hadoop-2.2.0/sbin

Make sure to change JAVA_HOME to your own Java home path. Before proceeding, check that these environment variables are set; if not, restart the terminal (or the computer) and make sure all variables are set properly.

If by any chance you don't know where your Java path is, just use the following JAVA_HOME command and the Mac will set the latest available Java home path in the environment variable.
export JAVA_HOME=`/usr/libexec/java_home`

Create two directories as hduser to hold the name node and data node data for the HDFS filesystem. The paths should match the dfs.namenode.name.dir and dfs.datanode.data.dir values configured in hdfs-site.xml below.
mkdir -p /hadoop/hadoop-2.2.0/yarn_data/hdfs/namenode
mkdir -p /hadoop/hadoop-2.2.0/yarn_data/hdfs/datanode

Export the following variables in {hadoop_2.2.0_base}/etc/hadoop/hadoop-env.sh
export JAVA_HOME="YOUR JAVA HOME PATH"
export HADOOP_COMMON_LIB_NATIVE_DIR="/hadoop/hadoop-2.2.0/lib"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/hadoop/hadoop-2.2.0/lib"

Edit the following configuration files in {hadoop base directory}/etc/hadoop.

/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

/etc/hadoop/hdfs-site.xml 
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/hadoop/hadoop-2.2.0/yarn_data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/hadoop/hadoop-2.2.0/yarn_data/hdfs/datanode</value>
    </property>
</configuration>

/etc/hadoop/yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
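Note that Hadoop 2.2.0 ships only a mapred-site.xml.template; if mapred-site.xml does not exist yet, copy the template first and then add the property above.
cd /hadoop/hadoop-2.2.0/etc/hadoop
cp mapred-site.xml.template mapred-site.xml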


Finally, we can format the Hadoop name node by executing the following from the bin directory of the Hadoop installation.
./hdfs namenode -format

If everything went correctly, you will see something similar to the following (I have trimmed part of the classpath for clarity).
14/10/24 10:35:29 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = pumudus-MacBook-Pro.local/10.100.x.xxx
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.2.0
STARTUP_MSG:   classpath = /hadoop/hadoop-2.2.0/etc/hadoop:/hadoop/hadoop-2.2.0/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.2.0/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.2.0/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.2.0/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar:/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0.jar:/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.2.0.jar:/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar:/contrib/capacity-scheduler/*.jar
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common -r 1529768; compiled by 'hortonmu' on 2013-10-07T06:28Z
STARTUP_MSG:   java = 1.7.0_65
************************************************************/
14/10/24 10:35:29 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
14/10/24 10:35:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-b2490d76-561a-42cc-9918-1a94cc0bc96a
14/10/24 10:35:29 INFO namenode.HostFileManager: read includes:
HostSet(
)
14/10/24 10:35:29 INFO namenode.HostFileManager: read excludes:
HostSet(
)
14/10/24 10:35:29 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
14/10/24 10:35:29 INFO util.GSet: Computing capacity for map BlocksMap
14/10/24 10:35:29 INFO util.GSet: VM type       = 64-bit
14/10/24 10:35:29 INFO util.GSet: 2.0% max memory = 889 MB
14/10/24 10:35:29 INFO util.GSet: capacity      = 2^21 = 2097152 entries
14/10/24 10:35:29 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
14/10/24 10:35:29 INFO blockmanagement.BlockManager: defaultReplication         = 1
14/10/24 10:35:29 INFO blockmanagement.BlockManager: maxReplication             = 512
14/10/24 10:35:29 INFO blockmanagement.BlockManager: minReplication             = 1
14/10/24 10:35:29 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
14/10/24 10:35:29 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
14/10/24 10:35:29 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
14/10/24 10:35:29 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
14/10/24 10:35:29 INFO namenode.FSNamesystem: fsOwner             = hduser (auth:SIMPLE)
14/10/24 10:35:29 INFO namenode.FSNamesystem: supergroup          = supergroup
14/10/24 10:35:29 INFO namenode.FSNamesystem: isPermissionEnabled = true
14/10/24 10:35:29 INFO namenode.FSNamesystem: HA Enabled: false
14/10/24 10:35:29 INFO namenode.FSNamesystem: Append Enabled: true
14/10/24 10:35:30 INFO util.GSet: Computing capacity for map INodeMap
14/10/24 10:35:30 INFO util.GSet: VM type       = 64-bit
14/10/24 10:35:30 INFO util.GSet: 1.0% max memory = 889 MB
14/10/24 10:35:30 INFO util.GSet: capacity      = 2^20 = 1048576 entries
14/10/24 10:35:30 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/10/24 10:35:30 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
14/10/24 10:35:30 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
14/10/24 10:35:30 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
14/10/24 10:35:30 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
14/10/24 10:35:30 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
14/10/24 10:35:30 INFO util.GSet: Computing capacity for map Namenode Retry Cache
14/10/24 10:35:30 INFO util.GSet: VM type       = 64-bit
14/10/24 10:35:30 INFO util.GSet: 0.029999999329447746% max memory = 889 MB
14/10/24 10:35:30 INFO util.GSet: capacity      = 2^15 = 32768 entries
Re-format filesystem in Storage Directory /usr/local/hadoop/yarn_data/hdfs/namenode ? (Y or N) Y
14/10/24 10:35:40 INFO common.Storage: Storage directory /usr/local/hadoop/yarn_data/hdfs/namenode has been successfully formatted.
14/10/24 10:35:40 INFO namenode.FSImage: Saving image file /usr/local/hadoop/yarn_data/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
14/10/24 10:35:40 INFO namenode.FSImage: Image file /usr/local/hadoop/yarn_data/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 198 bytes saved in 0 seconds.
14/10/24 10:35:40 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/10/24 10:35:40 INFO util.ExitUtil: Exiting with status 0
14/10/24 10:35:40 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at pumudus-MacBook-Pro.local/10.100.x.xxx
************************************************************/


Starting Hadoop file system and yarn


It's good practice to start each process one after another, as it will be easier to debug if something goes wrong.

Start the namenode
hadoop-daemon.sh start namenode

Use the jps tool to see whether the namenode process started successfully.
pumudus-MacBook-Pro:sbin hduser$ jps
33907 Jps
33880 NameNode

Start datanode
hadoop-daemon.sh start datanode

Start node manager 
yarn-daemon.sh start nodemanager


Start history server
mr-jobhistory-daemon.sh start historyserver

If everything went correctly, you will see all the Java processes listed by jps as follows.
pumudus-MacBook-Pro:sbin hduser$ ./hadoop-daemon.sh start datanode
starting datanode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-hduser-datanode-pumudus-MacBook-Pro.local.out
pumudus-MacBook-Pro:sbin hduser$ jps
33965 DataNode
34001 Jps
33880 NameNode
pumudus-MacBook-Pro:sbin hduser$ ./yarn-daemon.sh start nodemanager
starting nodemanager, logging to /hadoop/hadoop-2.2.0/logs/yarn-hduser-nodemanager-pumudus-MacBook-Pro.local.out
pumudus-MacBook-Pro:sbin hduser$ jps
34066 Jps
33965 DataNode
34035 NodeManager
33880 NameNode
pumudus-MacBook-Pro:sbin hduser$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /hadoop/hadoop-2.2.0/logs/mapred-hduser-historyserver-pumudus-MacBook-Pro.local.out
pumudus-MacBook-Pro:sbin hduser$ jps
34096 JobHistoryServer
33965 DataNode
34035 NodeManager
33880 NameNode
34120 Jps
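Note that the jps listing above does not include a YARN ResourceManager; if you intend to submit MapReduce jobs you will want to start it too, and the daemons can later be stopped with the same scripts. A minimal sketch, run from the same sbin directory:
./yarn-daemon.sh start resourcemanager

# stop the daemons when you are done
./mr-jobhistory-daemon.sh stop historyserver
./yarn-daemon.sh stop nodemanager
./yarn-daemon.sh stop resourcemanager
./hadoop-daemon.sh stop datanode
./hadoop-daemon.sh stop namenode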


Troubleshooting for Mac OS X :

If you face issues on Mac OS X, add the following lines to hadoop-env.sh as well.

1. "Unable to load realm info from SCDynamicStore put: .." when starting a namenode.
       This is a known issue in Hadoop. There's an open JIRA for this issue as well:
       https://issues.apache.org/jira/browse/HADOOP-7489
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.conf=/dev/null"

2.  "Can't connect to window server - not enough permissions." when formatting a name node.
       This is a Java error specific to Mac OS X. Add the following line to use headless mode in Hadoop.
       Refer to http://www.oracle.com/technetwork/articles/javase/headless-136834.html for more information.
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.awt.headless=true"

3. If you are unable to start any process, refer to the logs generated by Hadoop in the logs directory. These logs are very descriptive and will help you pinpoint issues.


Hadoop Web Interfaces


Hadoop comes with web interfaces which show the current status of HDFS and MapReduce tasks.

   1. See the status of the HDFS name node: http://localhost:50070
   2. See the status of the HDFS secondary name node: http://localhost:50090
   3. See the Hadoop job history: http://localhost:19888/

If you can access these web interfaces, it means Hadoop has been configured correctly on your machine as a single node cluster. If there are any questions, feel free to ask in the comments below.
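As a final smoke test from the command line, you can create a home directory for hduser on HDFS and list the filesystem root (assuming the Hadoop bin directory is on the PATH as configured earlier):
hdfs dfs -mkdir -p /user/hduser
hdfs dfs -ls /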


Good luck..!!

Thursday, July 24, 2014

Hadoop : Introduction



What's Hadoop ?


Apache Hadoop is an open-source software library which allows distributed processing of large data sets across clusters of computers using simple programming models. It scales from a single machine to thousands of clustered computers, each capable of computing and storing data. Therefore Hadoop can be used to develop highly available services on top of clusters of computers.

The Hadoop framework is a set of open source projects which provides a common set of services. A key attribute of Hadoop is that it is redundant and reliable: there is zero data loss even if several machines in a cluster fail.

The following are the main modules available in the latest Hadoop release.
  1. Hadoop Common: Utility framework which supports the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS™): Distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. 

The Hadoop framework mainly targets the following problems found in traditional single-disk file systems.
  1. Slow read/write speeds on disks -
    Most storage devices have performance limits because they are still based on moving mechanical components, and it is not yet practical to use SSDs for mass storage due to the high cost. Therefore a workaround was needed to increase read/write speeds.
    Hadoop presents a simple solution to this issue: by replicating the same data across multiple disks, it can achieve high read speeds by reading from multiple disks in parallel.

  2. Hardware failure -
    Since hard drives are based on mechanical parts, they tend to fail more often than other components in a computer.
    Hadoop also handles this issue by replicating data across several hard drives and storage nodes. If one hard drive fails, another drive holding a replica can act as a backup and restore the data.

  3. How to merge data from different sources and communicate efficiently -
    Since the requested data can be spread across several hard drives, combining it and transmitting it to the requesting parties in the most efficient way was another main issue to be addressed.
    This is where HDFS and the MapReduce model come into the picture: only the final, reduced result is transmitted over the network, which saves transmission time and network bandwidth.

Technical Overview





Basically, a Hadoop machine consists of two parts.
  1. Task Tracker : the processing part of the Hadoop machine (MapReduce server)
  2. Data Node : the data part of the Hadoop machine (HDFS server)
A Hadoop cluster consists of 3 main parts.
  1. Name Node : keeps the directory tree of all files in the file system (an index of the files in HDFS)
  2. Job Tracker : the service which allocates MapReduce tasks to each task tracker in the pool
  3. Pool of Hadoop machines