Hadoop Single Node Installation
Java environment setup
1 | sudo apt-get update |
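The block above only shows the package index update; installing a JDK is the usual next step. A minimal sketch (the openjdk-8-jdk package name is an assumption and may differ on your Ubuntu release):
# Install a JDK so that /usr/bin/java exists (package name is an assumption)
sudo apt-get install -y openjdk-8-jdk
# Verify the installation
java -version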
Adding a dedicated Hadoop system user
This will add the user hduser and the group hadoop to your local machine.
1 | sudo addgroup hadoop |
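The block above is truncated; a sketch of the usual pair of commands matching the user and group named in this tutorial (not necessarily the exact lines of the original post):
# Create the hadoop group and the hduser user inside it
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser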
Configuring SSH
1 | su hduser |
The output looks like this:
1 | hduser@hanson:~$ ssh-keygen -t rsa -P "" |
Enable SSH access to your local machine with this newly created key:
1 | hduser@hanson:~$ s |
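The command above is cut off; a common way to authorize the newly generated key for logins to localhost (an assumption about the missing command, not a verbatim restoration) is:
# Append the new public key to the list of authorized keys
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 600 $HOME/.ssh/authorized_keys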
Test the SSH setup by connecting to your local machine with the hduser user:
1 | hduser@hanson:~$ ssh localhost |
If SSH is not already installed, install it first (sudo apt-get install ssh).
The output should look like this:
1 | The authenticity of host 'localhost (127.0.0.1)' can't be established. |
Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0
for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf
in the editor of your choice and add the following lines to the end of the file:
1 | # disable ipv6 |
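The block above shows only the comment line; the lines commonly added to /etc/sysctl.conf to disable IPv6 on Ubuntu are:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1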
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
1 | cat /proc/sys/net/ipv6/conf/all/disable_ipv6 |
If the output is 1, then IPv6 is disabled.
Alternative
You can also disable IPv6 only for Hadoop as documented in HADOOP-3437. You can do so by adding the following line to hadoop/etc/hadoop/hadoop-env.sh
:
1 | export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true |
Hadoop installation
Download Hadoop from the Apache Download Mirrors
and extract the contents of the Hadoop package to a location of your choice. I picked $HADOOP_HOME. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:
- wget http://mirrors.ocf.berkeley.edu/apache/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz (download hadoop)
- tar -xzvf hadoop-2.7.4.tar.gz (extract compressed file)
- sudo mv hadoop-2.7.4 $HADOOP_HOME
- sudo chown -R hduser:hadoop $HADOOP_HOME (change the owner to hduser:hadoop)
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.
1 | # Set Hadoop-related environment variables |
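The block above shows only its first line; a sketch of typical exports, assuming Hadoop is installed at ~/Program/hadoop (the location used later in this tutorial) and Java is found via readlink:
# Set Hadoop-related environment variables (install location is an assumption)
export HADOOP_HOME=$HOME/Program/hadoop
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
# Add Hadoop's bin/ and sbin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin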
Then you should run ‘source .bashrc’ to enable the new configuration.
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open hadoop/etc/hadoop/hadoop-env.sh:
- readlink -f /usr/bin/java | sed "s:bin/java::" (find the default Java path)
- sudo vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# $HADOOP_HOME/etc/hadoop/hadoop-env.sh
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

Now run $HADOOP_HOME/bin/hadoop (or simply hadoop). If the following usage message appears, Hadoop is installed correctly:

Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl>   copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>   create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon

You can repeat this exercise also for other users who want to use Hadoop.
Create a directory called input in our home directory and copy Hadoop’s configuration files into it to use those files as our data.
1 | mkdir ~/input |
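The block above shows only the mkdir; copying the stock configuration files in as sample data might look like this (assuming $HADOOP_HOME is set as in the .bashrc section):
# Use Hadoop's own XML configuration files as input data
cp $HADOOP_HOME/etc/hadoop/*.xml ~/input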
Then run:
1 | hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep ~/input ~/grep_example 'principal[.]*' |
You should see output like the following:
1 | Output |
If you then run:
1 | cat ~/grep_example/* |
You will see:
1 | Output |
If you run this test again, it will fail because the output directory already exists. Don’t worry, just delete the grep_example folder and everything will work fine again.
Hadoop Pseudo-distributed Mode
HDFS Configuration
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS
, even though our little “cluster” only contains our single local machine.
You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice. We will use the directory ~/Program/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
Now we create the directory and set the required ownerships and permissions:
1 | mkdir -p ~/Program/hadoop/tmp |
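The ownership and permission steps are not visible in the truncated block above; a sketch, assuming the hduser:hadoop owner used throughout this tutorial:
mkdir -p ~/Program/hadoop/tmp
# Hand the directory to the Hadoop user and restrict access (owner and mode are assumptions)
sudo chown -R hduser:hadoop ~/Program/hadoop/tmp
chmod 750 ~/Program/hadoop/tmp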
Add the following snippets to the respective configuration files described below.
HDFS is the distributed file system used by Hadoop to store data in the cluster, capable of hosting very very (very) large files, splitting them over the nodes of the cluster. Theoretically, you don’t need to have it running and files could instead be stored elsewhere like S3 or even the local file system (if using a purely local Hadoop installation). However, some applications require interactions with HDFS so you may have to set it up sooner or later if you’re using third party modules. HDFS is composed of a NameNode, which holds all the metadata regarding the stored files, and DataNodes (one per node in the cluster), which hold the actual data.
The main HDFS configuration file is located at $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml
. If you’ve been following since the beginning, this file should be empty so it will use the default configurations outlined in this page. For a single-node installation of HDFS you’ll want to change hdfs-site.xml
to have, at the very least, the following:
First you should create a dfs folder under tmp, with datanode and namenode folders inside it:
tmp
|– dfs
|   |– datanode
|   |– namenode
1 | <configuration> |
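The block above is truncated; a minimal sketch of hdfs-site.xml for a single node, assuming the tmp/dfs layout created above (adjust the file:// paths to your own home directory):
<configuration>
  <!-- Only one copy of each block on a single-node setup -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- NameNode metadata directory (path is an assumption based on the layout above) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hduser/Program/hadoop/tmp/dfs/namenode</value>
  </property>
  <!-- DataNode block storage directory (path is an assumption based on the layout above) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hduser/Program/hadoop/tmp/dfs/datanode</value>
  </property>
</configuration>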
In addition, add the following to $HADOOP_PREFIX/etc/hadoop/core-site.xml
to let the Hadoop modules know where the HDFS NameNode
is located.
1 | <configuration> |
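Again the block is truncated; a sketch of core-site.xml that points clients at a local NameNode (port 9000 is a common choice and an assumption here, as is the tmp path):
<configuration>
  <!-- Base for Hadoop's temporary directories -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/Program/hadoop/tmp</value>
  </property>
  <!-- Where the HDFS NameNode listens -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>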
Note: if you configured core-site.xml as above, the example test below will fail, because the hadoop command now tries to reach HDFS (which is not running yet) instead of the local file system.
1 | # This test will fail with connection refused |
YARN on a Single Node
YARN is the component responsible for allocating containers in which to run tasks, coordinating the execution of those tasks, restarting them in case of failure, and other housekeeping. Just like HDFS, it also has 2 main components: a ResourceManager, which keeps track of the cluster resources, and NodeManagers in each of the nodes, which communicate with the ResourceManager and set up containers for execution of tasks.
You can run a MapReduce job on YARN in pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager and NodeManager daemons.
Configure parameters as follows: etc/hadoop/mapred-site.xml:
1 | <configuration> |
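The truncated block above normally contains a single property telling MapReduce to run on YARN:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>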
Then edit $HADOOP_PREFIX/etc/hadoop/yarn-site.xml. The file should currently be empty, which means it is using the default configurations. For a single-node installation of YARN you’ll want to add the following to that file:
1 | <configuration> |
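A minimal sketch of the single-node yarn-site.xml; the aux-services entry below is the standard one from the Apache pseudo-distributed guide:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>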
Starting
Now that we’ve finished configuring everything, it’s time to set up the folders and start the daemons:
1 | ## Start HDFS daemons |
or
1 | # Format the namenode directory (DO THIS ONLY ONCE, THE FIRST TIME) |
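Both blocks above are truncated; assuming the PATH additions from the .bashrc section, the usual sequence is:
# Format the namenode directory (DO THIS ONLY ONCE, THE FIRST TIME)
hdfs namenode -format
# Start HDFS daemons (NameNode, DataNode, SecondaryNameNode)
$HADOOP_HOME/sbin/start-dfs.sh
# Start YARN daemons (ResourceManager, NodeManager)
$HADOOP_HOME/sbin/start-yarn.sh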
Hopefully, everything should be running. Use the command jps
to see if all daemons are launched. If one is missing, check $HADOOP_PREFIX/logs/<daemon with problems>.log
for any errors.
The output looks like this:
1 | hduser@hanson:$HADOOP_HOME/etc/hadoop$ jps |
Make the HDFS directories required to execute MapReduce jobs:
1 | $ bin/hdfs dfs -mkdir /user |
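The block above shows only the first command; the Apache guide also creates a per-user directory (hduser is an assumption here, use whichever user submits jobs):
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hduser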
Testing
To test if everything is working OK, let’s run one of the example applications shipped with Hadoop called DistributedShell. This application spawns a specified number of containers and runs a shell command in each of them. Let’s run DistributedShell with the ‘date’ command, which outputs the current time:
1 | # Run Distributed shell with 2 containers and executing the script `date`. |
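The command itself is truncated above. Based on the description that follows (the Client class, the same jar for the ApplicationMaster, `date` as the shell command, 2 containers, 1024 MB), a sketch of the invocation against a stock Hadoop 2.7.4 layout would be:
# Run DistributedShell with 2 containers, each executing `date`
hadoop jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.4.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.4.jar \
  -shell_command date \
  -num_containers 2 \
  -master_memory 1024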
The output looks like this:
1 | 17/11/26 11:20:20 INFO distributedshell.Client: Initializing Client |
With this command we are telling hadoop to run the Client class
in the hadoop-yarn-applications-distributedshell-2.7.4.jar
, passing it the jar containing the definition of the ApplicationMaster (the same jar), the shell command to run in each of the hosts (date)
, the number of containers to spawn (2) and the memory used by the ApplicationMaster (1024MB). The value of 1024 was set empirically by trying to run the program several times until it stopped failing due to the ApplicationMaster using more memory than that which had been allocated to it. You can check the entire set of parameters
you can pass to DistributedShell by using the same command without any arguments
:
1 | # Check the parameters for the DistributedShell client. |
The output should look like this:
1 | 17/11/26 11:29:46 INFO distributedshell.Client: Initializing Client |
Hadoop Web Interfaces
Web UIs for the Common User
The default Hadoop ports are as follows:
Daemon | Default Port | Configuration Parameter |
---|---|---|
HDFS | | |
Namenode | 50070 | dfs.http.address |
Datanodes | 50075 | dfs.datanode.http.address |
Secondarynamenode | 50090 | dfs.secondary.http.address |
Backup/Checkpoint node | 50105 | dfs.backup.http.address |
MapReduce | | |
Jobtracker | 50030 | mapred.job.tracker.http.address |
Tasktrackers | 50060 | mapred.task.tracker.http.address |
Hadoop daemons expose some information over HTTP. All Hadoop daemons expose the following:
/logs: Exposes, for downloading, log files in the Java system property hadoop.log.dir.
/logLevel: Allows you to dial up or down log4j logging levels. This is similar to hadoop daemonlog on the command line.
/stacks: Stack traces for all threads. Useful for debugging.
/metrics: Metrics for the server. Use /metrics?format=json to retrieve the data in a structured form. Available in 0.21.
Individual daemons expose extra daemon-specific endpoints as well. Note that these are not necessarily part of Hadoop’s public API, so they tend to change over time.
The Namenode
exposes:
/: Shows information about the namenode as well as the HDFS. There’s a link from here to browse the filesystem, as well.
/dfsnodelist.jsp?whatNodes=(DEAD|LIVE): Shows lists of nodes that are disconnected from (DEAD) or connected to (LIVE) the namenode.
/fsck: Runs the “fsck” command. Not recommended on a busy cluster.
/listPaths: Returns an XML-formatted directory listing. This is useful if you wish (for example) to poll HDFS to see if a file exists. The URL can include a path (e.g., /listPaths/user/philip) and can take optional GET arguments: /listPaths?recursive=yes will return all files on the file system; /listPaths/user/philip?filter=s.* will return all files in the home directory that start with s; and /listPaths/user/philip?exclude=.txt will return all files except text files in the home directory. Beware that filter and exclude operate on the directory listed in the URL, and they ignore the recursive flag.
/data and /fileChecksum: These forward your HTTP request to an appropriate datanode, which in turn returns the data or the checksum.
Datanodes
expose the following:
/browseBlock.jsp, /browseDirectory.jsp, /tail.jsp, /streamFile, /getFileChecksum: These are the endpoints that the namenode redirects to when you are browsing filesystem content. You probably wouldn’t use these directly, but this is what’s going on underneath.
/blockScannerReport: Every datanode verifies its blocks at configurable intervals. This endpoint provides a listing of that check.
The secondarynamenode
exposes a simple status page with information including which namenode it’s talking to, when the last checkpoint was, how big it was, and which directories it’s using.
The jobtracker’s UI is commonly used to look at running jobs and, especially, to find the causes of failed jobs. The UI is best browsed starting at /jobtracker.jsp. There are over a dozen related pages providing details on tasks, history, scheduling queues, jobs, etc.
Tasktrackers
have a simple page (/tasktracker.jsp
), which shows running tasks. They also expose /taskLog?taskid=
to query logs for a specific task. They use /mapOutput
to serve the output of map tasks to reducers, but this is an internal API.
Under the Covers for the Developer and the System Administrator
Internally, Hadoop mostly uses Hadoop IPC to communicate amongst servers. (Part of the goal of the Apache Avro project is to replace Hadoop IPC with something that is easier to evolve and more language-agnostic; HADOOP-6170 is the relevant ticket.) Hadoop also uses HTTP (for the secondarynamenode communicating with the namenode and for the tasktrackers serving map outputs to the reducers) and a raw network socket protocol (for datanodes copying around data).
The following table presents the ports and protocols (including the relevant Java class) that Hadoop uses. This table does not include the HTTP ports mentioned above.
Daemon | Default Port | Configuration Parameter | Protocol | Used for |
---|---|---|---|---|
Namenode | 8020 | fs.default.name | IPC: ClientProtocol | Filesystem metadata operations |
Datanode | 50010 | dfs.datanode.address | Custom Hadoop Xceiver: DataNode and DFSClient | DFS data transfer |
Datanode | 50020 | dfs.datanode.ipc.address | IPC: InterDatanodeProtocol, ClientDatanodeProtocol, ClientProtocol | Block metadata operations and recovery |
Backupnode | 50100 | dfs.backup.address | Same as namenode | HDFS Metadata Operations |

> Notes: 8020 is the port part of hdfs://host:8020/. The jobtracker’s default port is not well-defined; common values are 8021, 9001, or 8012 (see MAPREDUCE-566). Tasktrackers bind to an unused local port.
Multinode configuration
Node configuration
First, get every machine’s IP address. Assume we have 3 machines: 1 master and 2 slaves.
1 | master: 10.211.55.3 |
Then go to every machine’s /etc/hostname file and change it to the machine’s name (master, slave1, or slave2):
1 | sudo vim /etc/hostname |
Then go to every machine’s /etc/hosts file and add the same content:
1 | 10.211.55.3 master |
Then reboot all 3 machines and confirm every machine’s hostname:
1 | osboxes@master:~$ hostname |
Then, on every machine, use the ssh command to connect to every other machine (this is used to enable SSH between the nodes without a password prompt):
1 | osboxes@master:~$ ssh slave1 |
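If password-less logins are not already working between the nodes, one common way to set them up (a sketch; the osboxes user and host names come from this tutorial, and ssh-copy-id assumes OpenSSH) is to push the master’s public key to every node:
# Run on the master so start-dfs.sh / start-yarn.sh can reach all nodes without a password
ssh-copy-id osboxes@master
ssh-copy-id osboxes@slave1
ssh-copy-id osboxes@slave2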
Update core-site.xml on all machines
Delete this configuration:
1 | <property> |
And change the fs.default property to:
1 | <property> |
Update hdfs-site.xml on all machines
On the master node, delete the datanode property and change dfs.replication to 2:
1 | <!-- Put site-specific property overrides in this file. --> |
On the slave nodes, delete the namenode property and change dfs.replication to 2:
1 | <configuration> |
Update yarn-site.xml on all machines
Insert the following on all 3 machines:
1 | <property> |
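A sketch of the properties commonly inserted here so the NodeManagers can locate the ResourceManager on the master; treat the exact set as an assumption, since the original block is truncated:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>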
Update mapred-site.xml
1 | <property> |
Master or Slaves only configuration
Edit the ~/Program/hadoop/etc/hadoop/slaves file (Master only) and change it to this value:
1 | slave1 |
Edit the ~/Program/hadoop/etc/hadoop/masters file (Master only) and change it to this value:
1 | master |
Recreate Namenode folder (Master Only):
1 | mkdir -p ~/Program/hadoop/tmp/dfs/namenode |
Recreate Datanode folder (All slave Nodes only):
1 | mkdir -p ~/Program/hadoop/tmp/dfs/datanode |
Format the Namenode (Master only)
1 | hdfs namenode -format |
Start the DFS & Yarn (Master only)
1 | osboxes@master:~/Program/hadoop/sbin$ ./start-dfs.sh |
If you go to your slave machine, then you will see:
1 | osboxes@slave1:~/Program/hadoop/tmp/dfs$ jps |
Review Yarn console
If all the services started successfully on all nodes, then you should see all of your nodes listed under Yarn nodes. You can hit the following url on your browser and verify that:
http://master:8088/cluster/nodes
http://master:50070
You can also get the report of your cluster by issuing the below commands:
1 | hdfs dfsadmin -report |