Running multiple datanodes on one machine

I have a few Hadoop-related questions about a setup we plan to implement in a production environment.

We have a large cluster of machines; each is a server with a large amount of RAM and 8 cores. Each of the 40 machines collects about 60 GB of data every 5 minutes. These machines are distributed across multiple locations around the world. A separate server machine will act as the namenode in the Hadoop environment, and the remaining 40 machines are the data collectors; I will make them part of the Hadoop cluster as datanodes.

Since the volume of data collected on each machine is quite high, I do not want my data to move between servers across geographic regions. So here are my 2 requirements:

1) I want my 60 GB of data to be split into blocks but processed locally. For this, I want to have multiple datanode daemons on the same server. Is it possible to run multiple datanodes on the same server?

2) Is it possible to process blocks on specified datanodes?

I'll take an example to clarify my point. Let's say I have server machines A, B, C, D, ...

Machine A will have 60 GB of data every 5 minutes. Can I run multiple datanode daemons on machine A? If so, can I tell my namenode to send blocks only to the datanode daemons running on server A, and not to other machines?

I do not need high availability of the data and do not require fault tolerance, so there is no need to replicate the data.



2 answers


To run multiple datanodes on a single node, first download or build Hadoop:

1) Download the Hadoop binary, or build it from the Hadoop source.

2) Prepare your Hadoop configuration to run on a single node (change the default tmp location in Hadoop from /tmp to another reliable location).

3) Add the following script to your $HADOOP_HOME/bin directory and chmod it to 744.

4) Format HDFS: bin/hadoop namenode -format (for Hadoop 0.20 and below), bin/hdfs namenode -format (for version 0.21 and above).

5) Start HDFS: bin/start-dfs.sh (this will launch the namenode and 1 datanode), which can be viewed at http://localhost:50070

6) Run additional datanodes using bin/run-additionalDN.sh

run-additionalDN.sh

#!/bin/sh
# This is used for starting multiple datanodes on the same machine.
# Run it from hadoop-dir/ just like 'bin/hadoop'.

# Usage: run-additionalDN.sh [start|stop] dnnumber
# e.g. run-additionalDN.sh start 2

DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/"

if [ -z "$DN_DIR_PREFIX" ]; then
  echo "$0: DN_DIR_PREFIX is not set. Set it to something like /hadoopTmp/dn"
  exit 1
fi

run_datanode () {
  DN=$2
  export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
  export HADOOP_PID_DIR=$HADOOP_LOG_DIR
  # Give each extra datanode its own tmp dir and its own ports
  # (note the space before each trailing backslash: without it the
  # continued options would be concatenated into one string).
  DN_CONF_OPTS="\
-Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN \
-Ddfs.datanode.address=0.0.0.0:5001$DN \
-Ddfs.datanode.http.address=0.0.0.0:5008$DN \
-Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
  bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift

for i in "$@"
do
  run_datanode "$cmd" "$i"
done
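A note on the port scheme above (my reading of the script, not stated in the original answer): the datanode number DN is appended as a string to a port prefix, so each additional daemon gets its own data-transfer, HTTP and IPC ports without colliding. This also means DN numbers should stay single digits. For example:

```shell
# Ports derived for datanode numbers 1-3 by appending DN to the
# prefixes used in the script (5001 = data, 5008 = HTTP, 5002 = IPC).
for DN in 1 2 3; do
  echo "DN$DN data=5001$DN http=5008$DN ipc=5002$DN"
done
```

So datanode number 2 listens on 50012, 50082 and 50022.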


I hope this helps you.
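One more note on step 2: since the question says replication is not needed, dfs.replication can be lowered to 1 in hdfs-site.xml (1 is the minimum; HDFS always keeps at least one copy of each block, so it cannot be set to 0). A minimal sketch, assuming the standard HDFS configuration keys:

```xml
<!-- hdfs-site.xml: store each block exactly once (no extra replicas) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```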


Datanodes and namenodes are just pieces of software designed to run on any ordinary machine. So it is possible, but it is rarely done in the real world. If the risk you are concerned about is data becoming unavailable when a single server fails, then distributing the datanodes across different servers is the better idea.

Also, the official Apache site mentions:



The architecture does not preclude running multiple data nodes on the same computer, but in a real-world deployment that is rarely seen.

source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes

