Failure to create temp file on datanode using Hadoop
I would like to create a file during my program's execution. However, I don't want this file written to HDFS, but to the local filesystem of the datanode where the map operation is being performed.
I tried the following approach:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile();
    } catch (IOException e) {
        System.out.println("Message easy to look up in the logs.");
        System.err.println("Error easy to look up in the logs.");
        e.printStackTrace();
        throw e;
    }
}
With an absolute path, the file is created where it should be. With a relative path, however, this code throws no errors, neither in the console from which I run the program nor in the job logs, yet I cannot find the file that should have been created (I am working on a local cluster right now).
Any ideas where to find the file or an error message? And if there is indeed an error, how do I go about writing files to the datanode's local file system?
newFile.txt is a relative path, so the file will be created relative to the working directory of the task process. That lands somewhere under the directories the NodeManager uses for containers. This is the config property yarn.nodemanager.local-dirs in yarn-site.xml, or the default inherited from yarn-default.xml, which places it under /tmp:
<property>
  <description>List of directories to store localized files in. An
    application's localized file directory will be found in:
    ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
    Individual containers' work directories, called container_${contid}, will
    be subdirectories of this.
  </description>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
Here's a concrete example of one such directory in my test environment:
/tmp/hadoop-cnauroth/nm-local-dir/usercache/cnauroth/appcache/application_1363932793646_0002/container_1363932793646_0002_01_000001
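One quick way to confirm where the relative path actually resolves is to log the file's absolute path from inside the task; the output shows up in that task's stdout log. A minimal standalone sketch (the same calls work inside map(); the class name is mine, not from the question):
import java.io.File;
import java.io.IOException;

public class WhereDidItGo {
    public static void main(String[] args) throws IOException {
        File f = new File("newFile.txt");
        // createNewFile() returns false if the file already existed
        System.out.println("Created: " + f.createNewFile());
        // Resolved against the process working directory; inside a YARN
        // container that is the container_* directory shown above
        System.out.println("Absolute path: " + f.getAbsolutePath());
        System.out.println("Working directory: " + System.getProperty("user.dir"));
    }
}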
These directories are scratch space for the container's execution, so they are not something you can rely on for persistence. A background thread periodically deletes these files for completed containers. You can postpone the cleanup by setting the config property yarn.nodemanager.delete.debug-delay-sec in yarn-site.xml:
<property>
  <description>
    Number of seconds after an application finishes before the nodemanager's
    DeletionService will delete the application's localized file directory
    and log directory.

    To diagnose Yarn application problems, set this property's value large
    enough (for example, to 600 = 10 minutes) to permit examination of these
    directories. After changing the property's value, you must restart the
    nodemanager in order for it to have an effect.

    The roots of Yarn applications' work directories is configurable with
    the yarn.nodemanager.local-dirs property (see below), and the roots
    of the Yarn applications' log directories is configurable with the
    yarn.nodemanager.log-dirs property (see also below).
  </description>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>
However, keep in mind that this setting exists for troubleshooting only, so that you can examine the directories more easily; it is not recommended as a permanent production configuration. If application logic relied on the deletion delay, that would create a race condition between the application trying to access the directory and the NodeManager trying to delete it. Files lingering from old containers also risk cluttering local disk space.
Your System.out/System.err messages go to the stdout/stderr logs of the map task, and I suspect the cleanup is not deleting those log messages. Instead, I suspect you are successfully creating the file, but either it is hard to find (the directory structure has somewhat unpredictable components such as the application ID and the YARN-managed container ID), or the file is cleaned up before you can get to it.
If you changed your code to use an absolute path pointing to some other directory, that would help. However, I don't expect this approach to work well in practice: since Hadoop is distributed, it can be hard to tell which node in a cluster of hundreds or thousands holds the local file you want. Instead, you might be better off writing to HDFS and then pulling the files you need down to the node where you launched the job.
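For reference, here is a hedged sketch of that HDFS-first approach using the standard FileSystem API; the /tmp/myjob path, the class name, and the per-task file naming are illustrative choices, not anything from the original code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSideFile {
    // Inside a mapper, pass context.getConfiguration() and
    // context.getTaskAttemptID().toString() as the arguments.
    public static void write(Configuration conf, String taskId) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // One file per task attempt avoids collisions between parallel tasks
        Path out = new Path("/tmp/myjob/newFile-" + taskId + ".txt");
        try (FSDataOutputStream stream = fs.create(out, true)) { // overwrite = true
            stream.writeUTF("side output from task " + taskId);
        }
    }
}
After the job finishes, something like hdfs dfs -get /tmp/myjob/newFile-*.txt pulls the files down to the machine where you launched the job.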