Failure to create temp file on datanode using Hadoop
I would like to create a file during my program's execution. However, I don't want this file written to HDFS, but to the local filesystem of the datanode where the map operation is being performed.
I tried the following approach:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile();
    } catch (IOException e) {
        System.out.println("Message easy to look up in the logs.");
        System.err.println("Error easy to look up in the logs.");
        e.printStackTrace();
        throw e;
    }
}
With an absolute path, the file is created where it should be. With a relative path, however, this code throws no errors, neither in the console from which I run the program nor in the job logs, yet I cannot find the file that should have been created (I am working on a local cluster right now).
Any ideas where to find the file or an error message? And if there is indeed an error, how do I go about writing files to the datanode's local file system?
newFile.txt is a relative path, so the file will be created relative to the working directory of the task process. That lands somewhere under the directories the NodeManager uses for containers. This is the config property yarn.nodemanager.local-dirs in yarn-site.xml, or the default inherited from yarn-default.xml, which places it under /tmp:
<property>
  <description>List of directories to store localized files in. An
    application's localized file directory will be found in:
    ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
    Individual containers' work directories, called container_${contid}, will
    be subdirectories of this.
  </description>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
Here's a concrete example of one such directory in my test environment:
/tmp/hadoop-cnauroth/nm-local-dir/usercache/cnauroth/appcache/application_1363932793646_0002/container_1363932793646_0002_01_000001
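One quick way to confirm where the relative path actually resolves is to log the file's absolute path from inside the task; the output shows up in that task's stdout log. A minimal standalone sketch (the same calls work inside map(); the class name is mine, not from the question):
import java.io.File;
import java.io.IOException;

public class WhereDidItGo {
    public static void main(String[] args) throws IOException {
        File f = new File("newFile.txt");
        // createNewFile() returns false if the file already existed
        System.out.println("Created: " + f.createNewFile());
        // Resolved against the process working directory; inside a YARN
        // container that is the container_* directory shown above
        System.out.println("Absolute path: " + f.getAbsolutePath());
        System.out.println("Working directory: " + System.getProperty("user.dir"));
    }
}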
These directories are scratch space for the container's execution, so they are not something you can rely on for persistence. A background thread periodically deletes these files for completed containers. You can postpone the cleanup by setting the config property yarn.nodemanager.delete.debug-delay-sec in yarn-site.xml:
<property>
  <description>
    Number of seconds after an application finishes before the nodemanager's
    DeletionService will delete the application's localized file directory
    and log directory.

    To diagnose Yarn application problems, set this property's value large
    enough (for example, to 600 = 10 minutes) to permit examination of these
    directories. After changing the property's value, you must restart the
    nodemanager in order for it to have an effect.

    The roots of Yarn applications' work directories is configurable with
    the yarn.nodemanager.local-dirs property (see below), and the roots
    of the Yarn applications' log directories is configurable with the
    yarn.nodemanager.log-dirs property (see also below).
  </description>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>
However, keep in mind that this setting exists for troubleshooting only, so that you can examine the directories more easily; it is not recommended as a permanent production configuration. If application logic relied on the deletion delay, that would create a race condition between the application trying to access the directory and the NodeManager trying to delete it. Files lingering from old containers also risk cluttering local disk space.
Your System.out/System.err messages go to the stdout/stderr logs of the map task, and I suspect the cleanup is not deleting those log messages. Instead, I suspect you are successfully creating the file, but either it is hard to find (the directory structure has somewhat unpredictable components such as the application ID and the YARN-managed container ID), or the file is cleaned up before you can get to it.
If you changed your code to use an absolute path pointing to some other directory, that would help. However, I don't expect this approach to work well in practice: since Hadoop is distributed, it can be hard to tell which node in a cluster of hundreds or thousands holds the local file you want. Instead, you might be better off writing to HDFS and then pulling the files you need down to the node where you launched the job.
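For reference, here is a hedged sketch of that HDFS-first approach using the standard FileSystem API; the /tmp/myjob path, the class name, and the per-task file naming are illustrative choices, not anything from the original code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSideFile {
    // Inside a mapper, pass context.getConfiguration() and
    // context.getTaskAttemptID().toString() as the arguments.
    public static void write(Configuration conf, String taskId) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // One file per task attempt avoids collisions between parallel tasks
        Path out = new Path("/tmp/myjob/newFile-" + taskId + ".txt");
        try (FSDataOutputStream stream = fs.create(out, true)) { // overwrite = true
            stream.writeUTF("side output from task " + taskId);
        }
    }
}
After the job finishes, something like hdfs dfs -get /tmp/myjob/newFile-*.txt pulls the files down to the machine where you launched the job.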