Hadoop, when run under Spark, merges its stderr into its stdout

When I type

hadoop fs -text /foo/bar/baz.bz2 2>err 1>out


I am getting two non-empty files: err, containing

2015-05-26 15:33:49,786 INFO  [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized native-bzip2 library system-native
2015-05-26 15:33:49,789 INFO  [main] compress.CodecPool (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]

and out, containing the contents of the file (as expected).

When I call the same command from Python (2.6):

from subprocess import Popen
with open("out","w") as out:
    with open("err","w") as err:
        p = Popen(['hadoop','fs','-text',"/foo/bar/baz.bz2"],
                  stdin=None,stdout=out,stderr=err)
print p.wait()


I am getting the same (correct) behavior.

However, when I run the same code under PySpark (or using spark-submit), I get an empty err file, and the out file starts with the log messages above (and then contains the actual data).
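
A quick way to see the same thing without going through files is to capture both streams with pipes and check which one the log lines land on. A minimal sketch (Python 2, same placeholder path as above):

from subprocess import Popen, PIPE

# Capture both streams in memory instead of files to see where the
# bzip2/codec log lines actually arrive when launched from PySpark.
p = Popen(['hadoop', 'fs', '-text', '/foo/bar/baz.bz2'],
          stdin=None, stdout=PIPE, stderr=PIPE)
out_data, err_data = p.communicate()
print "return code:", p.returncode
print "log lines on stderr?", "bzip2.Bzip2Factory" in err_data
print "log lines on stdout?", "bzip2.Bzip2Factory" in out_data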

What am I doing wrong?

NB: The purpose of the Python code is to feed the output of hadoop fs -text to another program (i.e., to pass stdout=PIPE to Popen), so please don't suggest hadoop fs -get. Thanks.
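
For context, a rough sketch of the intended use: hadoop's stdout is piped straight into another program (downstream_tool is just a placeholder name for the real consumer):

from subprocess import Popen, PIPE

# Hypothetical pipeline: hadoop's stdout feeds another program directly;
# "downstream_tool" is a placeholder for the real consumer.
with open("err", "w") as err:
    producer = Popen(['hadoop', 'fs', '-text', '/foo/bar/baz.bz2'],
                     stdin=None, stdout=PIPE, stderr=err)
    consumer = Popen(['downstream_tool'], stdin=producer.stdout)
    producer.stdout.close()  # so the consumer sees EOF when hadoop exits
print consumer.wait()
print producer.wait()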

PS. When I run hadoop under time:

from subprocess import Popen
with open("out","w") as out:
    with open("err","w") as err:
        p = Popen(['/usr/bin/time','hadoop','fs','-text',"/foo/bar/baz.bz2"],
                  stdin=None,stdout=out,stderr=err)
print p.wait()


The output of time correctly goes to err, but the output of hadoop still incorrectly ends up in out.

I.e., hadoop combines its stderr into its stdout when it runs under Spark.
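
Since time's output lands where it should, the descriptor wiring done by Popen looks right, and the merge seems to happen inside hadoop itself after the fork. A rough, Linux-only sketch (it assumes /proc is available) of how one can have the child report where its stdout and stderr actually point before exec'ing hadoop:

from subprocess import Popen

# Linux-only check (assumes /proc): the wrapper shell prints where its own
# stdout (fd 1) and stderr (fd 2) point, then exec's hadoop as usual.
with open("out", "w") as out:
    with open("err", "w") as err:
        p = Popen(['bash', '-c',
                   'ls -l /proc/$$/fd/1 /proc/$$/fd/2 >&2; '
                   'exec hadoop fs -text /foo/bar/baz.bz2'],
                  stdin=None, stdout=out, stderr=err)
print p.wait()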
