Python - PipeMapRed.waitOutputThreads (): subprocess crash with code 1
I want to scrape websites, use BeautifulSoup to filter out what I need, and write the result to a CSV file in HDFS.
Right now I am at the step of filtering the page source with BeautifulSoup.
I want to run it as a MapReduce job:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar
-mapper /pytemp/filter.py
-input /user/root/py/input/
-output /user/root/py/output40/
Each line of the input file is a key/value pair: (key, value) = (url, content), where the content looks like:
<html><head><title>...</title></head><body>...</body></html>
filter.py file:
#!/usr/bin/env python
#coding:utf-8
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")
    #if the following two lines do not exist, the program will execute successfully
    soup = BeautifulSoup(content)
    output = soup.find()
    print("Start-----------------")
    print("End------------------")
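One thing worth checking in a mapper like this: HTML content routinely contains commas, so `line.split(",")` can return more than two fields and the two-name unpacking will fail. A minimal sketch (the sample URL and HTML are made up for illustration) of splitting on the first comma only:

```python
# Hypothetical input line: the HTML value itself contains a comma.
line = "http://example.com,<html><head><title>a,b</title></head></html>"

# A plain split produces three fields here, so `key, content = ...` would fail.
fields = line.split(",")
assert len(fields) == 3

# Splitting on the first comma only keeps the rest of the line intact.
key, content = line.split(",", 1)
assert key == "http://example.com"
assert content.startswith("<html>")
```

The `maxsplit` argument of `str.split` limits how many splits are performed, so everything after the first comma stays in `content`.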
By the way, I don't think I need a reducer for this job.
However, I get the following error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
This answer says it was caused by a memory problem, but my input file is only 3 MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset
I have no idea what is wrong with my program. I have searched a lot, but it still doesn't work.
My environment:
- CentOS6
- python2.7
- Cloudera CDH5
I would be grateful for your help in this situation.
EDIT 2016/06/24
First of all, I checked the error logs and found that the problem was "too many values to unpack" (thanks also to @kynan's answer).
Here is an example of why this happened:
<font color="#0000FF">
SomeText1
<font color="#0000FF">
SomeText2
</font>
</font>
If the content looks like the above and I call soup.find_all("font", color="#0000FF") and unpack the output, both font elements match, so more values come back than there are variables to assign them to. That is why I get the error "too many values to unpack".
Solution
Just change output = soup.find()
to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar)
and it works well :)
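The unpacking failure itself can be shown without BeautifulSoup; a plain list stands in for the ResultSet that find_all() returns (the strings below are placeholders for the matched elements):

```python
# Two matches, standing in for the two nested <font> elements.
matches = ["<font>SomeText1</font>", "<font>SomeText2</font>"]

# Unpacking two values into one target raises ValueError.
try:
    (var1,) = matches
    raised = False
except ValueError:
    raised = True
assert raised

# Matching the number of targets to the number of results works.
var1, var2 = matches
assert var1.startswith("<font>")
```

This is also why limit=AmountOfVar helps: it caps find_all() at exactly as many results as there are variables on the left-hand side.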
This error usually means the mapper process died. To see why, check the user logs in $HADOOP_PREFIX/logs/userlogs
: there is one directory per job and, within it, one directory per container. Each container directory contains a file stderr
with the output the process wrote to stderr, i.e. its error messages.
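To avoid clicking through the directory tree by hand, the stderr files can be located with a short walk of the log root (a sketch; the $HADOOP_PREFIX fallback path below is an assumption, adjust it for your installation):

```python
import os

# Root of the per-job/per-container logs described above.
log_root = os.path.join(
    os.environ.get("HADOOP_PREFIX", "/usr/lib/hadoop"), "logs", "userlogs"
)

# Walk job_*/container_* directories and print every stderr file found.
for dirpath, dirnames, filenames in os.walk(log_root):
    if "stderr" in filenames:
        print(os.path.join(dirpath, "stderr"))
```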