Python - PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Recently I have been parsing websites, using BeautifulSoup to filter out what I want, and writing the result to a CSV file in HDFS.

I am currently at the step of filtering the page source with BeautifulSoup.

I want to run it as a MapReduce job:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar 
-mapper /pytemp/filter.py 
-input /user/root/py/input/ 
-output /user/root/py/output40/


The input file consists of key/value pairs, one per line: (key, value) = (url, content)

That is, the content looks like:

<html><head><title>...</title></head><body>...</body></html>


filter.py file:

#!/usr/bin/env python
#!/usr/bin/python
#coding:utf-8
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")

    # if the following two lines are removed, the program runs successfully
    soup = BeautifulSoup(content)
    output = soup.find()         

    print("Start-----------------")
    print("End------------------")


By the way, it seems to me that I don't need a reduce.py to do my job.

However, I got this error:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)


This answer says it was a memory problem, but my input file is only 3 MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset

I have no idea what is causing the problem. I have searched a lot, but it still doesn't work.

My environment:

  • CentOS 6
  • Python 2.7
  • Cloudera CDH 5

I would be grateful for your help in this situation.

EDIT 2016/06/24

First of all, I checked the error log and found the problem was "too many values to unpack". (Thanks also to @kynan's answer.)

Here is an example of why this happened:

<font color="#0000FF">
  SomeText1
  <font color="#0000FF">
    SomeText2
  </font>
</font>


If part of the content looks like the above and I call soup.find_all("font", color="#0000FF") and unpack the result, both the outer and the inner font tag match, so two elements are assigned where only one variable was expected, which is why the error is "too many values to unpack".
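
A minimal standalone sketch (hypothetical snippet, not taken from my real job) that reproduces the unpacking error:

#!/usr/bin/env python
# Sketch: nested font tags make find_all() return two matches,
# so unpacking into a single variable raises "too many values to unpack".
from bs4 import BeautifulSoup

html = '<font color="#0000FF">SomeText1<font color="#0000FF">SomeText2</font></font>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("font", color="#0000FF")
print(len(matches))  # 2 -- both the outer and the inner tag match

(var1,) = matches    # ValueError: too many values to unpack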

Solution

Just change output = soup.find()

to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar)

and it works well :)
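
For completeness, here is a sketch of how the fixed mapper might look. The variable names, the limit of 2, and the extra precaution of splitting only on the first comma are my own illustrative choices, not taken from the real job:

#!/usr/bin/env python
# coding: utf-8
# Sketch of the fixed mapper -- variable names and the limit value are placeholders.
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    line = line.strip()
    # Split only on the first comma, since the HTML content may itself contain commas.
    key, content = line.split(",", 1)

    soup = BeautifulSoup(content, "html.parser")
    # limit caps the number of matches so the unpacking gets exactly as many
    # elements as there are variables on the left-hand side.
    (var1, var2) = soup.find_all("font", color="#0000FF", limit=2)
    # var1 and var2 would be used for the real filtering/output here.

    print("Start-----------------")
    print("End------------------")

Note that this still assumes every input line contains at least as many matching tags as variables; a line with fewer matches would fail the unpacking with a different ValueError.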


1 answer


This error usually means that the mapper process died. To see why, check the user logs in $HADOOP_PREFIX/logs/userlogs: there is one directory per job and, within it, one directory per container. Each container directory contains a file stderr with whatever the process wrote to stderr, i.e. the error messages.
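
For illustration, a minimal sketch (not part of the original answer) that walks a userlogs directory and dumps every container's stderr file; the USERLOGS path below is an assumption and depends on your cluster's log configuration:

#!/usr/bin/env python
# Sketch: print all container stderr files for inspection.
# USERLOGS is an assumed path -- adjust it to $HADOOP_PREFIX/logs/userlogs
# or wherever your cluster keeps YARN container logs.
import os

USERLOGS = "/var/log/hadoop-yarn/userlogs"

for root, dirs, files in os.walk(USERLOGS):
    if "stderr" in files:
        path = os.path.join(root, "stderr")
        print("==== %s ====" % path)
        with open(path) as f:
            print(f.read())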


