Python - PipeMapRed.waitOutputThreads (): subprocess crash with code 1
I want to scrape websites, use BeautifulSoup to filter out what I need, and write the result to a CSV file in HDFS.
Right now I am at the step of filtering the page source with BeautifulSoup.
I want to run it as a MapReduce job:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar
-mapper /pytemp/filter.py
-input /user/root/py/input/
-output /user/root/py/output40/
Each line of the input file is a key/value pair: (key, value) = (url, content), where the content looks like:
<html><head><title>...</title></head><body>...</body></html>
filter.py file:
#!/usr/bin/env python
#coding:utf-8
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")
    #if the following two lines do not exist, the program will execute successfully
    soup = BeautifulSoup(content)
    output = soup.find()
    print("Start-----------------")
    print("End------------------")
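One thing worth checking in a mapper like this: HTML content routinely contains commas, so `line.split(",")` can return more than two fields and the two-name unpacking will fail. A minimal sketch (the sample URL and HTML are made up for illustration) of splitting on the first comma only:

```python
# Hypothetical input line: the HTML value itself contains a comma.
line = "http://example.com,<html><head><title>a,b</title></head></html>"

# A plain split produces three fields here, so `key, content = ...` would fail.
fields = line.split(",")
assert len(fields) == 3

# Splitting on the first comma only keeps the rest of the line intact.
key, content = line.split(",", 1)
assert key == "http://example.com"
assert content.startswith("<html>")
```

The `maxsplit` argument of `str.split` limits how many splits are performed, so everything after the first comma stays in `content`.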
By the way, I don't think I need a reducer for this job.
However, I get the following error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
This answer says it was caused by a memory problem, but my input file is only 3 MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset
I have no idea what is wrong with my program. I have searched a lot, but it still doesn't work.
My environment:
- CentOS6
- python2.7
- Cloudera CDH5
I would be grateful for your help in this situation.
EDIT 2016/06/24
First of all, I checked the error logs and found that the problem was "too many values to unpack" (thanks also to @kynan's answer).
Here is an example of why this happened:
<font color="#0000FF">
SomeText1
<font color="#0000FF">
SomeText2
</font>
</font>
If the content looks like the above and I call soup.find_all("font", color="#0000FF") and unpack the output, both font elements match, so more values come back than there are variables to assign them to. That is why I get the error "too many values to unpack".
Solution
Just change output = soup.find()
to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar)
and it works well :)
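The unpacking failure itself can be shown without BeautifulSoup; a plain list stands in for the ResultSet that find_all() returns (the strings below are placeholders for the matched elements):

```python
# Two matches, standing in for the two nested <font> elements.
matches = ["<font>SomeText1</font>", "<font>SomeText2</font>"]

# Unpacking two values into one target raises ValueError.
try:
    (var1,) = matches
    raised = False
except ValueError:
    raised = True
assert raised

# Matching the number of targets to the number of results works.
var1, var2 = matches
assert var1.startswith("<font>")
```

This is also why limit=AmountOfVar helps: it caps find_all() at exactly as many results as there are variables on the left-hand side.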
This error usually means the mapper process died. To see why, check the user logs in $HADOOP_PREFIX/logs/userlogs
: there is one directory per job and, within it, one directory per container. Each container directory contains a file stderr
with the output the process wrote to stderr, i.e. its error messages.
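To avoid clicking through the directory tree by hand, the stderr files can be located with a short walk of the log root (a sketch; the $HADOOP_PREFIX fallback path below is an assumption, adjust it for your installation):

```python
import os

# Root of the per-job/per-container logs described above.
log_root = os.path.join(
    os.environ.get("HADOOP_PREFIX", "/usr/lib/hadoop"), "logs", "userlogs"
)

# Walk job_*/container_* directories and print every stderr file found.
for dirpath, dirnames, filenames in os.walk(log_root):
    if "stderr" in filenames:
        print(os.path.join(dirpath, "stderr"))
```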