Pig script fails with java.io.EOFException: Unexpected end of input stream

I have a Pig script that extracts a set of fields using a regex and stores the data in a Hive table.

--Load data

cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);

--There are two types of records; filter for type 1 - the dst_country field seems unique to those lines

cisoFortiGateDataType1 = FILTER cisoFortiGateDataAll BY (line matches '.*dst_country.*');

--Parse each line and pick up the required fields

cisoFortiGateDataType1Required = FOREACH cisoFortiGateDataType1 GENERATE
 FLATTEN(
 REGEX_EXTRACT_ALL(line, '(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\sdate=(.*?)\\s+time=(.*?)\\sdevname=(.*?)\\sdevice_id=(.*?)\\slog_id=(.*?)\\stype=(.*?)\\ssubtype=(.*?)\\spri=(.*?)\\svd=(.*?)\\ssrc=(.*?)\\ssrc_port=(.*?)\\ssrc_int=(.*?)\\sdst=(.*?)\\sdst_port=(.*?)\\sdst_int=(.*?)\\sSN=(.*?)\\sstatus=(.*?)\\spolicyid=(.*?)\\sdst_country=(.*?)\\ssrc_country=(.*?)\\s(.*?\\s.*)+')
 ) AS (
 rmonth:chararray, rdate:chararray, rtime:chararray, ip:chararray, date:chararray, time:chararray,
 devname:chararray, deviceid:chararray, logid:chararray, type:chararray, subtype:chararray,
 pri:chararray, vd:chararray, src:chararray, srcport:chararray, srcint:chararray, dst:chararray,
 dstport:chararray, dstint:chararray, sn:chararray, status:chararray, policyid:chararray,
 dstcountry:chararray, srccountry:chararray, rest:chararray );

--Store to Hive table

STORE cisoFortiGateDataType1Required INTO 'ciso_db.fortigate_type1_1_table' USING org.apache.hcatalog.pig.HCatStorer();

      

The script works fine with a small file, but with a larger file (750 MB) it fails with the exception below. Any idea how I can debug this and find the root cause? (A minimal script for isolating the failure is sketched after the stack trace.)

2014-09-03 15:31:33,562 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
        at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:58)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)

      

1 answer


Check the size of the text you are loading into line:chararray. If a single line is larger than the HDFS block size (64 MB), you may get this error.
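
A quick way to check, assuming the same input path as in the question: SIZE on a chararray returns its number of characters, so the longest line can be found with a short script.

--Sketch: report the length of the longest input line

rawLines = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);

lineLens = FOREACH rawLines GENERATE SIZE(line) AS len;

maxLen = FOREACH (GROUP lineLens ALL) GENERATE MAX(lineLens.len);

DUMP maxLen;

If the result is far below the block size, line length is probably not the problem, and a truncated archive becomes the more likely cause.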


