Heap space error when running a Pig script
I am trying to execute a Pig script over roughly 30 million records and I am getting the following heap space error:
> ERROR 2998: Unhandled internal error. Java heap space
>
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2367)
> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> at java.lang.StringBuilder.append(StringBuilder.java:132)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.shiftStringByTabs(LogicalPlanPrinter.java:223)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:108)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirst(LogicalPlanPrinter.java:102)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.depthFirstLP(LogicalPlanPrinter.java:83)
> at org.apache.pig.newplan.logical.optimizer.LogicalPlanPrinter.visit(LogicalPlanPrinter.java:69)
> at org.apache.pig.newplan.logical.relational.LogicalPlan.getLogicalPlanString(LogicalPlan.java:148)
> at org.apache.pig.newplan.logical.relational.LogicalPlan.getSignature(LogicalPlan.java:133)
> at org.apache.pig.PigServer.execute(PigServer.java:1295)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
> at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
> at org.apache.pig.Main.run(Main.java:607)
> at org.apache.pig.Main.main(Main.java:156)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> ================================================================================
I ran the same script over 10 million records and it worked fine.
What are the possible ways to avoid this problem?
Would compression help avoid the heap memory problem?
I also tried splitting the script into multiple smaller chunks and I still get the error. If increasing the heap size fixes it for this amount of data, is the fix guaranteed to hold as the data grows further?
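By increasing the heap size I mean raising the heap of the Pig client JVM before launching the script, along these lines (a sketch assuming the standard Apache Pig bin/pig launcher, which reads PIG_HEAPSIZE as a value in megabytes; myscript.pig stands in for my actual script):
# raise the heap of the Pig front-end JVM (value in MB) -- the OOM above
# is thrown on the client side while Pig builds the logical plan string,
# not inside a map/reduce task
export PIG_HEAPSIZE=4096
pig myscript.pig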
One possible cause is that you have a giant line in your data that doesn't fit into memory.
To check, you can run this bash command on a node of your cluster:
hdfs dfs -cat '/some/path/to/hdfs.file' | awk '{if (length($0) > SOME_REASONABLE_VALUE) print $0}' > large_lines
If your data is spread across more than one file, you can use a glob, e.g. /some/path/to/hdfs.dir/part*.
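For instance, the same check over every part file might look like this (the path is hypothetical and the 1,000,000-character threshold is just an illustrative choice):
hdfs dfs -cat '/some/path/to/hdfs.dir/part*' | awk '{if (length($0) > 1000000) print $0}' > large_lines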
Then you should check if there are any huge lines:
less large_lines
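If the offending lines turn out to be too large to inspect comfortably even in less, a variant that prints only each long line's number and length (same illustrative threshold and path as above) avoids writing the giant lines to local disk:
hdfs dfs -cat '/some/path/to/hdfs.dir/part*' | awk '{if (length($0) > 1000000) print NR, length($0)}'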