Apache POI WorkbookFactory.create throws java.lang.OutOfMemoryError: Java heap space

My problem is pretty simple. I want to check files of up to 50 MB for correct formatting on App Engine.

This presents a couple of serious problems right now. The first is the Apache POI API for XLS / XLSX. When I load a 20 MB file into memory locally before validating it, it throws:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:128)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:84)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:272)
    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:79)


I need to open and validate spreadsheets between 20 and 25 MB; 50 MB would be a nice stretch goal if possible. We are talking about hundreds of thousands of rows of data per sheet.

My current code loads the entire file into memory and quickly blows the heap of my App Engine instance. Here it is:

    import java.util.Iterator;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;

    public ErrorLog validateWorkbook(Workbook inputWorkbook)
    {
        ErrorLog errorLog = new ErrorLog();
        int sheetCount = inputWorkbook.getNumberOfSheets();
        for (int x = 0; x < sheetCount; x++)
        {
            Sheet currentSheet = inputWorkbook.getSheetAt(x);
            Iterator<Row> rowIterator = currentSheet.rowIterator();
            while (rowIterator.hasNext())
            {
                Iterator<Cell> cellIterator = rowIterator.next().cellIterator();
                while (cellIterator.hasNext())
                {
                    Cell currentCell = cellIterator.next();
                    boolean success = validateCellContents(currentCell);
                    if (!success)
                        errorLog.appendError(new Error()); // detailed user error explicitly defining error location, cell value, and recommended steps to fix
                }
            }
        }
        return errorLog;
    }


There are now supposedly event-driven ways to have a handler invoked every time a cell is encountered. But in the sample code there is this line:

    ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(container);


I inspected this object in the debugger, and it contains every unique string in the workbook. That is exactly what I am trying to avoid: it allocates a large block of memory up front to hold every value. The ideal solution would take a stream of input bytes and decode rows as it moves through the file, keeping the memory footprint small.

The string table alone will take up a lot of memory, since I am processing spreadsheets with 150,000 - 300,000 entries.

Now the quick guide mentions that you can pass either a File or an InputStream, and that the InputStream route buffers the whole file in memory while the File route does not. The problem is that App Engine's Blobstore has no concept of File objects and only hands back InputStreams (as far as I know).
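
For concreteness, this is roughly the path my data takes today (a minimal sketch; BlobWorkbookLoader and the key handling are simplified placeholders). It is this InputStream route that forces the whole zip into memory:

    import java.io.InputStream;
    import com.google.appengine.api.blobstore.BlobKey;
    import com.google.appengine.api.blobstore.BlobstoreInputStream;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    public class BlobWorkbookLoader {
        // Blobstore exposes uploads only as streams, never as java.io.File,
        // so WorkbookFactory takes the buffer-everything-in-memory code path.
        public static Workbook load(String blobKeyString) throws Exception {
            InputStream in = new BlobstoreInputStream(new BlobKey(blobKeyString));
            return WorkbookFactory.create(in); // this is where the OOME hits
        }
    }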

Also, the other event-driven model's DefaultHandler seems to have no concept of the column or row a value belongs to in the interface methods it calls for each event (and it also allocates the entire shared strings table).
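
For reference, here is the closest streaming shape I have found, a minimal sketch built on XSSFReader and XSSFSheetXMLHandler (recent POI versions; StreamingValidator and its isValid rule are my own placeholders). The SheetContentsHandler callbacks do at least carry the cell reference, but the ReadOnlySharedStringsTable is still built up front, which is exactly the allocation I want to avoid:

    import java.io.InputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
    import org.apache.poi.xssf.eventusermodel.XSSFReader;
    import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
    import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
    import org.apache.poi.xssf.model.StylesTable;
    import org.apache.poi.xssf.usermodel.XSSFComment;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class StreamingValidator implements SheetContentsHandler {

        public void startRow(int rowNum) { }
        public void endRow(int rowNum) { }
        public void headerFooter(String text, boolean isHeader, String tagName) { }

        // One callback per cell; rows are never materialized as objects,
        // and the cell reference (e.g. "C42") carries row and column.
        public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            if (!isValid(formattedValue)) {
                System.err.println("Bad value at " + cellReference + ": " + formattedValue);
            }
        }

        private boolean isValid(String value) { // placeholder validation rule
            return value != null && !value.isEmpty();
        }

        public static void validate(InputStream in) throws Exception {
            OPCPackage pkg = OPCPackage.open(in);       // still buffers the zip itself
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XSSFReader reader = new XSSFReader(pkg);
            StylesTable styles = reader.getStylesTable();
            XSSFReader.SheetIterator sheets = (XSSFReader.SheetIterator) reader.getSheetsData();
            while (sheets.hasNext()) {
                InputStream sheet = sheets.next();
                try {
                    XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
                    parser.setContentHandler(new XSSFSheetXMLHandler(
                            styles, strings, new StreamingValidator(), false));
                    parser.parse(new InputSource(sheet));
                } finally {
                    sheet.close();
                }
            }
        }
    }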

Throw your ideas at me! I will probably put a bounty on this. Even a definitive "no, that's not possible" would be enough; then I can start looking for workarounds. But I have a feeling I'm just not using this extensive API the way I should be.



1 answer



You can do this, but you'll have to get creative to get around some of GAE's limitations.

First, frontend instances have a 60-second request deadline, so if you want to process files as large as 50 MB you will have to use a module with manual or basic scaling to avoid that time limit.
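
For example, a dedicated module configured roughly like this sidesteps the deadline (a sketch; the module name, instance class, and instance count are placeholders to adapt):

    <!-- appengine-web.xml for a dedicated validation module (sketch) -->
    <appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
        <application>your-app-id</application>
        <module>workbook-validator</module>
        <version>1</version>
        <!-- B-class instances allow basic/manual scaling and long-running requests -->
        <instance-class>B4</instance-class>
        <basic-scaling>
            <max-instances>2</max-instances>
        </basic-scaling>
    </appengine-web-app>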



Second, memory. Modules help here too: they give you control over the instance class, and therefore the memory, of your instances, which is a step in the right direction, though the larger instance classes don't scale as well.

I was in your situation, and I ended up using either the Google Drive API + Google Spreadsheets API or the Blobstore service, depending on the requirement. With either alternative I uploaded the Excel files first so that I could process them offline using task queues.
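
The offline part is just a task queue push from the upload handler; a minimal sketch (the queue name and handler URL are placeholders):

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    public class ValidationEnqueuer {
        // The user-facing request returns immediately; the heavy parsing runs
        // later on the long-deadline module behind /tasks/validate.
        public static void enqueueValidation(String blobKeyString) {
            Queue queue = QueueFactory.getQueue("workbook-validation");
            queue.add(TaskOptions.Builder
                    .withUrl("/tasks/validate")
                    .param("blobKey", blobKeyString));
        }
    }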







