Posted to dev@poi.apache.org by bu...@apache.org on 2010/12/08 00:03:41 UTC

DO NOT REPLY [Bug 50428] New: Need a way to avoid OutOfMemoryError's in RawDataBlockList

https://issues.apache.org/bugzilla/show_bug.cgi?id=50428

           Summary: Need a way to avoid OutOfMemoryError's in
                    RawDataBlockList
           Product: POI
           Version: unspecified
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: critical
          Priority: P2
         Component: POIFS
        AssignedTo: dev@poi.apache.org
        ReportedBy: dgoldenberg@attivio.com


We're processing very large MS Office files with the heap size tightly
limited to 100MB.

This causes OutOfMemoryErrors in RawDataBlockList.

java.lang.OutOfMemoryError: KERNEL-10 : Java heap space
    at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:68)
    at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:53)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:155)

RawDataBlockList loads all the blocks up to the end of the file. Is there any
way to limit this, perhaps with an optional "sliding window" of blocks that
gets repopulated on demand?
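
To make the "sliding window" idea concrete, here is a minimal sketch using
only the JDK. It is not POI code: BlockWindow, block() and cachedBlocks() are
hypothetical names, and for simplicity block indices count from the start of
the file rather than from the end of the OLE2 header. It keeps at most
maxBlocks blocks in memory and re-reads an evicted block from disk when it is
requested again.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; POI has no class by this name. */
final class BlockWindow implements AutoCloseable {
    private final FileChannel channel;
    private final int blockSize;
    private final Map<Integer, byte[]> cache;

    BlockWindow(Path file, int blockSize, final int maxBlocks) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.blockSize = blockSize;
        // Access-ordered map that evicts the least recently used block once full.
        this.cache = new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns block {@code index}, reading it from disk only if it is not cached. */
    byte[] block(int index) throws IOException {
        byte[] data = cache.get(index);
        if (data == null) {
            ByteBuffer buf = ByteBuffer.allocate(blockSize);
            long pos = (long) index * blockSize;
            while (buf.hasRemaining()) {
                int n = channel.read(buf, pos + buf.position());
                if (n < 0) {
                    if (buf.position() == 0) {
                        throw new IOException("Block " + index + " is past end of file");
                    }
                    break; // short final block; the remainder stays zero-filled
                }
            }
            data = buf.array();
            cache.put(index, data);
        }
        return data;
    }

    /** Number of blocks currently held in memory (never more than maxBlocks). */
    int cachedBlocks() {
        return cache.size();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

With 512-byte blocks and maxBlocks set to, say, 1024, such a window would
hold about 512KB of block data regardless of file size, at the cost of extra
disk reads when access is not sequential.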

As a quicker fix, it'd be sufficient to have a way to ascertain whether a given
Office file is Excel, Word, or PPT. The way we do this now is: once we know
from the magic bytes that it's an Office doc, we try to read the 'application
name' within the POI fs:

  public boolean isRecognized(DocumentPayload payload) {
    String application = null;

    try {
      application = getApplicationName(payload.getContentStream(), payload.getDocId());
    } catch (Exception ex) {
      log.warn(TextExtractionError.ERROR, ex,
          "NON-FATAL error (proceeding with text extraction). "
              + "Failed to determine application for document. Payload: %s.",
          payload);
    }

    if (application == null) {
      return false;
    }
    String lower = application.toLowerCase();
    return lower.contains(EXCEL) && lower.contains(MICROSOFT);
  }

Where

  protected String getApplicationName(InputStream is, String docId) throws IOException {
    String application = null;

    try {
      POIFSFileSystem filesystem = new POIFSFileSystem(is);

      // First, try to extract the application name from the metadata
      SummaryInformation si = null;
      PropertySet ps2 = getPropertySet(filesystem, SummaryInformation.DEFAULT_STREAM_NAME, docId);
      if (ps2 instanceof SummaryInformation) {
        si = (SummaryInformation) ps2;
      }
      application = (si == null) ? null : StringUtils.trim(si.getApplicationName());

      // Unfortunately, the app name may not be present in the document metadata.
      // If that is the case, see if the file system has an entry by which we can tell
      // that the document matches the type.
      if (StringUtils.isEmpty(application) && hasDistinguishedEntry(filesystem)) {
        application = getDefaultApplicationName();
      }
    } finally {
      is.close();
    }

    return application;
  }

And 'hasDistinguishedEntry' is as follows, e.g. for Excel:

  protected boolean hasDistinguishedEntry(POIFSFileSystem filesystem) {
    // Try the Workbook entry, then its upper case form, then Book
    for (String name : new String[] {"Workbook", "WORKBOOK", "Book"}) {
      try {
        filesystem.getRoot().getEntry(name);
        return true;
      } catch (FileNotFoundException fe) {
        // Not present under this name; try the next one
      }
    }

    return false;
  }

If we can avoid doing all this, then the OutOfMemory issue becomes less
significant. Otherwise we need a way to curtail the memory consumption on the
blocklist side and still be able to have access to properties and entries.

Any advice/recommendations?

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 50428] Need a way to avoid OutOfMemoryError's in RawDataBlockList

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50428

Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #3 from Yegor Kozlov <ye...@dinom.ru> 2011-06-20 16:53:23 UTC ---
Try NIO reading using NPOIFSFileSystem, see
http://poi.apache.org/poifs/how-to.html

It should be more efficient in terms of memory consumption.

Yegor



DO NOT REPLY [Bug 50428] Need a way to avoid OutOfMemoryError's in RawDataBlockList

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50428

--- Comment #2 from Nick Burch <ni...@alfresco.com> 2010-12-07 19:19:52 EST ---
For now you'll just have to bump up the heap size.

There have been discussions on the dev list over the years about ways to reduce
the memory footprint of POIFS. However, as yet no-one has been willing to
sponsor the work for it.

If all you want is the names of the streams in the file, then you might be able
to cheat a bit to get them. It'd mean some NIO work, and taking advantage of
the FAT entries being special, so you ought to be able to find them via the
header without touching the main data parts. It'd still take some work though.
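
A rough sketch of that header-based approach, using only the JDK, might look
like the following. It is not POI code: Ole2EntryNames and list() are
hypothetical names. The offsets used (sector shift at byte 30, first directory
sector at byte 48, the 109 in-header FAT sector slots from byte 76, 128-byte
directory entries) are taken from the published compound file layout as I
understand it. As a simplification, the sketch only follows FAT sectors listed
in the header itself and lists entries at every level of the directory, not
just the root's children.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of POI: lists OLE2 directory entry names. */
final class Ole2EntryNames {
    // Signature bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long
    private static final long MAGIC = 0xE11AB1A1E011CFD0L;
    private static final int HEADER_FAT_SLOTS = 109;

    static List<String> list(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer header = read(ch, 0, 512);
            if (header.getLong(0) != MAGIC) {
                throw new IOException("Not an OLE2 file: " + file);
            }
            int sectorSize = 1 << header.getShort(30);   // usually 512 or 4096
            int fatEntriesPerSector = sectorSize / 4;
            int sector = header.getInt(48);              // first directory sector
            long maxSectors = ch.size() / sectorSize;    // guards against cyclic chains

            List<String> names = new ArrayList<>();
            // Special markers such as end-of-chain are negative when read as int.
            for (long seen = 0; sector >= 0 && seen < maxSectors; seen++) {
                ByteBuffer dir = read(ch, (sector + 1L) * sectorSize, sectorSize);
                for (int off = 0; off + 128 <= sectorSize; off += 128) {
                    int type = dir.get(off + 66);               // 0 = unused slot
                    int nameBytes = dir.getShort(off + 64) - 2; // drop trailing NUL
                    if (type != 0 && nameBytes > 0 && nameBytes <= 62) {
                        byte[] raw = new byte[nameBytes];
                        dir.position(off);
                        dir.get(raw);
                        names.add(new String(raw, StandardCharsets.UTF_16LE));
                    }
                }
                // Find the next directory sector by reading a single FAT entry.
                int fatIndex = sector / fatEntriesPerSector;
                if (fatIndex >= HEADER_FAT_SLOTS) {
                    break; // FAT sector not listed in the header; sketch stops here
                }
                int fatSector = header.getInt(76 + 4 * fatIndex);
                if (fatSector < 0) {
                    break;
                }
                long entryPos = (fatSector + 1L) * sectorSize
                        + 4L * (sector % fatEntriesPerSector);
                sector = read(ch, entryPos, 4).getInt(0);
            }
            return names;
        }
    }

    /** Reads exactly {@code len} bytes at {@code pos} into a little-endian buffer. */
    private static ByteBuffer read(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.hasRemaining()) {
            if (ch.read(buf, pos + buf.position()) < 0) {
                throw new IOException("Unexpected end of file at offset " + pos);
            }
        }
        buf.flip();
        return buf;
    }
}
```

A caller could then check whether the returned list contains "Workbook" or
"Book" to guess at Excel without building the full block list. Memory use is
one sector buffer at a time plus the names themselves.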



DO NOT REPLY [Bug 50428] Need a way to avoid OutOfMemoryError's in RawDataBlockList

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50428

Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|blocker                     |major



DO NOT REPLY [Bug 50428] Need a way to avoid OutOfMemoryError's in RawDataBlockList

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=50428

Dmitry Goldenberg <dg...@attivio.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unspecified                 |3.2-FINAL
           Severity|critical                    |blocker

--- Comment #1 from Dmitry Goldenberg <dg...@attivio.com> 2010-12-07 18:04:54 EST ---
It's blocking a customer patch here. Would greatly appreciate your help!
