Posted to dev@poi.apache.org by bu...@apache.org on 2011/12/20 13:08:45 UTC

DO NOT REPLY [Bug 52372] New: OutOfMemoryError parsing a word file

https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

             Bug #: 52372
           Summary: OutOfMemoryError parsing a word file
           Product: POI
           Version: 3.8-dev
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P2
         Component: HPFS
        AssignedTo: dev@poi.apache.org
        ReportedBy: jerome.lacoste@gmail.com
    Classification: Unclassified


Created attachment 28090
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=28090
An anonymised Doc file reproducing the problem

Calling Parser#parseToString on the attached file produces an OOME.

This is because Tika doesn't validate the size it tries to allocate. Had it
been C code, this could have been a buffer overflow...

Not sure whether the file is corrupted or not; it opens fine in Word on both
Mac and Windows. Saving the file in either of these editors makes the problem
disappear, so we manually edited the content of the file to anonymise it while
keeping it as close as possible to the original. We are able to create similar
problems by flipping bits in files.

java.lang.OutOfMemoryError: Java heap space
    at org.apache.poi.hpsf.Section.<init>(Section.java:207)
    at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
    at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:73)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:64)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    at org.apache.tika.Tika.parseToString(Tika.java:414)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.parseContent(TikaIndexerHardenerTest.java:100)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.indexContent(TikaIndexerHardenerTest.java:91)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.originalFileIndexesProperly2(TikaIndexerHardenerTest.java:34)

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

--- Comment #5 from Nick Burch <ni...@alfresco.com> 2011-12-20 23:35:08 UTC ---
Bah, looks like there was a typo in the component name (dating back quite a
number of years....), should now be fixed. In general, each of the components
has help which describes the subject area + package it covers, which should
help with identifying



DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

Nick Burch <ni...@alfresco.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Nick Burch <ni...@alfresco.com> 2011-12-20 13:24:36 UTC ---
Can you confirm if this issue still occurs with POI 3.8 beta 5 (just released)
or not?



DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

Nick Burch <ni...@alfresco.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #3 from Nick Burch <ni...@alfresco.com> 2011-12-20 13:37:54 UTC ---
Note that this bug doesn't look to be Word-specific: the exception is coming
from the common HPSF properties code, rather than from HWPF.



DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

Nick Burch <ni...@alfresco.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #6 from Nick Burch <ni...@alfresco.com> 2011-12-23 03:24:35 UTC ---
The issue is that we're reading a value that should contain the number of
properties in the section, then trying to create an array to hold that many
properties (before reading into them, so it couldn't be a buffer overflow even
in C!)

What we're not doing is sanity checking the number of properties, so if the
file has been corrupted and that value is very large, we trust it at that point
and try to allocate a big array. (Later on we'd throw a different exception on
discovering the value was corrupt and specified more properties than there's
data for)

We could probably do some checks on the size, and also move the array
initialisation to after the first pass too

Are you able to check the Microsoft Documentation to see what the limit on the
number of properties in a section is? (That'd be an easy sanity check to do
first)
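
A rough sketch of the "don't allocate up front" idea, for illustration only.
This is not the actual Section code: the class and method names are made up,
and the layout (a 32-bit little-endian count followed by 8-byte id/offset
pairs) is a simplified assumption. The list grows only as entries are really
read, so a bogus count fails when the data runs out instead of on allocation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class LazyPropertyList {
    /**
     * Reads (id, offset) pairs one at a time. Hypothetical helper:
     * assumes the count sits at 'start' and the pairs follow it directly.
     */
    static List<long[]> readEntries(byte[] section, int start) {
        ByteBuffer buf = ByteBuffer.wrap(section).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(start);
        long declared = buf.getInt() & 0xFFFFFFFFL;   // treat the count as unsigned
        List<long[]> entries = new ArrayList<>();     // not pre-sized from the untrusted count
        for (long i = 0; i < declared; i++) {
            if (buf.remaining() < 8) {
                // The count promised more entries than the data can hold.
                throw new IllegalArgumentException(
                    "Declared " + declared + " properties, data ends after " + i);
            }
            long id = buf.getInt() & 0xFFFFFFFFL;
            long offset = buf.getInt() & 0xFFFFFFFFL;
            entries.add(new long[] {id, offset});
        }
        return entries;
    }
}
```

With this shape, memory use is bounded by the bytes actually present in the
section rather than by whatever number the file claims.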



DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

--- Comment #2 from Jerome Lacoste <je...@gmail.com> 2011-12-20 13:34:44 UTC ---
Yes, it still fails with 3.8-beta5.



DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

--- Comment #7 from Jerome Lacoste <je...@gmail.com> 2011-12-23 09:01:27 UTC ---
> What we're not doing is sanity checking the number of properties
> so if the file has been corrupted

Just a question: are we sure the file is corrupted? Word opens it properly on
both Windows and Mac.

Also, the place where the code tries to read the property size contains the
text "Hewlett-Packard".

> Are you able to check the Microsoft Documentation to see what the limit on the
> number of properties in a section is? (That'd be an easy sanity check to do
> first) http://msdn.microsoft.com/en-us/library/dd949336%28v=office.12%29.aspx

I wasn't able to find a maximum number of properties.

From the .Doc structure format:
http://msdn.microsoft.com/en-us/library/cc313153%28v=office.12%29.aspx

Example of a section
http://msdn.microsoft.com/en-us/library/dd907622%28v=office.12%29.aspx

Property storage 
http://msdn.microsoft.com/en-us/library/dd949336%28v=office.12%29.aspx

But we may be able to use a different limit. We know the document/buffer
length. Surely there can be at most (buffer length) / (min property length)
properties.
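
Something along these lines, as a sketch only. The names are hypothetical, and
the 8-byte minimum is an assumption (one 4-byte id plus one 4-byte offset per
property in the section's property list); the real minimum may well be larger.

```java
import java.io.IOException;

public class PropertyCountCheck {
    // Assumed lower bound on the bytes one property occupies (id + offset).
    static final int MIN_PROPERTY_LENGTH = 8;

    /**
     * Returns the declared count as an int if it could possibly fit into
     * the bytes available, otherwise rejects it before anything is allocated.
     */
    static int checkedPropertyCount(long declaredCount, int bytesAvailable)
            throws IOException {
        long max = bytesAvailable / MIN_PROPERTY_LENGTH;
        if (declaredCount < 0 || declaredCount > max) {
            throw new IOException("Section declares " + declaredCount
                + " properties, but " + bytesAvailable
                + " bytes can hold at most " + max);
        }
        return (int) declaredCount;   // safe: declaredCount <= max fits in an int
    }
}
```

The check would run right after the count is read and before the array is
created, so a garbage value turns into an ordinary exception rather than an
OOME.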



DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372

--- Comment #4 from Jerome Lacoste <je...@gmail.com> 2011-12-20 15:01:12 UTC ---
I filed it under HPFS, as that was the module with the closest name to HPSF...
Picking the right module was a bit confusing!
