Posted to dev@poi.apache.org by bu...@apache.org on 2011/12/20 13:08:45 UTC
DO NOT REPLY [Bug 52372] New: OutOfMemoryError parsing a word file
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
Bug #: 52372
Summary: OutOfMemoryError parsing a word file
Product: POI
Version: 3.8-dev
Platform: All
OS/Version: All
Status: NEW
Severity: critical
Priority: P2
Component: HPFS
AssignedTo: dev@poi.apache.org
ReportedBy: jerome.lacoste@gmail.com
Classification: Unclassified
Created attachment 28090
--> https://issues.apache.org/bugzilla/attachment.cgi?id=28090
An anonymised Doc file reproducing the problem
Calling Parser#parseToString on the attached file produces an OOME.
This is because Tika doesn't validate the size it tries to allocate. Had it
been C code, this could have been a buffer overflow...
Not sure if the file is corrupted or not; it opens fine in Word on both Mac and
Windows. Saving the file in one of these editors causes the problem to
disappear, so we've manually edited the content of the file to anonymise it yet
keep it as close as possible to the original. We're able to create similar
problems by flipping bits in files.
java.lang.OutOfMemoryError: Java heap space
at org.apache.poi.hpsf.Section.<init>(Section.java:207)
at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:73)
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:64)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:177)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:380)
at org.apache.tika.Tika.parseToString(Tika.java:414)
at no.finntech.tika.harderner.TikaIndexerHardenerTest.parseContent(TikaIndexerHardenerTest.java:100)
at no.finntech.tika.harderner.TikaIndexerHardenerTest.indexContent(TikaIndexerHardenerTest.java:91)
at no.finntech.tika.harderner.TikaIndexerHardenerTest.originalFileIndexesProperly2(TikaIndexerHardenerTest.java:34)
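The bit-flipping experiment mentioned in the report could be sketched roughly as follows. This is only an illustration of the idea, not the reporter's actual test code; the class and method names are made up for the example.

```java
import java.util.Random;

final class BitFlipSketch {

    /**
     * Returns a copy of the input in which randomly chosen bits have been
     * inverted. The same bit may be picked twice, so at most `flips` bits
     * differ from the original. The seed makes a failing case reproducible.
     */
    static byte[] flipRandomBits(byte[] original, int flips, long seed) {
        byte[] mutated = original.clone();
        if (mutated.length == 0) {
            return mutated; // nothing to flip
        }
        Random rnd = new Random(seed);
        for (int i = 0; i < flips; i++) {
            int pos = rnd.nextInt(mutated.length); // which byte
            int bit = rnd.nextInt(8);              // which bit in that byte
            mutated[pos] ^= (byte) (1 << bit);
        }
        return mutated;
    }
}
```

Feeding each mutated copy to the parser and recording the seed whenever it throws an Error (rather than a checked exception) would be one way to collect more files like the attached one.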
--
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
--- Comment #5 from Nick Burch <ni...@alfresco.com> 2011-12-20 23:35:08 UTC ---
Bah, looks like there was a typo in the component name (dating back quite a
number of years....), should now be fixed. In general, each of the components
has help which describes the subject area + package it covers, which should
help with identifying
DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
Nick Burch <ni...@alfresco.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |NEEDINFO
--- Comment #1 from Nick Burch <ni...@alfresco.com> 2011-12-20 13:24:36 UTC ---
Can you confirm if this issue still occurs with POI 3.8 beta 5 (just released)
or not?
DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
Nick Burch <ni...@alfresco.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEEDINFO |NEW
--- Comment #3 from Nick Burch <ni...@alfresco.com> 2011-12-20 13:37:54 UTC ---
Note that this bug doesn't look to be Word-specific; the exception is coming
from the common HPSF properties code, rather than HWPF.
DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
Nick Burch <ni...@alfresco.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |NEEDINFO
--- Comment #6 from Nick Burch <ni...@alfresco.com> 2011-12-23 03:24:35 UTC ---
The issue is that we're reading a value that should contain the number of
properties in the section, then trying to create an array to hold that many
properties (before reading into them, so it couldn't be a buffer overflow even
in C!)
What we're not doing is sanity checking the number of properties, so if the
file has been corrupted and that value is very large, we trust it at that point
and try to allocate a big array. (Later on we'd throw a different exception on
discovering the value was corrupt and specified more properties than there's
data for)
We could probably do some checks on the size, and also move the array
initialisation to after the first pass.
Are you able to check the Microsoft Documentation to see what the limit on the
number of properties in a section is? (That'd be an easy sanity check to do
first)
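A minimal sketch of the "allocate after reading" idea, to make the suggestion concrete. It is not the POI code: the class name is hypothetical, and the layout it assumes (a 4-byte size, a 4-byte property count, then 8-byte identifier/offset pairs, all little-endian) is an assumption made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

final class DeferredAllocationSketch {

    /** Assumed size of one (identifier, offset) pair. */
    static final int ENTRY_BYTES = 8;

    /**
     * Reads the (identifier, offset) pairs of a section without trusting the
     * declared count for allocation: the list grows only as entries are
     * actually read, so a corrupt count fails fast instead of exhausting the heap.
     */
    static List<long[]> readEntries(byte[] section) {
        ByteBuffer buf = ByteBuffer.wrap(section).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.remaining() < 8) {
            throw new IllegalArgumentException("section header is truncated");
        }
        buf.getInt(); // declared section size, not needed for this sketch
        long declared = buf.getInt() & 0xFFFFFFFFL; // treat the count as unsigned

        List<long[]> entries = new ArrayList<>();
        for (long i = 0; i < declared; i++) {
            if (buf.remaining() < ENTRY_BYTES) {
                throw new IllegalArgumentException(
                        "section declares " + declared + " properties but data ends after " + i);
            }
            long id = buf.getInt() & 0xFFFFFFFFL;
            long offset = buf.getInt() & 0xFFFFFFFFL;
            entries.add(new long[] {id, offset});
        }
        return entries;
    }
}
```

With this shape, memory use is bounded by the size of the input rather than by whatever number happens to sit in the count field.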
DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
--- Comment #2 from Jerome Lacoste <je...@gmail.com> 2011-12-20 13:34:44 UTC ---
Yes, it still fails with 3.8-beta5.
DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
--- Comment #7 from Jerome Lacoste <je...@gmail.com> 2011-12-23 09:01:27 UTC ---
> What we're not doing is sanity checking the number of properties
> so if the file has been corrupted
Just a question: are we sure the file is corrupted? Word opens it properly
on both Windows and Mac.
Also the place where the code tries to read the property size contains some
text "Hewlett-Packard"
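To illustrate why text at that position would lead to a huge allocation, here is a small sketch that interprets four ASCII characters as an unsigned little-endian 32-bit integer. Which four bytes of the string actually line up with the count field in the attached file is not known, so "Hewl" is only an assumed example, and the class name is made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class TextAsCountSketch {

    /** Interprets exactly four ASCII characters as an unsigned little-endian int. */
    static long asUnsignedLittleEndianInt(String fourChars) {
        byte[] bytes = fourChars.getBytes(StandardCharsets.US_ASCII);
        if (bytes.length != 4) {
            throw new IllegalArgumentException("need exactly 4 ASCII characters");
        }
        int raw = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return raw & 0xFFFFFFFFL; // widen so the value is never negative
    }
}
```

For the input "Hewl" this gives 0x6C776548, roughly 1.8 billion, which as an array length is far more than a typical heap can hold.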
> Are you able to check the Microsoft Documentation to see what the limit on the
> number of properties in a section is? (That'd be an easy sanity check to do
> first)
http://msdn.microsoft.com/en-us/library/dd949336%28v=office.12%29.aspx
I wasn't able to find a maximum number of properties.
From the .Doc structure format:
http://msdn.microsoft.com/en-us/library/cc313153%28v=office.12%29.aspx
Example of a section
http://msdn.microsoft.com/en-us/library/dd907622%28v=office.12%29.aspx
Property storage
http://msdn.microsoft.com/en-us/library/dd949336%28v=office.12%29.aspx
But we may be able to use a different limit. We know the document/buffer
length. Surely there are at most (buffer length) / (min property length).
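A rough sketch of that bound, under the assumption that every property needs at least one 8-byte (identifier, offset) pair in the section. The constant, class and method names are hypothetical and not taken from POI.

```java
final class PropertyCountBound {

    /** Assumed minimum number of bytes a single property occupies. */
    static final int MIN_PROPERTY_BYTES = 8;

    /** Largest property count that could possibly fit in the available bytes. */
    static long maxPlausibleCount(long bytesAvailable) {
        return Math.max(0L, bytesAvailable) / MIN_PROPERTY_BYTES;
    }

    /**
     * Returns the declared count as an int if it fits in the available data,
     * otherwise throws before any array is allocated.
     */
    static int checkedCount(long declaredCount, long bytesAvailable) {
        long max = maxPlausibleCount(bytesAvailable);
        if (declaredCount < 0 || declaredCount > max) {
            throw new IllegalArgumentException(
                    "declared property count " + declaredCount
                            + " exceeds the " + max + " that fit in " + bytesAvailable + " bytes");
        }
        return Math.toIntExact(declaredCount);
    }
}
```

The check needs no knowledge of a documented maximum; it only relies on the buffer length, which the parser already has.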
DO NOT REPLY [Bug 52372] OutOfMemoryError parsing a word file
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
--- Comment #4 from Jerome Lacoste <je...@gmail.com> 2011-12-20 15:01:12 UTC ---
I filed it under HPFS, that being the module with the closest name to HPSF...
Picking the right module was a bit confusing!