Posted to dev@poi.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/11/04 18:25:30 UTC

RE: zip exceptions in objects embedded in HSLF

And for a larger collection of zip exceptions in embedded HSLF, see TIKA-2164.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Friday, November 4, 2016 11:49 AM
To: POI Users List <us...@poi.apache.org>
Subject: zip exceptions in objects embedded in HSLF

POI Colleagues, 
  On TIKA-2157 and TIKA-2130, Seva Alekseyev attached files that trigger a ZipException on an object embedded within a ppt.  We've seen these in our regression corpus as well.  For now, we're swallowing these in Tika.  If anyone has a chance to look into those triggering files to figure out if the embedded files are truly corrupt or if this is something we can fix in POI, I'd appreciate it.  I investigated a bit with TIKA-2130's file, and it _looks_ to me like the zip stream is truly corrupt, but this area of the code base is not one of my strengths.
  Thank you.

     Cheers,

              Tim



RE: zip exceptions in objects embedded in HSLF

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Yegor.  Yes, I realized my initial catch was too narrow; I'll broaden it generally, especially for image/pict.  Thank you, again!



Re: zip exceptions in objects embedded in HSLF

Posted by Yegor Kozlov <ye...@dinom.ru>.
Hi Tim,

Research Forum 2013.3.ppt, attached to TIKA-2164, is not a valid PPT file. My
PowerPoint 2013 cannot open it and displays "The selected file does not
appear to be a valid Microsoft PowerPoint file." It seems that the OLE2
filesystem in this file is invalid. POI fails before parsing the PPT data;
the error happens when reading POIFS blocks:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Block 24081 not found
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:486)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:458)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:411)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:335)
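
A "Block N not found" failure of this kind is consistent with the allocation table pointing at a sector that lies beyond the end of the file. The following is only a sketch of that bounds arithmetic, assuming the usual OLE2 layout in which sector n starts at byte offset (n + 1) * sectorSize. The class and method names (SectorBounds, sectorOffset, sectorFits) are hypothetical and are not part of POI.

```java
final class SectorBounds {
    /** Byte offset of a sector, assuming one header-sized block precedes sector 0. */
    static long sectorOffset(long sectorIndex, int sectorSize) {
        return (sectorIndex + 1) * sectorSize;
    }

    /** True if the whole sector lies inside a file of the given length. */
    static boolean sectorFits(long fileLength, long sectorIndex, int sectorSize) {
        return sectorOffset(sectorIndex, sectorSize) + sectorSize <= fileLength;
    }

    public static void main(String[] args) {
        // Block 24081 with 512-byte sectors (the sector size is an assumption here).
        System.out.println(sectorOffset(24081, 512));
        System.out.println(sectorFits(1_000_000L, 24081, 512));
    }
}
```

Under these assumptions, a file shorter than about 12.3 MB could not contain block 24081, which would explain a failure before any PPT records are parsed.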

In all other cases the code fails when reading data from "image/pict"
objects. I see your commit to swallow "incorrect data check" error
messages. That message text cannot be relied on: I tried the files attached
to TIKA-2164, and the exception differs from file to file.
In the case of Jankovic final Retreat 2002.PPT, the error is
java.util.zip.ZipException: invalid literal/length code.
In the case of Lab Meeting.ppt, the error is java.io.EOFException: Unexpected
end of ZLIB input stream.
In the case of paperfigures.ppt, the error is java.util.zip.ZipException:
invalid distance too far back.
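
To illustrate why matching on one message is fragile, here is a small sketch using only java.util.zip (no POI or Tika code). It damages a valid zlib stream in two ways and reports which exception class the JDK throws. The class InflateProbe and its methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class InflateProbe {
    /** Compresses raw bytes into a zlib stream. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Inflates fully; returns "ok" or the simple name of the exception thrown. */
    static String probe(byte[] compressed) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // discard the inflated bytes
            }
            return "ok";
        } catch (IOException e) {
            // ZipException and EOFException are both IOExceptions
            return e.getClass().getSimpleName();
        }
    }

    /** Deterministic sample payload. */
    static byte[] sample() {
        byte[] raw = new byte[10_000];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) (i * 31 % 251);
        }
        return raw;
    }

    public static void main(String[] args) {
        byte[] good = deflate(sample());
        byte[] truncated = Arrays.copyOf(good, good.length / 2); // cut the stream short
        byte[] badChecksum = good.clone();
        badChecksum[badChecksum.length - 1] ^= 0x55;             // damage the trailing checksum
        System.out.println(probe(good));
        System.out.println(probe(truncated));
        System.out.println(probe(badChecksum));
    }
}
```

The truncated stream and the stream with the damaged checksum fail with different exception classes, so a filter keyed to a single message string would miss one of them.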

Apparently we don't handle all cases when reading PICT files. I'm not sure
how much effort it would take to fix, but for now can you swallow all errors
for the "image/pict" content type? It is a known troublemaker, and the best
you can do for now is to catch all of its exceptions.
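
A minimal sketch of that suggestion, assuming the caller knows the content type before reading the embedded object. PictGuard, ThrowingReader and read are hypothetical names for illustration, not Tika or POI API.

```java
import java.io.IOException;
import java.util.Optional;

final class PictGuard {
    static final String PICT = "image/pict";

    /** Hypothetical callback that reads one embedded object. */
    interface ThrowingReader<T> {
        T read() throws IOException;
    }

    /**
     * Runs the reader. For "image/pict", any IOException or RuntimeException
     * is swallowed and an empty result is returned; for every other content
     * type the exception propagates unchanged.
     */
    static <T> Optional<T> read(String contentType, ThrowingReader<T> reader) throws IOException {
        try {
            return Optional.ofNullable(reader.read());
        } catch (IOException | RuntimeException e) {
            if (PICT.equals(contentType)) {
                return Optional.empty(); // known troublemaker: skip this object
            }
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Optional<String> result = read(PICT, () -> {
            throw new java.util.zip.ZipException("invalid distance too far back");
        });
        System.out.println(result.isPresent());
    }
}
```

Catching by content type rather than by message covers the ZipException variants and the EOFException listed above without hiding failures in other embedded formats.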

Yegor

