Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2020/10/09 21:15:25 UTC

Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Nick,
  Do you think we should follow up on the Tika side?  Do we know if we can
handle this?

---------- Forwarded message ---------
From: Nick Burch <ap...@gagravarr.org>
Date: Fri, Oct 9, 2020 at 4:43 PM
Subject: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory
handle it?
To: <de...@poi.apache.org>


Hi All

Over on Stack Overflow <https://stackoverflow.com/q/64269294/685641>
there's a user who was extracting what they thought was an embedded XLSX file
from a PPT, but found it was an OLE2 wrapper with CompObj and Package
entries. The real XLSX was in the Package part. Passing the outer OLE2
stream to WorkbookFactory didn't work.

What do people think here? Should we have WorkbookFactory spot this case,
grab the OOXML out of the POIFS and try to load that? Update HSLF to
optionally extract the OOXML out of the OLE2? Record the gotcha in the
docs somewhere? Something else?
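
For anyone hitting this in the meantime, the "grab the OOXML out of the POIFS"
option could look roughly like the sketch below. It assumes POI 4.x or later
with both poi and poi-ooxml on the classpath, and that the wrapped workbook
sits in a root-level entry named "Package" as described above. The class and
method names (OlePackageUnwrapper, openWrappedWorkbook) are made up for
illustration and are not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Hypothetical helper, not a POI class: unwraps an XLSX stored in the
// "Package" entry of an OLE2 container and loads it as a Workbook.
public class OlePackageUnwrapper {

    // Entry name taken from the thread; assumed to live at the OLE2 root.
    static final String PACKAGE_ENTRY = "Package";

    public static Workbook openWrappedWorkbook(InputStream ole2Stream) throws IOException {
        // Reading from an InputStream loads the whole OLE2 container into memory.
        try (POIFSFileSystem fs = new POIFSFileSystem(ole2Stream)) {
            DirectoryNode root = fs.getRoot();
            if (!root.hasEntry(PACKAGE_ENTRY)) {
                throw new IOException("No '" + PACKAGE_ENTRY + "' entry in OLE2 container");
            }
            // Hand the inner OOXML bytes, not the outer OLE2 stream, to WorkbookFactory.
            try (InputStream pkg = root.createDocumentInputStream(PACKAGE_ENTRY)) {
                return WorkbookFactory.create(pkg);
            }
        }
    }
}
```

A caller would pass in the embedded object's data stream obtained from the
PPT and close the returned Workbook when done.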

Cheers
Nick


Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 13 Oct 2020, Tim Allison wrote:
> Ha, y, this file exercises those bits of code:
> https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPPT_oleWorkbook.ppt
>
> Nick, does this match the features of the SO question?

Yup, it does!

So it seems we can answer the SO question with "you need to upgrade, this
was already fixed" :)

Nick

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Posted by Tim Allison <ta...@apache.org>.
Ha, y, this file exercises those bits of code:
https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPPT_oleWorkbook.ppt

Nick, does this match the features of the SO question?

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Posted by Tim Allison <ta...@apache.org>.
Based on

https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java#L518

and

https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java#L159

I _think_ we're handling this...

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Posted by Tim Allison <ta...@apache.org>.
Thank you, Nick!

IIUC the XLSX raw bytes are in the Package entry of an OLE2 wrapper.  What
is the key for the OLE2 wrapper in the PPT?  Sorry for missing this...
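
As a quick way to tell which layer a given blob is, the leading bytes can be
checked: an OLE2 container starts with D0 CF 11 E0 A1 B1 1A E1, while an OOXML
file is a ZIP and normally starts with 50 4B 03 04. The sketch below uses only
the JDK. ContainerSniffer and its Kind enum are names invented for this
example (POI's own FileMagic class covers the same ground more thoroughly).

```java
// Hypothetical helper: classifies a blob by its leading magic bytes.
public class ContainerSniffer {

    public enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file header signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK\003\004"), used by OOXML packages.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    public static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    // True if data is at least as long as prefix and begins with it.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In the case discussed here, the outer embedded stream would presumably sniff as
OLE2 and the bytes of its Package entry as ZIP.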

Have you put your hands on an example that you could share privately?
Happy to look through our regression corpus if I know what exactly to look
for.

Thank you, again!

Cheers,

       Tim

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 9 Oct 2020, Tim Allison wrote:
> Do you think we should follow up on the Tika side?  Do we know if we can
> handle this?

I thought we did, but checking POIFSContainerDetector I can't actually see 
that case covered....

I think we (Tika) can handle it in a similar way to CompObj

> Over on Stack Overflow <https://stackoverflow.com/q/64269294/685641>
> there's a user who was getting what they thought was an embedded XLSX file
> out of a PPT, but finding it was an OLE2 wrapper with CompObj and Package
> entries. The real XLSX was in the Package part. Passing the outer OLE2
> stream to WorkbookFactory didn't work

The list of entries to search for is in the comments on the question. We
may actually have a similar file in our corpus we can use to test. I think
this wrapping is triggered when an OOXML file is embedded in a PPT by some
older versions of PowerPoint, as a compatibility wrapper.

Nick