Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2011/09/05 14:23:07 UTC

Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/

Hi,

On Mon, Sep 5, 2011 at 12:30 PM,  <ma...@apache.org> wrote:
> Embedded file extraction is broken for some OOXML files
> (bug introduced a few commits ago)

That was me in revision 1164578 for TIKA-704. :-(

> -            if (root.hasEntry("CONTENTS")) {
> -                stream = TikaInputStream.get(
> -                        fs.createDocumentInputStream("CONTENTS"));

This was my attempt at properly handling the embedded PDF in
TestWithPdf.docx. It was included in an OLE object with the PDF
document as its "CONTENTS" entry. I restored this functionality with
some more specific checks in revision 1165259, and the resulting code
should now work correctly with all the test documents we have.
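
To illustrate the idea of "more specific checks", here is a minimal sketch. It is not the code from revision 1165259. It models the root entries of the embedded container as a plain set of names instead of using POI, and both the class name and the marker entries ("\u0001Ole", "\u0001CompObj") are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not Tika's actual code: models the root entries of an
// embedded OLE2 container as a plain set of entry names.
class OleContentsSketch {

    // Assumed marker entries for an OLE-wrapped object; the thread does not
    // name the entries that revision 1165259 actually checks.
    private static final Set<String> ASSUMED_OLE_MARKERS =
            new HashSet<>(Arrays.asList("\u0001Ole", "\u0001CompObj"));

    // True only when "CONTENTS" is present together with every assumed marker,
    // so a bare "CONTENTS" entry is not treated as an embedded payload.
    static boolean hasOleWrappedContents(Set<String> rootEntryNames) {
        return rootEntryNames.contains("CONTENTS")
                && rootEntryNames.containsAll(ASSUMED_OLE_MARKERS);
    }
}
```

With a check of this shape, a container that has only "CONTENTS" falls through to the other detection rules instead of being unwrapped as an embedded document.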

Improvements welcome, as I'm no expert on POI or the Office file format.

BR,

Jukka Zitting

Posted by Nick Burch <ni...@alfresco.com>.

On Mon, 5 Sep 2011, Jukka Zitting wrote:
>> Hm, that is strange - the current version of
>> OfficeParser.POIFSDocumentType.detectType() thinks that the "CONTENTS"
>> part identifies the POI filesystem as an MS Works document. Maybe this
>> is not right.
>
> I think we have some MS Works test files that do contain the
> "CONTENTS" entry, though I'm not sure if that's the best possible
> heuristic for detecting MS Works documents.

I've checked a few sample ones, and they have both CONTENTS and SPELLING, 
so I tweaked the rule to look for both.
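
Roughly, the tightened rule amounts to the sketch below. It is a standalone illustration, not the actual detectType() code: the class and method names are hypothetical, and the root entries are modelled as a set of names instead of a POI directory.

```java
import java.util.Set;

// Hypothetical sketch of the tightened MS Works heuristic; not the real
// OfficeParser.POIFSDocumentType.detectType() implementation.
class WorksHeuristicSketch {

    // Require both entries, so "CONTENTS" alone no longer implies MS Works.
    static boolean looksLikeWorks(Set<String> rootEntryNames) {
        return rootEntryNames.contains("CONTENTS")
                && rootEntryNames.contains("SPELLING");
    }
}
```

Under this rule an OLE object that carries only a "CONTENTS" entry, like the one wrapping the embedded PDF, would not be classified as MS Works.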

> My fix in revision 1165259 also checks for the presence of explicit OLE 
> entries, which I believe should help prevent collisions with actual 
> embedded MS Works documents.

I think we might want a different type for OLE1 native and general OLE2, 
as currently the detector won't let us spot the difference between them?

Nick

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

2011/9/5 Maxim Valyanskiy <ma...@jet.msk.su>:
> On 05.09.2011, at 16:23, Jukka Zitting wrote:
>> This was my attempt at properly handling the embedded PDF in
>> TestWithPdf.docx. It was included in an OLE object with the PDF
>> document as its "CONTENTS" entry. I restored this functionality with
>> some more specific checks in revision 1165259, and the resulting code
>> should now work correctly with all the test documents we have.
>
> Hm, that is strange - the current version of OfficeParser.POIFSDocumentType.detectType()
> thinks that the "CONTENTS" part identifies the POI filesystem as an MS Works document.
> Maybe this is not right.

I think we have some MS Works test files that do contain the
"CONTENTS" entry, though I'm not sure if that's the best possible
heuristic for detecting MS Works documents. My fix in revision 1165259
also checks for the presence of explicit OLE entries, which I believe
should help prevent collisions with actual embedded MS Works
documents.

> Please add a unit test with that TestWithPdf.docx.

The file was uploaded without the "grant license" option (and I
couldn't create a similar document myself) so I unfortunately couldn't
add the test case along with my original commit. I asked for the
required license grant in TIKA-704 and will add the test case if
approved.

BR,

Jukka Zitting

Posted by Maxim Valyanskiy <ma...@jet.msk.su>.

Hello!

On 05.09.2011, at 16:23, Jukka Zitting wrote:

> That was me in revision 1164578 for TIKA-704. :-(
> 
>> -            if (root.hasEntry("CONTENTS")) {
>> -                stream = TikaInputStream.get(
>> -                        fs.createDocumentInputStream("CONTENTS"));
> 
> This was my attempt at properly handling the embedded PDF in
> TestWithPdf.docx. It was included in an OLE object with the PDF
> document as its "CONTENTS" entry. I restored this functionality with
> some more specific checks in revision 1165259, and the resulting code
> should now work correctly with all the test documents we have.

Hm, that is strange - the current version of OfficeParser.POIFSDocumentType.detectType() thinks that the "CONTENTS" part identifies the POI filesystem as an MS Works document. Maybe this is not right.

Please add a unit test with that TestWithPdf.docx.

best wishes, Max