Posted to user@tika.apache.org by Stephan Budach <st...@jvm.de> on 2019/07/25 13:22:32 UTC

Update Tika's Apple iWork parser?

Hello, 


I have just recently discovered Tika while playing around with fscrawler to help me index my file shares, and I came across a problem that I can't fix. Tika has been able to parse Apple iWork files for quite some time, but since Apple split the iWork suite into three separate apps, the media type has changed for each of the now separate file types.


As I have learned from looking at the code of the class IWorkPackageParser, it defines these media types for iWork files:



    /**
     * This parser handles all iWorks formats.
     */
    private final static Set<MediaType> supportedTypes =
         Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                MediaType.application("vnd.apple.iwork"),
                IWORKDocumentType.KEYNOTE.getType(),
                IWORKDocumentType.NUMBERS.getType(),
                IWORKDocumentType.PAGES.getType()
         )));


However, fscrawler sends this MediaType to Tika, which of course triggers no parser: application/vnd.apple.keynote 


Can the iWork parser be updated to handle Keynote files, or at least give it a try? Unfortunately, I am not a dev type, so I lack the skills to pull that off, but I'd be ready to try a new parser and provide feedback.


Regards, 
Stephan 
-- 

Krebs's 3 Basic Rules for Online Safety 
1st - "If you didn't go looking for it, don't install it!" 
2nd - "If you installed it, update it." 
3rd - "If you no longer need it, remove it." 
http://krebsonsecurity.com/2011/05/krebss-3-basic-rules-for-online-safety 


Stephan Budach 
Head of IT 
Jung von Matt AG 
Glashüttenstraße 79 
D-20357 Hamburg 


Tel: +49 40-4321-1353 
Fax: +49 40-4321-1114 
E-Mail: stephan.budach@jvm.de 
Internet: http://www.jvm.com 
WebEx: https://jvm.webex.com/meet/stephan.budach 

Executive Board: Dr. Peter Figge
Chairman of the Supervisory Board: Dr. Jochen Gutbrod
AG HH HRB 72893 


Re: Update Tika's Apple iWork parser?

Posted by Tim Allison <ta...@apache.org>.
> in the end we're mostly interested in the text

Ditto!  :D

The more help, the better.  Thank you!


Re: Update Tika's Apple iWork parser?

Posted by Stephan Budach <st...@jvm.de>.
Hi Tim,

Yeah, I have read, I think, all of those, the two Jira issues definitely. I also didn't expect this to be a no-brainer, and at least I do have all of those apps on my Mac, so I can share example files without any issue. Thanks for being willing to take a shot at it.

To start with one thing: Keynote has two flavours of files: bundles (all files stored separately in a folder that carries the app's extension, e.g. .key) and zip-compressed archives (a zip file, again with the extension .key for Keynote instead of .zip). Can the current iWork parser handle both? That wasn't clear to me when I looked at the code on GitHub. I do think, though, that if the iWork parser encounters a zip-compressed file, it will have to unzip it somewhere temporarily and then look into the structure (folders: Data/Index) to find the interesting pieces.
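A minimal sketch of the zip-compressed case, in plain Java with no Tika dependency. It assumes the layout described above (an Index folder holding .iwa files inside the archive); the class name, method names and sample entry names are hypothetical. It streams the zip entries rather than unpacking them to a temporary directory, which may be enough to find the interesting pieces:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; this is not part of Tika.
final class IwaEntryLister {

    /** Returns the names of all file entries under Index/ that end in .iwa. */
    static List<String> listIwaEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // Entries are read sequentially from the stream; nothing is written to disk.
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!entry.isDirectory() && name.startsWith("Index/") && name.endsWith(".iwa")) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    /** Builds a small in-memory zip with placeholder entry names (assumed layout). */
    static byte[] sampleArchive() throws IOException {
        String[] entryNames = {
            "Index/Document.iwa", "Index/Slide.iwa", "Data/image.png", "Metadata/Properties.plist"
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Prints the two Index/*.iwa entry names from the sample archive.
        System.out.println(listIwaEntries(new ByteArrayInputStream(sampleArchive())));
    }
}
```

The bundle flavour (a plain folder) is not covered here; it could presumably be walked with java.nio.file.Files.walk using the same name filter.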

I will take a look at the protobuf tool and feed it some of the iwa files; in the end we're mostly interested in the text that is on those slides, and at least I do know what's on the slides. ;)

Thanks and regards,
Stephan




Re: Update Tika's Apple iWork parser?

Posted by Tim Allison <ta...@apache.org>.
Hi Stephan,
  This is currently an omission/blindspot in Tika[1].  Regrettably,
the new iWorks files are, um, complex, and last I looked the schemas
for iWorks were enormous, and there were version conflicts in the
schemas across different versions of iWorks files.
  So, perhaps our best bet would be to follow something along the
lines of [2] on [3].
  You could help out by sharing example files.  I don't know that I'll
have any time soon to work on this, but, yes, this is a known issue.
Sorry.

             Best,

                   Tim

[1] https://issues.apache.org/jira/browse/TIKA-1358
[2] https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
[3] https://issues.apache.org/jira/browse/TIKA-2912
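To illustrate what decoding protobuf without a schema (as in [2]) involves, here is a rough sketch in plain Java that walks the top-level fields of a message using only the wire format, similar in spirit to `protoc --decode_raw`. The class and method names are hypothetical, and it assumes the input is already a plain protobuf message; as far as I know, .iwa payloads are compressed and would have to be unpacked first, which is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of schema-less protobuf decoding; not Tika code.
final class RawProtobufWalker {
    private final byte[] buf;
    private int pos;

    private RawProtobufWalker(byte[] buf) {
        this.buf = buf;
    }

    /** Describes each top-level field as "fieldNumber: kind value". */
    static List<String> describe(byte[] message) {
        return new RawProtobufWalker(message).walk();
    }

    /** Reads a base-128 varint: 7 payload bits per byte, high bit set means "more". */
    private long readVarint() {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos >= buf.length) {
                throw new IllegalArgumentException("truncated varint");
            }
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("varint too long");
    }

    private void skip(int n) {
        if (n > buf.length - pos) {
            throw new IllegalArgumentException("truncated fixed-width field");
        }
        pos += n;
    }

    private List<String> walk() {
        List<String> out = new ArrayList<>();
        while (pos < buf.length) {
            long tag = readVarint();
            long field = tag >>> 3;          // upper bits: field number
            int wireType = (int) (tag & 7);  // lower three bits: wire type
            switch (wireType) {
                case 0:
                    out.add(field + ": varint " + readVarint());
                    break;
                case 1:
                    skip(8);
                    out.add(field + ": 64-bit");
                    break;
                case 2: {
                    long len = readVarint();
                    if (len < 0 || len > buf.length - pos) {
                        throw new IllegalArgumentException("bad length " + len);
                    }
                    // Without a schema this may be a string, raw bytes or a nested
                    // message; the sketch simply shows it as UTF-8 text.
                    String text = new String(buf, pos, (int) len, StandardCharsets.UTF_8);
                    pos += (int) len;
                    out.add(field + ": bytes \"" + text + "\"");
                    break;
                }
                case 5:
                    skip(4);
                    out.add(field + ": 32-bit");
                    break;
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Field 1 = varint 150, field 2 = the string "testing".
        byte[] message = {0x08, (byte) 0x96, 0x01, 0x12, 0x07, 't', 'e', 's', 't', 'i', 'n', 'g'};
        System.out.println(describe(message));
    }
}
```

For text extraction, the length-delimited fields are the interesting ones: a real attempt would probably recurse into those that parse as nested messages and keep the ones that look like readable text.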
