Posted to user@tika.apache.org by Stephan Budach <st...@jvm.de> on 2019/07/25 13:22:32 UTC
Update Tika's Apple iWork parser?
Hello,
I have just recently discovered Tika, as I have been playing around with fscrawler to help me index my file shares, and I came across a problem that I can't fix. Tika has had the ability to parse Apple iWork files for quite some time, but since Apple split the iWork suite into three separate apps, the media type has changed for each of the now separate file types.
As I have learned from looking at the code of the class IWorkPackageParser, it defines these media types for iWork files:
    /**
     * This parser handles all iWorks formats.
     */
    private final static Set<MediaType> supportedTypes =
        Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
            MediaType.application("vnd.apple.iwork"),
            IWORKDocumentType.KEYNOTE.getType(),
            IWORKDocumentType.NUMBERS.getType(),
            IWORKDocumentType.PAGES.getType()
        )));
However, fscrawler sends this MediaType to Tika, which of course triggers no parser: application/vnd.apple.keynote
Can the iWork parser be updated to handle Keynote files, or at least give it a try? Unfortunately, I am not a developer, so I lack the skills to pull that off, but I'd be ready to try a new parser and provide feedback.
Regards,
Stephan
--
Krebs's 3 Basic Rules for Online Safety
1st - "If you didn't go looking for it, don't install it!"
2nd - "If you installed it, update it."
3rd - "If you no longer need it, remove it."
http://krebsonsecurity.com/2011/05/krebss-3-basic-rules-for-online-safety
Stephan Budach
Head of IT
Jung von Matt AG
Glashüttenstraße 79
D-20357 Hamburg
Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.budach@jvm.de
Internet: http://www.jvm.com
WebEx: https://jvm.webex.com/meet/stephan.budach
Executive Board: Dr. Peter Figge
Chairman of the Supervisory Board: Dr. Jochen Gutbrod
AG HH HRB 72893
Re: Update Tika's Apple iWork parser?
Posted by Tim Allison <ta...@apache.org>.
> in the end we're mostly interested in the text
Ditto! :D
The more help, the better. Thank you!
On Thu, Jul 25, 2019 at 11:41 AM Stephan Budach <st...@jvm.de> wrote:
>
> Hi Tim,
>
> yeah, I have read, I think, all of those - the two Jira issues definitely. I also didn't expect this to be a no-brainer, and at least I do have all of those apps on my Mac, so I can share example files without any issue. Thanks for being willing to take a shot at it.
>
> To start with one thing… Keynote has two flavours of files: bundled ones (all files stored separately in a folder carrying the app's extension, e.g. .key) or a zip-compressed archive (a zip file, again with the extension .key for Keynote instead of .zip). Can the current iWork parser handle both? That wasn't clear to me when I looked at the code on GitHub. I do think, though, that if the iWork parser encounters a zip-compressed file, it will have to unzip it somewhere temporarily and then look into the structure (folders: Data/Index) to find the interesting pieces.
>
> I will take a look at the protobuf tool and feed it some of the iwa files… in the end we're mostly interested in the text that is on those slides, and at least I do know what's on the slides. ;)
>
> Thanks and regards,
> Stephan
Re: Update Tika's Apple iWork parser?
Posted by Stephan Budach <st...@jvm.de>.
Hi Tim,
yeah, I have read, I think, all of those - the two Jira issues definitely. I also didn't expect this to be a no-brainer, and at least I do have all of those apps on my Mac, so I can share example files without any issue. Thanks for being willing to take a shot at it.
To start with one thing… Keynote has two flavours of files: bundled ones (all files stored separately in a folder carrying the app's extension, e.g. .key) or a zip-compressed archive (a zip file, again with the extension .key for Keynote instead of .zip). Can the current iWork parser handle both? That wasn't clear to me when I looked at the code on GitHub. I do think, though, that if the iWork parser encounters a zip-compressed file, it will have to unzip it somewhere temporarily and then look into the structure (folders: Data/Index) to find the interesting pieces.
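As a side note on the zip flavour: Java can read archive entries in place, so unpacking to a temporary directory may not be strictly necessary. Below is a minimal sketch, assuming the layout described above (an Index folder holding .iwa files). The class name IwaEntryLister and the sample entry names are made up for illustration and are not part of Tika; the bundle (folder) flavour is not covered.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika.
final class IwaEntryLister {

    /** Returns the names of all non-directory entries under Index/ that end in .iwa. */
    static List<String> listIwaEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        // ZipInputStream streams the archive, so nothing is written to disk.
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: the interesting parts live under "Index/" as .iwa files.
                if (!entry.isDirectory() && name.startsWith("Index/") && name.endsWith(".iwa")) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    /** Builds a tiny in-memory zip that mimics the assumed layout (invented entry names). */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            String[] entryNames = {"Index/Document.iwa", "Index/Slide.iwa", "Data/image.png"};
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> names = listIwaEntries(new ByteArrayInputStream(sampleArchive()));
        names.forEach(System.out::println);
    }
}
```

For a real file, the same method could be fed a FileInputStream opened on the .key archive; the entry contents can be read from the ZipInputStream inside the loop instead of only collecting names.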
I will take a look at the protobuf tool and feed it some of the iwa files… in the end we're mostly interested in the text that is on those slides, and at least I do know what's on the slides. ;)
Thanks and regards,
Stephan
----- Original Message -----
> From: "Tim Allison" <ta...@apache.org>
> To: user@tika.apache.org
> Sent: Thursday, 25 July 2019 17:07:21
> Subject: Re: Update Tika's Apple iWork parser?
>
> Hi Stephan,
> This is currently an omission/blind spot in Tika [1]. Regrettably,
> the new iWork files are, um, complex, and last I looked the schemas
> for iWork were enormous, and there were version conflicts in the
> schemas across different versions of iWork files.
> So, perhaps our best bet would be to follow something along the
> lines of [2] on [3].
> You could help out by sharing example files. I don't know that I'll
> have any time soon to work on this, but, yes, this is a known issue.
> Sorry.
>
> Best,
>
> Tim
>
> [1] https://issues.apache.org/jira/browse/TIKA-1358
> [2] https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
> [3] https://issues.apache.org/jira/browse/TIKA-2912
Jung von Matt invests in the creatives of tomorrow: JvM-Academy.
http://jvm-academy.org
Re: Update Tika's Apple iWork parser?
Posted by Tim Allison <ta...@apache.org>.
Hi Stephan,
This is currently an omission/blind spot in Tika [1]. Regrettably,
the new iWork files are, um, complex, and last I looked the schemas
for iWork were enormous, and there were version conflicts in the
schemas across different versions of iWork files.
So, perhaps our best bet would be to follow something along the
lines of [2] on [3].
You could help out by sharing example files. I don't know that I'll
have any time soon to work on this, but, yes, this is a known issue.
Sorry.
Best,
Tim
[1] https://issues.apache.org/jira/browse/TIKA-1358
[2] https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
[3] https://issues.apache.org/jira/browse/TIKA-2912
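The schema-less approach from [2] can be illustrated with a small wire-format walker. This is only a sketch under assumptions: the class name RawProtobufText is hypothetical, the input is assumed to be a plain, uncompressed protobuf message (any container framing that .iwa files use would have to be removed first and is not handled here), and the "strictly valid, printable UTF-8 means text" rule is a heuristic that can misclassify short nested messages. It is similar in spirit to protoc --decode_raw, not a replacement for a real parser.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or the protobuf library.
final class RawProtobufText {

    /** Collects text-like length-delimited fields; throws IllegalArgumentException on malformed input. */
    static List<String> extractStrings(byte[] message) {
        List<String> out = new ArrayList<>();
        walk(ByteBuffer.wrap(message), out);
        return out;
    }

    private static void walk(ByteBuffer in, List<String> out) {
        while (in.hasRemaining()) {
            long key = readVarint(in);
            int wireType = (int) (key & 7);          // low three bits of the key
            if ((key >>> 3) == 0) throw new IllegalArgumentException("field number 0");
            switch (wireType) {
                case 0: readVarint(in); break;       // varint value, skipped
                case 1: skip(in, 8); break;          // fixed 64-bit value, skipped
                case 5: skip(in, 4); break;          // fixed 32-bit value, skipped
                case 2: {                            // length-delimited: string, bytes, or sub-message
                    long len = readVarint(in);
                    if (len < 0 || len > in.remaining()) throw new IllegalArgumentException("bad length");
                    byte[] payload = new byte[(int) len];
                    in.get(payload);
                    handlePayload(payload, out);
                    break;
                }
                default: throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
    }

    private static void handlePayload(byte[] payload, List<String> out) {
        String text = asPrintableText(payload);
        if (text != null) { out.add(text); return; } // heuristic: prefer the text interpretation
        List<String> nested = new ArrayList<>();
        try {
            walk(ByteBuffer.wrap(payload), nested);  // otherwise try it as a nested message
            out.addAll(nested);
        } catch (IllegalArgumentException opaqueBytes) {
            // Neither text nor a well-formed message: ignore.
        }
    }

    private static String asPrintableText(byte[] bytes) {
        if (bytes.length == 0) return null;
        try {
            String s = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes)).toString();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isISOControl(c) && c != '\n' && c != '\r' && c != '\t') return null;
            }
            return s;
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    private static long readVarint(ByteBuffer in) {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (!in.hasRemaining()) throw new IllegalArgumentException("truncated varint");
            byte b = in.get();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
        }
        throw new IllegalArgumentException("varint too long");
    }

    private static void skip(ByteBuffer in, int n) {
        if (in.remaining() < n) throw new IllegalArgumentException("truncated fixed-width field");
        in.position(in.position() + n);
    }

    /** Hand-built sample: field 1 = "Hello", field 2 = 150, field 3 = nested { field 1 = "Slide title" }. */
    static byte[] sampleMessage() {
        return new byte[] {0x0A, 0x05, 'H', 'e', 'l', 'l', 'o', 0x10, (byte) 0x96, 0x01,
                0x1A, 0x0D, 0x0A, 0x0B, 'S', 'l', 'i', 'd', 'e', ' ', 't', 'i', 't', 'l', 'e'};
    }

    public static void main(String[] args) {
        extractStrings(sampleMessage()).forEach(System.out::println);
    }
}
```

Run on the hand-built sample, this prints the two strings "Hello" and "Slide title". Real files would likely need tuning of the text heuristic, which is where shared example files with known slide text would help.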
On Thu, Jul 25, 2019 at 9:22 AM Stephan Budach <st...@jvm.de> wrote:
>
> Hello,
>
> I have just recently discovered Tika, as I have been playing around with fscrawler to help me index my file shares, and I came across a problem that I can't fix. Tika has had the ability to parse Apple iWork files for quite some time, but since Apple split the iWork suite into three separate apps, the media type has changed for each of the now separate file types.
>
> As I have learned from looking at the code of the class IWorkPackageParser, it defines these media types for iWork files:
>
> /**
> * This parser handles all iWorks formats.
> */
> private final static Set<MediaType> supportedTypes =
> Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
> MediaType.application("vnd.apple.iwork"),
> IWORKDocumentType.KEYNOTE.getType(),
> IWORKDocumentType.NUMBERS.getType(),
> IWORKDocumentType.PAGES.getType()
> )));
>
> However, fscrawler sends this MediaType to Tika, which of course triggers no parser: application/vnd.apple.keynote
>
> Can the iWork parser be updated to handle Keynote files, or at least give it a try? Unfortunately, I am not a developer, so I lack the skills to pull that off, but I'd be ready to try a new parser and provide feedback.
>
> Regards,
> Stephan
>