Posted to user@tika.apache.org by "McGreevy, Anthony" <an...@microfocus.com> on 2018/03/25 07:57:23 UTC
Subfile Extraction
Hey,
I am currently playing with Tika to see how it works with regards to extraction of subfiles.
The requirement I have is to have Tika take in a parent document, a .docx or .eml for example, and extract out the text content, metadata and all subfiles so that I can save them to disk.
So far I have worked out the metadata and content extraction but I haven't been able to find any tutorials on the subfile extraction.
If you could point me at resources I could use to work this out or examples of sample code doing this already it would be much appreciated.
Thanks,
Anthony
RE: Subfile Extraction
Posted by "Allison, Timothy B." <ta...@mitre.org>.
+1 to Nick's links and advice.
To use the RecursiveParserWrapper with tika-app, use the -J option; or if you're using tika-server, use the /rmeta endpoint.
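For the tika-server route, a minimal Java sketch of calling the /rmeta endpoint might look like the following. It uses only the JDK's built-in HTTP client (Java 11+). The class name RmetaClient, the helper buildRequest, and the base URL http://localhost:9998 (tika-server's usual default port) are assumptions for illustration, so adjust them to your setup. The response should be a JSON array with one entry per document, the parent first and then each embedded document, carrying metadata and extracted text rather than the raw bytes of the subfiles.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tika itself.
class RmetaClient {

    // Builds a PUT request that uploads the file to <base>/rmeta and asks for JSON.
    static HttpRequest buildRequest(URI base, Path file) throws IOException {
        return HttpRequest.newBuilder(base.resolve("/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java RmetaClient.java <file> [server-base-url]");
            return;
        }
        // Assumed default: a tika-server instance running locally on port 9998.
        URI base = URI.create(args.length > 1 ? args[1] : "http://localhost:9998");
        HttpRequest request = buildRequest(base, Path.of(args[0]));

        // Send the document and print the JSON returned by the server.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

With a server already running, something like "java RmetaClient.java parent.docx" would print the recursive metadata for the parent and its embedded documents.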
The ecology of embedded docs is rich and understudied (IMHO). Let us know what you find!
Cheers,
Tim
-----Original Message-----
From: McGreevy, Anthony [mailto:anthony.mcgreevy@microfocus.com]
Sent: Tuesday, March 27, 2018 11:47 AM
To: user@tika.apache.org
Subject: RE: Subfile Extraction
Thanks for the information!
Much appreciated!
Anthony
-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org]
Sent: 27 March 2018 15:50
To: user@tika.apache.org
Subject: Re: Subfile Extraction
On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
> I am currently playing with Tika to see how it works with regards to
> extraction of subfiles.
Do you mean files or resources embedded within another file?
If so... With the Tika App, you want -z to have these extracted. With the Tika java classes, you want to pop something like a https://tika.apache.org/1.17/api/org/apache/tika/parser/RecursiveParserWrapper.html
or a
https://tika.apache.org/1.17/api/org/apache/tika/extractor/ContainerExtractor.html
on your ParseContext to get called for embedded resources. See https://wiki.apache.org/tika/RecursiveMetadata for more on how it works and how to have Tika parse + return all the embedded files and resources
Nick
RE: Subfile Extraction
Posted by "McGreevy, Anthony" <an...@microfocus.com>.
Thanks for the information!
Much appreciated!
Anthony
Re: Subfile Extraction
Posted by Nick Burch <ap...@gagravarr.org>.
On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
> I am currently playing with Tika to see how it works with regards to
> extraction of subfiles.
Do you mean files or resources embedded within another file?
If so... With the Tika App, you want -z to have these extracted. With the
Tika java classes, you want to pop something like a
https://tika.apache.org/1.17/api/org/apache/tika/parser/RecursiveParserWrapper.html
or a
https://tika.apache.org/1.17/api/org/apache/tika/extractor/ContainerExtractor.html
on your ParseContext to get called for embedded resources. See
https://wiki.apache.org/tika/RecursiveMetadata for more on how it works
and how to have Tika parse + return all the embedded files and resources
Nick