Posted to user@tika.apache.org by "McGreevy, Anthony" <an...@microfocus.com> on 2018/03/25 07:57:23 UTC

Subfile Extraction

Hey,

I am currently playing with Tika to see how it works with regards to extraction of subfiles.

The requirement I have is to have Tika take in a parent document, a .docx or .eml for example, and extract out the text content, metadata and all subfiles so that I can save them to disk.

So far I have worked out the metadata and content extraction but I haven't been able to find any tutorials on the subfile extraction.

If you could point me at resources I could use to work this out or examples of sample code doing this already it would be much appreciated.

Thanks,

Anthony

RE: Subfile Extraction

Posted by "Allison, Timothy B." <ta...@mitre.org>.

+1 to Nick's links and advice.

To use the RecursiveParserWrapper with tika-app, use the -J option; or if you're using tika-server, use the /rmeta endpoint.
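For the tika-server route, the /rmeta call might be sketched in Java roughly as below. This is only an illustration under assumptions: Java 11+ (java.net.http), a tika-server already running on localhost at its default port 9998, and a file path passed on the command line. The class and method names (RmetaSketch, buildRequest) are made up for the example. The endpoint answers with a JSON array holding one metadata object per document, the parent first and then each embedded document, with the extracted text under the "X-TIKA:content" key.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RmetaSketch {

    // Build a PUT request that uploads the raw file bytes to <serverBase>/rmeta.
    // serverBase is an assumption, e.g. "http://localhost:9998".
    static HttpRequest buildRequest(String serverBase, byte[] fileBytes) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            System.err.println("usage: RmetaSketch <file> [serverBase]");
            return;
        }
        String serverBase = args.length > 1 ? args[1] : "http://localhost:9998";
        byte[] fileBytes = Files.readAllBytes(Paths.get(args[0]));

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(serverBase, fileBytes), HttpResponse.BodyHandlers.ofString());

        // Print the JSON array as-is; parsing it is left to whatever JSON library you use.
        System.out.println(response.body());
    }
}
```

Note that /rmeta (like -J in tika-app) returns the text and metadata of the embedded documents, not their raw bytes, so on its own it does not cover the "save the subfiles to disk" part.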

The ecology of embedded docs is rich and understudied (IMHO); let us know what you find!

Cheers,

                  Tim

-----Original Message-----
From: McGreevy, Anthony [mailto:anthony.mcgreevy@microfocus.com] 
Sent: Tuesday, March 27, 2018 11:47 AM
To: user@tika.apache.org
Subject: RE: Subfile Extraction

Thanks for the information!

Much appreciated!

Anthony


RE: Subfile Extraction

Posted by "McGreevy, Anthony" <an...@microfocus.com>.

Thanks for the information!

Much appreciated!

Anthony

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: 27 March 2018 15:50
To: user@tika.apache.org
Subject: Re: Subfile Extraction

On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
> I am currently playing with Tika to see how it works with regards to 
> extraction of subfiles.

Do you mean files or resources embedded within another file?

If so... With the Tika App, you want -z to have these extracted. With the Tika Java classes, you want to pop something like a https://tika.apache.org/1.17/api/org/apache/tika/parser/RecursiveParserWrapper.html
or a
https://tika.apache.org/1.17/api/org/apache/tika/extractor/ContainerExtractor.html
on your ParseContext to get called for embedded resources. See https://wiki.apache.org/tika/RecursiveMetadata for more on how it works and how to have Tika parse + return all the embedded files and resources

Nick

Re: Subfile Extraction

Posted by Nick Burch <ap...@gagravarr.org>.

On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
> I am currently playing with Tika to see how it works with regards to 
> extraction of subfiles.

Do you mean files or resources embedded within another file?

If so... With the Tika App, you want -z to have these extracted. With the 
Tika Java classes, you want to pop something like a 
https://tika.apache.org/1.17/api/org/apache/tika/parser/RecursiveParserWrapper.html
or a
https://tika.apache.org/1.17/api/org/apache/tika/extractor/ContainerExtractor.html
on your ParseContext to get called for embedded resources. See
https://wiki.apache.org/tika/RecursiveMetadata for more on how it works 
and how to have Tika parse + return all the embedded files and resources.
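
As a rough illustration of the Tika App route only (not the Java-classes route), the -z extraction could be driven from Java by shelling out to tika-app. This is a sketch under assumptions: `java` is on the PATH, and the jar location, input file and output directory are placeholders supplied on the command line. The class and method names (TikaAppExtractSketch, buildCommand) are invented for the example. The --extract-dir option tells -z where to write the extracted files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class TikaAppExtractSketch {

    // Assemble: java -jar <tika-app jar> -z --extract-dir=<outDir> <input>
    static List<String> buildCommand(Path tikaAppJar, Path input, Path outDir) {
        return List.of(
                "java", "-jar", tikaAppJar.toString(),
                "-z",                       // extract embedded files / attachments
                "--extract-dir=" + outDir,  // target directory for -z
                input.toString());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 3) {
            System.err.println("usage: TikaAppExtractSketch <tika-app.jar> <input> <outDir>");
            return;
        }
        Path outDir = Paths.get(args[2]);
        Files.createDirectories(outDir);

        // Run tika-app as a child process and pass its output straight through.
        Process process = new ProcessBuilder(
                buildCommand(Paths.get(args[0]), Paths.get(args[1]), outDir))
                .inheritIO()
                .start();
        System.exit(process.waitFor());
    }
}
```

Running it against a parent document such as a .docx or .eml should leave the embedded files in the chosen output directory; the text and metadata of the parent still need a separate call.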

Nick