Posted to user@tika.apache.org by Shiv Kenche <sh...@servient.com> on 2013/11/29 11:39:32 UTC

Need to extract text only for parent file.

Hi,

I have a parent doc file with many attachments (children) embedded in it. I
need to extract the text content of the parent doc file, but not the text of
its children.

I used the AutoDetectParser.parse(inputStream, BodyContentHandler,
metadata, ParseContext) method to extract the text of the parent file, but
the output also includes the text of its children, which I do not want.

Has anyone done this before? If so, could you please share a code
snippet?

Regards,
Shiv

Re: Need to extract text only for parent file.

Posted by Shiv Kenche <sh...@servient.com>.
Thanks Nick.

I had set an AutoDetectParser on the ParseContext, and that was causing the
text of embedded objects to be extracted recursively. Once I removed it, I
got the text of just the parent file.

Regards,
Shiv


On Fri, Nov 29, 2013 at 4:16 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Fri, 29 Nov 2013, Shiv Kenche wrote:
>
>> I have a parent doc file with many attachments (children) embedded in it.
>> I need to extract the text content of the parent doc file, but not the
>> text of its children.
>>
>
> Tika does not recurse into embedded documents by default. To enable
> recursion, you need to set a Parser object onto the ParseContext, to be
> used to handle the child objects. Without one, Tika will process the outer
> (parent) document only.
>
> Nick
>

Re: Need to extract text only for parent file.

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 29 Nov 2013, Shiv Kenche wrote:
> I have a parent doc file with many attachments (children) embedded in it.
> I need to extract the text content of the parent doc file, but not the
> text of its children.

Tika does not recurse into embedded documents by default. To enable 
recursion, you need to set a Parser object onto the ParseContext, to be 
used to handle the child objects. Without one, Tika will process the outer 
(parent) document only.

Nick
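
For anyone looking for the snippet requested at the top of the thread, here
is a minimal sketch of what Nick describes, not a definitive implementation.
It assumes tika-core and tika-parsers are on the classpath. The class name
ParentOnlyExtractor and the method name extractParentText are made up for
illustration and are not part of Tika. The one point that matters is that no
Parser is registered on the ParseContext, so nothing handles the embedded
documents. This reflects the behaviour described in this 2013 thread; newer
Tika releases may treat embedded documents differently, so check the
documentation for your version.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is an assumption for illustration.
public class ParentOnlyExtractor {

    // Returns the body text of the outer (parent) document only.
    public static String extractParentText(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default limit on the number of characters collected.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        // Deliberately left empty: no Parser is set on the context, so
        // embedded documents (attachments) are not parsed recursively.
        ParseContext context = new ParseContext();

        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}
```

To get the recursive behaviour Shiv originally saw, add
context.set(Parser.class, parser) before the parse call (with an import of
org.apache.tika.parser.Parser). Removing that line is the fix he reports.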