Posted to user@tika.apache.org by Shiv Kenche <sh...@servient.com> on 2013/11/29 11:39:32 UTC
Need to extract text only for parent file.
Hi,
I have a parent doc file with many attachments (children) embedded in it. I
need to extract the text content of the parent doc file, but I do not need
the text of its children.
I used the AutoDetectParser.parse(inputStream, BodyContentHandler,
metadata, ParseContext) method to extract the text of the parent file, but
the extracted text includes the text of its children too, which I do not want.
Has anyone done this before? If so, could you please share a code
snippet?
Regards,
Shiv
Re: Need to extract text only for parent file.
Posted by Shiv Kenche <sh...@servient.com>.
Thanks Nick.
I had set an AutoDetectParser on the ParseContext, and that was causing the
text of embedded objects to be extracted recursively. Once I removed it, I
got the text of just the parent file.
Regards,
Shiv
On Fri, Nov 29, 2013 at 4:16 PM, Nick Burch <ap...@gagravarr.org> wrote:
> On Fri, 29 Nov 2013, Shiv Kenche wrote:
>
>> I have a Parent doc file with many attachments(children) into it. I need
>> to extract text content of Parent doc file but do not need text extract of
>> its children.
>>
>
> Tika does not recurse into embedded documents by default. To enable
> recursion, you need to set a Parser object onto the ParseContext, to be
> used to handle the child objects. Without one, Tika will process the outer
> (parent) document only
>
> Nick
>
Re: Need to extract text only for parent file.
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 29 Nov 2013, Shiv Kenche wrote:
> I have a Parent doc file with many attachments(children) into it. I need
> to extract text content of Parent doc file but do not need text extract
> of its children.
Tika does not recurse into embedded documents by default. To enable
recursion, you need to set a Parser object onto the ParseContext, to be
used to handle the child objects. Without one, Tika will process the outer
(parent) document only.
Nick
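
For reference, the behaviour discussed in this thread can be sketched roughly as below. This is a minimal sketch, not code posted by anyone in the thread: the class name ParentOnlyExtractor and the helpers buildContext and extract are hypothetical, and it assumes a Tika release that behaves as Nick describes (embedded documents are handled only when a Parser is registered on the ParseContext). Other releases may differ, so it is worth checking against the version in use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
public class ParentOnlyExtractor {

    // Builds the ParseContext. Registering a Parser under Parser.class is
    // what the thread identifies as the switch that turns recursion on.
    public static ParseContext buildContext(Parser parser, boolean recurse) {
        ParseContext context = new ParseContext();
        if (recurse) {
            context.set(Parser.class, parser);
        }
        return context;
    }

    // Extracts text from the stream; with recurse == false no Parser is
    // registered, so (per the thread) only the outer document is processed.
    public static String extract(InputStream stream, boolean recurse)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables BodyContentHandler's default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata, buildContext(parser, recurse));
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the parent document.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extract(in, false));
        }
    }
}
```

Passing true instead of false to extract registers the AutoDetectParser on the context, which corresponds to the setup Shiv reported as pulling in the children's text.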