Posted to user@tika.apache.org by Alec Swan <al...@gmail.com> on 2012/04/25 18:08:00 UTC

How to stream extracted text?

Hello,

We are replacing another text extraction library with Tika. We have
legacy code which expects document text to be output as an
InputStream. I understand that this is not directly related to Tika,
but I am assuming that other Tika users already solved this problem.

Does anybody have any sample code or ideas that will help us pipe
chars in ContentHandler#characters(..) method to a stream? Is there an
existing ContentHandler implementation that does this already?

Thanks,

Alec

Re: How to stream extracted text?

Posted by Alec Swan <al...@gmail.com>.
Thanks Wade, but "handler.toString()" is not going to work for us
because of our memory restrictions: it holds the entire extracted text
in memory at once.
We ended up using BodyContentHandler(PipedOutputStream) and connected
that output stream to a PipedInputStream, which effectively gave us
what we needed.
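
For anyone finding this thread later, here is a minimal sketch of the
piped-stream pattern using only the JDK, so it runs without Tika on the
classpath. The names PipedTextDemo, TextProducer and stream are made up
for illustration and are not part of Tika. With Tika, the producer
would presumably be something like
out -> parser.parse(input, new BodyContentHandler(out), metadata, context);
which character encoding the handler writes with is worth checking for
your Tika version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

public class PipedTextDemo {

    /** Hypothetical stand-in for "whatever writes the extracted text". */
    public interface TextProducer {
        void writeTo(OutputStream out) throws Exception;
    }

    /**
     * Runs the producer on a background thread and returns the read end
     * of the pipe. Only the pipe's internal buffer is held in memory.
     */
    public static InputStream stream(final TextProducer producer) throws IOException {
        final PipedInputStream in = new PipedInputStream();
        final PipedOutputStream out = new PipedOutputStream(in);

        // Writer and reader must be on different threads; using both ends
        // of a pipe from a single thread can deadlock once the buffer fills.
        Thread worker = new Thread(() -> {
            // Closing the write end is what signals end-of-stream to the reader.
            try (OutputStream o = out) {
                producer.writeTo(o);
            } catch (Exception e) {
                // Sketch only: real code should pass this failure on to the reader.
                e.printStackTrace();
            }
        });
        worker.setDaemon(true);
        worker.start();
        return in;
    }

    public static void main(String[] args) throws IOException {
        // The lambda plays the role of the parser writing extracted text.
        InputStream text = stream(
                out -> out.write("extracted text".getBytes(StandardCharsets.UTF_8)));

        ByteArrayOutputStream copy = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = text.read(buf)) != -1) {
            copy.write(buf, 0, n);
        }
        System.out.println(new String(copy.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The legacy code can consume the returned InputStream as usual. The one
thing to be careful about is error handling: if the parse fails on the
worker thread, the reader only sees the stream end early, so a real
implementation would want to record the exception and rethrow it on the
reading side.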

Thanks!

Alec

On Thu, Apr 26, 2012 at 6:14 AM, Taylor, Wade <wt...@ptfs.com> wrote:
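> Have you tried using BodyContentHandler? For example:
>
> ...
> ContentHandler handler = new BodyContentHandler();
> parser.parse(inputStream, handler, metadata, context);
> // ByteArrayInputStream takes a byte[], so encode the String first
> // (StandardCharsets is in java.nio.charset)
> InputStream charStream = new ByteArrayInputStream(
>         handler.toString().getBytes(StandardCharsets.UTF_8));
> ...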
>
>
>
> Regards,
> Wade
>
>
>
>
>

Re: How to stream extracted text?

Posted by "Taylor, Wade" <wt...@ptfs.com>.
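Have you tried using BodyContentHandler? For example:

...
ContentHandler handler = new BodyContentHandler();
parser.parse(inputStream, handler, metadata, context);
// ByteArrayInputStream takes a byte[], so encode the String first
// (StandardCharsets is in java.nio.charset)
InputStream charStream = new ByteArrayInputStream(
        handler.toString().getBytes(StandardCharsets.UTF_8));
...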



Regards,
Wade



