Posted to dev@tika.apache.org by "K, Baraneetharan" <ba...@hp.com> on 2012/06/05 09:18:18 UTC

partial file parsing

Hi Tika-dev community,

I'm new to Tika. We are using AutoDetectParser (from Tika 0.9) to parse files and send the parsed contents to Solr. We are facing severe performance issues when some large .xlsx, .docx and .pptx files are parsed. We have therefore decided to parse files partially, for example the first 10 paragraphs of a document, the first 1000 words, or the first 2 MB of content.

Please let me know whether there is any way to tell Tika to parse only part of a file.

Regards,
Baranee

Re: TikaInputStream customization

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jun 6, 2012 at 2:15 PM, Baranee <ba...@hp.com> wrote:
> Can you please tell me how to use the beforeRead() method in TikaInputStream to
> set a read limit for reading bytes from a stream.

http://people.apache.org/~hossman/#xyproblem

Why do you want to use TikaInputStream like this?

BR,

Jukka Zitting

Re: TikaInputStream customization

Posted by Baranee <ba...@hp.com>.
Thanks, Jukka, for your reply.

Can you please tell me how to use the beforeRead() method in TikaInputStream to
set a read limit for reading bytes from a stream?

Baranee


Re: TikaInputStream customization

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jun 6, 2012 at 12:30 PM, K, Baraneetharan
<ba...@hp.com> wrote:
> Can anyone please let me know how to customize TikaInputStream to read only the first
> 1000 bytes from a given InputStream.

You can use the BoundedInputStream [1] class from Commons IO:

    TikaInputStream.get(new BoundedInputStream(stream, 1000));

However, see the concern in TIKA-307 [2]. Passing a truncated stream
to Tika may produce unexpected results.

[1] http://commons.apache.org/io/api-release/org/apache/commons/io/input/BoundedInputStream.html
[2] https://issues.apache.org/jira/browse/TIKA-307
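
A minimal sketch of what this kind of byte capping does, written against the JDK only so that it runs without extra dependencies. CappedInputStream and CappedStreamDemo are hypothetical names used for illustration; they are not Tika or Commons IO classes, and in practice the BoundedInputStream mentioned above plays this role.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a bounded stream: reports end-of-stream after maxBytes. */
class CappedInputStream extends FilterInputStream {
    private long remaining; // bytes the caller is still allowed to consume

    CappedInputStream(InputStream in, long maxBytes) {
        super(in);
        this.remaining = maxBytes;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // cap reached: pretend the stream has ended
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        // Never ask the underlying stream for more than the remaining budget.
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would bypass the byte budget
    }
}

class CappedStreamDemo {
    /** Drains the stream and returns how many bytes it delivered. */
    static int countBytes(InputStream in) throws IOException {
        int total = 0;
        byte[] buf = new byte[256];
        for (int n; (n = in.read(buf)) != -1; ) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000]; // placeholder for a large document
        try (InputStream in = new CappedInputStream(new ByteArrayInputStream(data), 1000)) {
            System.out.println(countBytes(in)); // prints 1000
        }
    }
}
```

Note that .docx, .xlsx and .pptx files are ZIP packages whose central directory sits at the end of the file, so a stream cut off this way may fail to parse at all rather than yield the first part of the text.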

BR,

Jukka Zitting

TikaInputStream customization

Posted by "K, Baraneetharan" <ba...@hp.com>.
Can anyone please let me know how to customize TikaInputStream to read only the first 1000 bytes from a given InputStream?

Regards,
Baranee