Posted to dev@tika.apache.org by "K, Baraneetharan" <ba...@hp.com> on 2012/06/05 09:18:18 UTC
partial file parsing
Hi Tika-dev community,
I'm new to Tika. We are using AutoDetectParser (from Tika 0.9) to parse files and send the parsed contents to Solr. We are facing severe performance issues when some large .xlsx, .docx and .pptx files are parsed. We have therefore decided to parse files partially, for example the first 10 paragraphs of a document, the first 1000 words, or the first 2 MB of content.
Please let me know whether there is any way to tell Tika to parse only part of a file.
Regards,
Baranee
Re: TikaInputStream customization
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Wed, Jun 6, 2012 at 2:15 PM, Baranee <ba...@hp.com> wrote:
> Can you please tell me how to use the beforeRead() method in TikaInputStream to
> set a read limit for reading bytes from a stream?
http://people.apache.org/~hossman/#xyproblem
Why do you want to use TikaInputStream like this?
BR,
Jukka Zitting
Re: TikaInputStream customization
Posted by Baranee <ba...@hp.com>.
Thanks, Jukka, for your reply.
Can you please tell me how to use the beforeRead() method in TikaInputStream to
set a read limit for reading bytes from a stream?
Baranee
Re: TikaInputStream customization
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Wed, Jun 6, 2012 at 12:30 PM, K, Baraneetharan
<ba...@hp.com> wrote:
> Can anyone please let me know how to customize TikaInputStream to read only the first
> 1000 bytes from a given InputStream?
You can use the BoundedInputStream [1] class from Commons IO:
    TikaInputStream.get(new BoundedInputStream(stream, 1000));
However, see the concern in TIKA-307 [2]. Passing a truncated stream
to Tika may produce unexpected results.
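To make it clearer what such a wrapper does, here is a minimal sketch of a byte-limited stream written with java.io only. LimitedInputStream and its readAll helper are hypothetical names used for illustration; this is not the Commons IO implementation, only an approximation of the same idea: report end-of-stream once the limit has been reached.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Commons IO's BoundedInputStream, written with
// java.io only so the sketch compiles without extra dependencies.
class LimitedInputStream extends FilterInputStream {
    private long remaining; // bytes the caller is still allowed to consume

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // report end-of-stream once the limit is reached
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        // Never ask the wrapped stream for more than the remaining budget.
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skipped bytes count against the limit as well.
        long skipped = super.skip(Math.max(0, Math.min(n, remaining)));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(super.available(), remaining);
    }

    // mark/reset are disabled so that the byte budget cannot be rewound.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    // Helper for experiments: drains a stream into a byte array.
    static byte[] readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Such a wrapper can be passed to TikaInputStream.get(...) in the same way as the BoundedInputStream line above. Note that the limit applies to raw bytes of the file, not to extracted text, so for zip-based formats such as .docx the parser may see an incomplete container.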
[1] http://commons.apache.org/io/api-release/org/apache/commons/io/input/BoundedInputStream.html
[2] https://issues.apache.org/jira/browse/TIKA-307
BR,
Jukka Zitting
TikaInputStream customization
Posted by "K, Baraneetharan" <ba...@hp.com>.
Can anyone please let me know how to customize TikaInputStream to read only the first 1000 bytes from a given InputStream?
Regards,
Baranee