Posted to dev@tika.apache.org by kbennett <kb...@bbsinc.biz> on 2007/09/14 00:45:29 UTC

Chunk Support in Tika?

All -

We have a use case where we need to support documents that may be too large
to fit in memory.  With other kinds of data sources we deal with this by
splitting the input into chunks, so that only one or two chunks need to be
in memory at any given time.

I believe Tika doesn't support that yet, right?  The Parser abstract class
looks like it hands you the entire document's text in a single call.

What are the plans, if any, to support chunking?  I could get involved in
that if you like.  I realize that the Parser abstract class would just adapt
the chunking done by the underlying Parser implementations (e.g. POI) to our
unified API, rather than doing the chunking itself.  I suppose that in the
Parser abstract class and its implementations we'd have to add support for
the following (a rough interface sketch follows the list):

* querying chunking capabilities (most fundamentally, can it chunk at all?)
* configuring the chunking mode (e.g. on/off, chunk size)
* reading chunks
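
To make that concrete, something like the sketch below is what I have in
mind.  Every name in it is a placeholder of my own invention, not an
existing Tika or POI API:

  import java.io.IOException;

  // Rough sketch only -- all names here are hypothetical placeholders.
  public interface ChunkingParser {

      // Querying chunking capabilities: can this parser chunk at all?
      boolean supportsChunking();

      // Configuring the chunking mode, e.g. turning it on or off and
      // setting an approximate chunk size in characters.
      void setChunkingEnabled(boolean enabled);
      void setChunkSize(int maxChars);

      // Reading chunks: returns the next chunk of extracted text,
      // or null once the whole document has been read.
      String nextChunk() throws IOException;
  }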

Thanks,
Keith



Re: Chunk Support in Tika?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 9/14/07, kbennett <kb...@bbsinc.biz> wrote:
> What are the plans, if any, to support chunking?

See previous emails on this list about possible Parser interface
designs. I believe the consensus is to go with a streaming solution
that parses an InputStream into a sequence of SAX events so that the
entire input document is never kept fully in memory.
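
To illustrate the idea (just an illustration, not the final interface,
which is still under discussion): the parser would push the extracted text
to a SAX ContentHandler as it reads the InputStream, so the caller can
process the text incrementally, for example:

  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.DefaultHandler;

  // Illustration only: whatever the final Parser interface looks like,
  // it would feed text to a handler like this piece by piece, so the
  // caller never has to buffer the whole document.
  public class CountingHandler extends DefaultHandler {

      private long characterCount = 0;

      @Override
      public void characters(char[] ch, int start, int length)
              throws SAXException {
          // Handle this piece of text right away (here we just count
          // the characters) and then forget it.
          characterCount += length;
      }

      public long getCharacterCount() {
          return characterCount;
      }
  }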

BR,

Jukka Zitting