Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/02/05 03:02:02 UTC

ContentHandler's OutputStream

Let me preface my remarks by saying that I'm mystified about how to
use ContentHandler to do anything complicated.

It seems like the semantics of getting the content out of a
ContentHandler are wrong, or at least shortsighted.  The user has two
options for using the text provided by ContentHandler.  The user
can provide an OutputStream, to which ContentHandler will write() the
bytes as it reads the InputStream associated with the file, or
the user can have ContentHandler buffer the entire parsed contents of
the file in memory and then get back a humongous String via
ContentHandler.toString().

There needs to be a better way.

Writing the bytes to an OutputStream pretty much locks the bytes up so
that the only thing you can do is write them to some sort of device,
whether it's the console, disk, or a network connection.  Buffering
the entire file is simply not an option for very large files.  For
very large files, you need to process chunks of the file, like from a
stream, or better yet, a series of callbacks with a relatively small
buffer (say, even a few megs).  (This is how SAX does it.)  By using a
callback system, the user is free to do whatever he/she wants to do
with each chunk.  If he/she wants to blast it to the disk, a simple
OutputStream.write(buf) is good enough.  If he/she wants to do some
more parsing of the text (like I want to do), then he/she can do that
as well without reading the entire file into memory.

Here's my scenario that prompted this email:

I'm reading a bunch of files of a variety of types.  Some of these
files can be quite large.  Like gigabytes.  I'm using AutoDetectParser
to handle the appropriate parsing and BodyContentHandler to extract
the plaintext.  I want to take the extracted plaintext, do some
analysis on it, and then index the plaintext along with the results of
my analysis.  Specifically, my analysis requires taking the extracted
plaintext, segmenting it into sentences, and doing part-of-speech
tagging and morphological analysis (i.e. stemming) via an external
process.  This means I can't use an OutputStream, since you can't read
from an OutputStream, so I'm stuck with using
ContentHandler.toString(), which can (and does) exhaust memory for
large files.

What I really want is someone to tell me how to get back a usable
stream of plaintext.  Whether this involves a radical change to Tika's
ContentHandler class or some trick with Java, I really don't care, as
long as it's single thread safe.  (Java's PipedInputStream and
PipedOutputStream are not single thread safe.)

I know I can't be the only one who has had or will have this problem.
It really seems like this use case needs to be handled, because the
use case that Tika currently seems to be designed for is "Write
plaintext to the disk."

Thanks.

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/



Re: ContentHandler's OutputStream

Posted by Jonathan Koren <jo...@soe.ucsc.edu>.
On Feb 5, 2009, at 1:22 AM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Feb 5, 2009 at 3:02 AM, Jonathan Koren  
> <jo...@soe.ucsc.edu> wrote:
>> What I really want is someone to tell me how to get back a usable  
>> stream of
>> plaintext, whether this involves a radical change to Tika's  
>> ContentHandler
>> class or some trick with Java, I really don't care, as long as it's  
>> single
>> thread safe.
>
> Have you looked at the ParsingReader class? It seems like a perfect
> match to your needs. The ParsingReader class fires a background thread
> to do the parsing and pipes the output so you can control when and how
> you want to read the extracted text.

I had no idea that class existed.  Thanks.

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/



Re: ContentHandler's OutputStream

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Feb 5, 2009 at 3:02 AM, Jonathan Koren <jo...@soe.ucsc.edu> wrote:
> What I really want is someone to tell me how to get back a usable stream of
> plaintext, whether this involves a radical change to Tika's ContentHandler
> class or some trick with Java, I really don't care, as long as it's single
> thread safe.

Have you looked at the ParsingReader class? It seems like a perfect
match to your needs. The ParsingReader class fires a background thread
to do the parsing and pipes the output so you can control when and how
you want to read the extracted text.
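
A minimal sketch of the consuming side, assuming only that
ParsingReader is a java.io.Reader.  The class ReaderChunks and its
consume() method are hypothetical names made up for illustration, and a
StringReader stands in for the ParsingReader so that the sketch runs
without Tika on the classpath:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper: pulls text from any Reader in fixed-size chunks,
// so memory use is bounded by bufSize rather than by the document size.
class ReaderChunks {

    // Hands each chunk to onChunk and returns the number of chars read.
    static long consume(Reader reader, int bufSize, Consumer<String> onChunk) {
        char[] buf = new char[bufSize];
        long total = 0;
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                if (n > 0) {
                    onChunk.accept(new String(buf, 0, n));
                    total += n;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in source (assumption): with Tika available, a
        // ParsingReader wrapping the file's InputStream would go here.
        try (Reader reader = new StringReader("extracted plain text ...")) {
            long chars = consume(reader, 8, chunk -> System.out.print(chunk));
            System.out.println();
            System.out.println(chars + " chars read");
        }
    }
}
```

The caller decides when to read, so the extracted text can be fed to a
sentence splitter or an external process a chunk at a time.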

Alternatively, if the extra thread is not acceptable, you could
implement a custom ContentHandler that directly catches and processes
the characters() and ignorableWhitespace() events.
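
A rough sketch of such a handler, using only the SAX classes that ship
with the JDK.  SentenceHandler is a hypothetical name, and the sentence
boundary rule (split on '.', '!' or '?') is a deliberately naive
placeholder for a real segmenter:

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: passes text on one sentence at a time instead
// of buffering the whole document.
class SentenceHandler extends DefaultHandler {
    private final Consumer<String> sink;   // called once per sentence
    private final StringBuilder pending = new StringBuilder();

    SentenceHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            pending.append(c);
            // Naive boundary rule, for illustration only.  Text with no
            // terminator keeps accumulating, so a real handler would
            // also cap the size of the pending buffer.
            if (c == '.' || c == '!' || c == '?') {
                flush();
            }
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Collapse a run of ignorable whitespace into one separator.
        if (pending.length() > 0) {
            pending.append(' ');
        }
    }

    @Override
    public void endDocument() {
        flush();   // emit whatever is left after the last terminator
    }

    private void flush() {
        String sentence = pending.toString().trim();
        pending.setLength(0);
        if (!sentence.isEmpty()) {
            sink.accept(sentence);
        }
    }

    public static void main(String[] args) throws Exception {
        SentenceHandler handler = new SentenceHandler(System.out::println);
        char[] text = "First sentence. Second one".toCharArray();
        handler.startDocument();
        handler.characters(text, 0, text.length);
        handler.endDocument();
    }
}
```

Everything runs on the parsing thread, and memory use is bounded by the
longest sentence rather than by the size of the file.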

Or you could subclass Writer and treat the write() calls as callbacks
from the parser.
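
A small sketch of the Writer approach.  CallbackWriter is a
hypothetical name, and the sketch assumes the content handler in use
can be constructed around a Writer:

```java
import java.io.Writer;
import java.util.function.Consumer;

// Hypothetical Writer that hands every write() to a callback instead of
// storing or forwarding the characters.
class CallbackWriter extends Writer {
    private final Consumer<String> onChunk;

    CallbackWriter(Consumer<String> onChunk) {
        this.onChunk = onChunk;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        // The other write() overloads in java.io.Writer end up here.
        if (len > 0) {
            onChunk.accept(new String(cbuf, off, len));
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered, so there is nothing to flush.
    }

    @Override
    public void close() {
        // No underlying resource to release.
    }

    public static void main(String[] args) throws Exception {
        long[] total = {0};
        // Stand-in for the parser: the two write() calls play the role
        // of the text events a content handler would forward.
        try (Writer w = new CallbackWriter(chunk -> total[0] += chunk.length())) {
            w.write("some extracted text");
            w.write(" and more");
        }
        System.out.println(total[0] + " chars seen");
    }
}
```

The chunk boundaries are whatever the parser happens to write, so any
sentence splitting still has to carry state across callbacks.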

BR,

Jukka Zitting