Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/02/05 03:02:02 UTC
ContentHandler's OutputStream
Let me preface my remarks by saying that I'm mystified as to how to use
ContentHandler to do anything complicated.
It seems like the semantics of getting the content out of a
ContentHandler are wrong, or at least shortsighted. The user has two
options for using the text provided by ContentHandler. The user
can provide an OutputStream, to which ContentHandler will write() the
bytes as it reads the InputStream associated with the file, or
the user can have ContentHandler buffer the entire parsed contents of
the file in memory and then get back a humongous String via
ContentHandler.toString().
There needs to be a better way.
Writing the bytes to an OutputStream pretty much locks the bytes up so
that the only thing you can do is write them to some sort of device,
whether it's the console, disk, or a network connection. Buffering
the entire file is simply not an option for very large files. For
very large files, you need to process the file in chunks, as from a
stream, or better yet, through a series of callbacks with a relatively
small buffer (say even a few megs). (This is how SAX does it.) With a
callback system, the user is free to do whatever he/she wants with
each chunk. If he/she wants to blast it to the disk, a simple
OutputStream.write(buf) is good enough. If he/she wants to do some more
parsing of the text (like I want to do), then he/she can do that as well
without reading the entire file into memory.
Here's my scenario that prompted this email:
I'm reading a bunch of files of a variety of types. Some of these
files can be quite large, as in gigabytes. I'm using AutoDetectParser
to handle the appropriate parsing and BodyContentHandler to extract
the plaintext. I want to take the extracted plaintext, do some
analysis on it, and then index the plaintext along with the results of
my analysis. Specifically, my analysis requires taking the extracted
plaintext, segmenting it into sentences, and doing part-of-speech
tagging and morphological analysis (i.e. stemming) via an external
process. This means I can't use an OutputStream, since you can't read
from an OutputStream, so I'm stuck with using
ContentHandler.toString(), which can (and does) exhaust memory for
large files.
What I really want is someone to tell me how to get back a usable
stream of plaintext. Whether this involves a radical change to Tika's
ContentHandler class or some trick with Java, I really don't care, as
long as it's single-thread safe. (Java's PipedInputStream and
PipedOutputStream are not single-thread safe.)
I know I can't be the only one who has had or will have this problem.
It really seems like this use case needs to be handled, because the use
case that Tika currently seems to be designed for is "Write plaintext
to the disk."
Thanks.
--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/
Re: ContentHandler's OutputStream
Posted by Jonathan Koren <jo...@soe.ucsc.edu>.
On Feb 5, 2009, at 1:22 AM, Jukka Zitting wrote:
> Hi,
>
> On Thu, Feb 5, 2009 at 3:02 AM, Jonathan Koren
> <jo...@soe.ucsc.edu> wrote:
>> What I really want is someone to tell me how to get back a usable
>> stream of plaintext, whether this involves a radical change to
>> Tika's ContentHandler class or some trick with Java, I really don't
>> care, as long as it's single-thread safe.
>
> Have you looked at the ParsingReader class? It seems like a perfect
> match to your needs. The ParsingReader class fires a background thread
> to do the parsing and pipes the output so you can control when and how
> you want to read the extracted text.
I had no idea that class existed. Thanks.
--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/
Re: ContentHandler's OutputStream
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Thu, Feb 5, 2009 at 3:02 AM, Jonathan Koren <jo...@soe.ucsc.edu> wrote:
> What I really want is someone to tell me how to get back a usable stream of
> plaintext, whether this involves a radical change to Tika's ContentHandler
> class or some trick with Java, I really don't care, as long as it's
> single-thread safe.
Have you looked at the ParsingReader class? It seems like a perfect
match to your needs. The ParsingReader class fires a background thread
to do the parsing and pipes the output so you can control when and how
you want to read the extracted text.
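To illustrate the general "background thread plus pipe" pattern described above, here is a minimal sketch that uses only the JDK's PipedReader and PipedWriter. It is not ParsingReader's actual code; TextProducer and PipedText are hypothetical names standing in for "whatever runs the parser", and the 8192-character buffer size is an arbitrary assumption.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;

/** Hypothetical stand-in for a parser: anything that pushes text to a Writer. */
interface TextProducer {
    void produce(Writer out) throws IOException;
}

/** Hypothetical helper: exposes a producer's output as a pull-style Reader. */
final class PipedText {
    private PipedText() {}

    static Reader open(TextProducer producer) throws IOException {
        // Bounded pipe: the producer blocks when the consumer falls behind,
        // so memory use stays small regardless of document size.
        PipedReader reader = new PipedReader(8192);
        PipedWriter writer = new PipedWriter(reader);

        Thread worker = new Thread(() -> {
            // Closing the writer signals end-of-stream to the reading side.
            try (Writer out = writer) {
                producer.produce(out);
            } catch (IOException e) {
                // Sketch only: a fuller version would hand this exception
                // over to the thread that is reading.
            }
        }, "text-producer");
        worker.setDaemon(true);
        worker.start();
        return reader;
    }
}
```

The caller can wrap the returned Reader in a BufferedReader and consume it at its own pace. The cost is exactly the extra thread mentioned above.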
Alternatively, if the extra thread is not acceptable, you could
implement a custom ContentHandler that directly catches and processes
the characters() and ignorableWhitespace() events.
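A minimal sketch of such a handler, built on the standard org.xml.sax DefaultHandler, might look like the following. ChunkCallbackHandler and its Consumer<String> callback are hypothetical names chosen for illustration; nothing here is part of Tika itself.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards each chunk of character data to a callback. */
final class ChunkCallbackHandler extends DefaultHandler {
    private final Consumer<String> callback;

    ChunkCallbackHandler(Consumer<String> callback) {
        this.callback = callback;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The array belongs to the parser and may be reused after this call
        // returns, so copy the slice before handing it on.
        callback.accept(new String(ch, start, length));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Treated the same as ordinary text here; a real handler might
        // collapse or drop it instead.
        callback.accept(new String(ch, start, length));
    }
}
```

The handler would be passed to the parser's parse() call in place of the buffering handler, and everything runs on the calling thread. Note that chunk boundaries are chosen by the parser, so a sentence may be split across two callbacks.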
Or you could subclass Writer and treat the write() calls as callbacks
from the parser.
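The Writer variant can be sketched in much the same way. CallbackWriter is a hypothetical name; the sketch assumes the resulting Writer is handed to whichever handler would otherwise write to a file or buffer (for example, a BodyContentHandler constructed around a Writer).

```java
import java.io.Writer;
import java.util.function.Consumer;

/** Hypothetical Writer: every write() becomes a callback with that chunk. */
final class CallbackWriter extends Writer {
    private final Consumer<String> callback;

    CallbackWriter(Consumer<String> callback) {
        this.callback = callback;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        // All other Writer.write(...) overloads funnel into this method.
        if (len > 0) {
            callback.accept(new String(cbuf, off, len));
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered, so there is nothing to flush.
    }

    @Override
    public void close() {
        // No underlying resource to release.
    }
}
```

As with the handler approach, no second thread is involved: the callback runs inside the parser's own call stack, one chunk at a time.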
BR,
Jukka Zitting