Posted to dev@tika.apache.org by kbennett <kb...@bbsinc.biz> on 2007/09/20 19:21:30 UTC

Opening and Closing Document Input Streams

Given the fact that input documents can be specified by URL, it would seem
logical to me that a caller would pass Tika a URL, get the parsed content,
and not want to have to manage any streams required to process the document. 
In other words, Tika would open and close the stream itself.

Currently, the parser factory takes the URL, opens a stream from it, and
passes it to the newly created Parser object.  However, as far as I know,
the stream is never closed unless the caller calls Parser.getInputStream()
and closes it explicitly.

For my use of Tika, I am creating a Java component that will continuously
read URLs as input, and output the parsed text read from those URLs.
Ideally, a single entry point in Tika would be great, where we do something
like this:

String fulltext = Tika.getFullText(documentUrl, tikaConfigUrl);

... or perhaps to be more performant, we would create a Tika 'thing' with
the config URL and reuse that for each document:

TikaThing tikaThing = new TikaThing(tikaConfigUrl);
String fulltext = tikaThing.getFullText(documentUrl);

Another reason for Tika to open and close the stream itself is that (I am
assuming) any parser will read the entire resource from beginning to end,
so returning the stream would have little value.  However, I'm not
suggesting that we eliminate that functionality.

To sum up, I propose that when the Parser class receives a URL, it opens and
closes the stream itself.  When it receives a stream, it does NOT close the
stream itself.
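
A minimal sketch of this ownership rule, not actual Tika code: the class
name StreamOwnershipSketch and both parse overloads are hypothetical, and
the "parsing" here just decodes the stream as UTF-8 text as a stand-in for
a real parser. It assumes Java 9 or later for InputStream.readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; these are not Tika classes or methods.
final class StreamOwnershipSketch {

    // The caller supplied the stream, so the caller keeps ownership:
    // this method reads from it but never closes it.
    static String parse(InputStream in) throws IOException {
        // Stand-in for real parsing: decode everything as UTF-8 text.
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    // The caller supplied only a URL, so this method opens the stream
    // and is therefore responsible for closing it.
    static String parse(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return parse(in);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "sample text".getBytes(StandardCharsets.UTF_8));
        System.out.println(parse(in));
        in.close(); // still the caller's job in the stream case
    }
}
```

With try-with-resources, the stream opened from the URL is closed even if
parsing throws, which matters for a component that processes URLs
continuously.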

What do you think?

- Keith

-- 
View this message in context: http://www.nabble.com/Opening-and-Closing-Document-Input-Streams-tf4488928.html#a12801853
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


Re: Opening and Closing Document Input Streams

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 9/20/07, kbennett <kb...@bbsinc.biz> wrote:
> That sounds fine to me.  My main hope is that we put this functionality
> somewhere in Tika.  Otherwise, many Tika users such as myself will need to
> write this helper functionality ourselves.  I don't mind doing it, but would
> prefer that it be part of Tika so it would be reviewed and tested more
> thoroughly than I could do it myself.

That would be perfect; such a solution would keep Tika both modular
and easy to use. :-)

> Could this helper class also contain the single call document parse I
> mentioned?

Sure!

BR,

Jukka Zitting

Re: Opening and Closing Document Input Streams

Posted by kbennett <kb...@bbsinc.biz>.
Jukka -

That sounds fine to me.  My main hope is that we put this functionality
somewhere in Tika.  Otherwise, many Tika users such as myself will need to
write this helper functionality ourselves.  I don't mind doing it, but would
prefer that it be part of Tika so it would be reviewed and tested more
thoroughly than I could do it myself.

Could this helper class also contain the single call document parse I
mentioned? Come to think of it, maybe I'll start working on a TikaUtils
class for my own use, and contribute it later if/when you want it.

Thanks,
Keith



Jukka Zitting wrote:
> 
> Hi,
> 
> On 9/20/07, kbennett <kb...@bbsinc.biz> wrote:
>> To sum up, I propose that when the Parser class receives a URL, it opens
>> and
>> closes the stream itself.  When it receives a stream, it does NOT close
>> the
>> stream itself.
> 
> IMHO we should keep the parser interface as simple as possible. In
> fact I'd rather put the URL handling in a separate helper class or
> layer and keep the core parser interface stream-oriented.
> 
> BR,
> 
> Jukka Zitting
> 
> 

-- 
View this message in context: http://www.nabble.com/Opening-and-Closing-Document-Input-Streams-tf4488928.html#a12802300
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


Re: Opening and Closing Document Input Streams

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 9/20/07, kbennett <kb...@bbsinc.biz> wrote:
> To sum up, I propose that when the Parser class receives a URL, it opens and
> closes the stream itself.  When it receives a stream, it does NOT close the
> stream itself.

IMHO we should keep the parser interface as simple as possible. In
fact I'd rather put the URL handling in a separate helper class or
layer and keep the core parser interface stream-oriented.

BR,

Jukka Zitting
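
One way the layering suggested in this thread might look, as a rough sketch
under assumptions: StreamParser and UrlParsingHelper are hypothetical names,
not Tika's actual API, and the real Parser interface may look quite
different. The core interface only sees streams; the helper owns URL
handling, including opening and closing, and can be reused across documents.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical stream-oriented core interface (not Tika's real Parser).
interface StreamParser {
    // Reads from the stream and returns extracted text; never closes it.
    String parse(InputStream in) throws IOException;
}

// Hypothetical helper layer that keeps URL handling out of the parser.
final class UrlParsingHelper {
    private final StreamParser parser;

    // Create once (e.g. after loading configuration) and reuse per document.
    UrlParsingHelper(StreamParser parser) {
        this.parser = parser;
    }

    // Single-call entry point: opens the URL, parses, and always closes.
    String getFullText(URL documentUrl) throws IOException {
        try (InputStream in = documentUrl.openStream()) {
            return parser.parse(in);
        }
    }
}
```

A caller would construct the helper once and then call
getFullText(documentUrl) for each incoming URL, without ever touching a
stream directly.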