Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/08/28 20:49:32 UTC

RE: TIKA - how to read chunks at a time from a very large file?

Probably better question for the user list.

Extending a ContentHandler and using that in ContentHandlerDecorator is pretty straightforward.
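A rough sketch of that idea (this is plain SAX, not a Tika-specific API — the class name `ChunkingHandler` and the chunk-size parameter are just illustrative): extend DefaultHandler, buffer the character events, and hand off a chunk whenever the buffer fills, so the full text never sits in memory. An instance of this can be passed to parser.parse(...) directly, or wrapped in a ContentHandlerDecorator.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.util.ArrayList;
import java.util.List;

class ChunkingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int chunkSize;
    final List<String> chunks = new ArrayList<>();  // stand-in for real processing

    ChunkingHandler(int chunkSize) { this.chunkSize = chunkSize; }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        buffer.append(ch, start, length);
        // Hand off full chunks as soon as they are available.
        while (buffer.length() >= chunkSize) {
            process(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {
            process(buffer.toString());  // flush the final partial chunk
            buffer.setLength(0);
        }
    }

    private void process(String chunk) {
        chunks.add(chunk);  // replace with indexing, writing to disk, etc.
    }
}
```

Chunking by character count here means no tokenization on whitespace is involved at all, which sidesteps the CJK issue mentioned below.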

Would it be easy enough to write to file by passing in an OutputStream to WriteOutContentHandler?
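Something like the following, for instance (untested sketch; assumes the tika-core and tika-parsers jars on the classpath, and the file names are placeholders). WriteOutContentHandler has an OutputStream constructor that writes character events out as UTF-8 as they arrive:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.WriteOutContentHandler;

try (InputStream in = Files.newInputStream(Paths.get("huge-input.pdf"));
     OutputStream out = Files.newOutputStream(Paths.get("extracted.txt"))) {
    // Characters are written straight to the file as the parser emits
    // them, so the extracted text never accumulates on the heap.
    new AutoDetectParser().parse(in, new WriteOutContentHandler(out),
            new Metadata(), new ParseContext());
}
```

The extracted text can then be read back from disk in whatever chunk size suits the downstream processing.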

-----Original Message-----
From: ruby [mailto:rshossain@gmail.com] 
Sent: Thursday, August 28, 2014 2:07 PM
To: tika-dev@lucene.apache.org
Subject: TIKA - how to read chunks at a time from a very large file?

Using a ContentHandler, is there a way to read chunks at a time from a very
large file (over 5 GB)? Right now I'm doing the following, which reads the
entire content at once:

InputStream stream = new FileInputStream(file);
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
WriteOutContentHandler handler = new WriteOutContentHandler(-1); // -1 disables the write limit
ParseContext context = new ParseContext();
parser.parse(stream, handler, metadata, context);
String content = handler.toString();

Since the files contain over 5 GB of data, the content string here ends up
holding too much data in memory. I want to avoid this and read a chunk at a
time instead.

I tried ParsingReader, and I can read chunks with it, but we end up splitting
in the middle of words. Some of the files contain Chinese/Japanese text, so
we can't split on whitespace either.




