Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/08/28 20:49:32 UTC
RE: TIKA - how to read chunks at a time from a very large file?
Probably a better question for the user list.
Extending a ContentHandler and using that in ContentHandlerDecorator is pretty straightforward.
Would it be easy enough to write to file by passing in an OutputStream to WriteOutContentHandler?
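Extending a handler in the way described above might look like the following sketch. It uses only the JDK's SAX classes (Tika's ContentHandlerDecorator wraps another handler in the same spirit); the class name, chunk size, and the Consumer callback are all hypothetical choices, not anything from Tika's API:

```java
import org.xml.sax.helpers.DefaultHandler;

// A SAX handler that accumulates character events and hands fixed-size
// chunks to a callback, instead of holding the whole document in memory.
class ChunkingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int chunkSize;
    private final java.util.function.Consumer<String> sink;

    ChunkingHandler(int chunkSize, java.util.function.Consumer<String> sink) {
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit full chunks as soon as they are available.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left at the end of the parse.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

The same instance could be passed to Parser.parse(), so each chunk is handed off as the parser produces it and memory use stays bounded by the chunk size.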
-----Original Message-----
From: ruby [mailto:rshossain@gmail.com]
Sent: Thursday, August 28, 2014 2:07 PM
To: tika-dev@lucene.apache.org
Subject: TIKA - how to read chunks at a time from a very large file?
Using a ContentHandler, is there a way to read chunks at a time from a very
large file (over 5GB)? Right now I'm doing the following, which reads the
entire content at once:
InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();
p.parse(stream, handler, meta, context);
String content = handler.toString();
Since the files contain over 5GB of data, the content string here ends up
holding too much data in memory. I want to avoid this and read a chunk at a
time instead.
I tried ParsingReader and I can read chunks with it, but then we are splitting
on words. Some of the files contain Chinese/Japanese text, so we can't split
on whitespace either.
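One way to chunk without relying on whitespace is to split on character
boundaries instead of word boundaries. The sketch below uses the JDK's
java.text.BreakIterator so a chunk never cuts through a surrogate pair
(common outside the Basic Multilingual Plane) or a combining sequence; the
method name and maxChars parameter are illustrative, not part of Tika:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Split text into chunks of at most maxChars chars, ending each chunk on a
// character boundary rather than on whitespace.
static List<String> chunkByCharacters(String text, int maxChars) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(text);
    List<String> chunks = new ArrayList<>();
    int chunkStart = 0;
    int prev = it.first();
    for (int b = it.next(); b != BreakIterator.DONE; prev = b, b = it.next()) {
        // If including the next character would overflow, cut at the
        // previous boundary.
        if (b - chunkStart > maxChars && prev > chunkStart) {
            chunks.add(text.substring(chunkStart, prev));
            chunkStart = prev;
        }
    }
    if (chunkStart < text.length()) {
        chunks.add(text.substring(chunkStart));  // trailing remainder
    }
    return chunks;
}
```

For example, chunkByCharacters("abcdef", 2) yields ["ab", "cd", "ef"], and a
supplementary character such as U+1D4B3 (two Java chars) is never split
across chunks.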
--
View this message in context: http://lucene.472066.n3.nabble.com/TIKA-how-to-read-chunks-at-a-time-from-a-very-large-file-tp4155644.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.