Posted to dev@tika.apache.org by ruby <rs...@gmail.com> on 2014/08/28 20:06:54 UTC

TIKA - how to read chunks at a time from a very large file?

Using a ContentHandler, is there a way to read chunks at a time from a very
large file (over 5GB)? Right now I'm doing the following, which reads the
entire content at once:

InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
// -1 disables the write limit, so nothing is truncated
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();
p.parse(stream, handler, meta, context);
String content = handler.toString(); // the whole document, buffered in memory

Since the files contain over 5GB of data, the content string here ends
up holding too much data in memory. I want to avoid this and read a
chunk at a time instead.

I tried ParsingReader, and it does let me read chunks, but the chunks
split in the middle of words. Some of the files contain Chinese/Japanese
text, so we can't split on whitespace either.
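
For reference, my ParsingReader attempt looks roughly like this (a
minimal sketch; handleChunk is a placeholder for the real processing):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ParsingReader;

public class ParsingReaderChunks {
    public static void readInChunks(File file) throws IOException {
        InputStream stream = new FileInputStream(file);
        // ParsingReader runs the parse in a background thread and exposes
        // the extracted plain text as a character stream
        try (Reader reader = new ParsingReader(new AutoDetectParser(),
                stream, new Metadata(), new ParseContext())) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                // a chunk ends wherever the buffer happens to fill,
                // so the boundary can land in the middle of a word
                handleChunk(new String(buffer, 0, n));
            }
        }
    }

    private static void handleChunk(String chunk) {
        // placeholder
    }
}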






RE: TIKA - how to read chunks at a time from a very large file?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Probably a better question for the user list.

Extending a ContentHandler and using that in ContentHandlerDecorator is pretty straightforward.

Would it be easy enough to write to a file by passing an OutputStream to WriteOutContentHandler?
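
Something like this, roughly (a minimal sketch; the output file is
illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.SAXException;

public class ExtractToFile {
    public static void extract(File input, File output)
            throws IOException, SAXException, TikaException {
        try (InputStream in = new FileInputStream(input);
             OutputStream out = new FileOutputStream(output)) {
            Parser parser = new AutoDetectParser();
            // streams the extracted text to the file as UTF-8 instead
            // of buffering it all in a String
            WriteOutContentHandler handler = new WriteOutContentHandler(out);
            parser.parse(in, handler, new Metadata(), new ParseContext());
        }
    }
}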



RE: TIKA - how to read chunks at a time from a very large file?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
My belief in making that recommendation was that a given document wouldn't split a word across an "element".  I can, of course, think of exceptions (word break at the end of a PDF page, for example), but generally, my assumption is that this wouldn't happen very often.  However, if this does happen often with your documents, or if a single element is too large to hold in memory, then that recommendation won't work, and you'll probably have to write to disk.
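
For example (a rough sketch; the chunk size and the writeChunk sink are
placeholders):

import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

// Buffer character events and flush only at element boundaries, so a
// chunk never ends mid-word (whitespace-free scripts included).
public class ElementChunkHandler extends ContentHandlerDecorator {
    private static final int MAX_CHUNK_CHARS = 10 * 1024 * 1024;
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (buffer.length() >= MAX_CHUNK_CHARS) {
            writeChunk(buffer.toString());
            buffer.setLength(0);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (buffer.length() > 0) {
            writeChunk(buffer.toString()); // flush the tail
            buffer.setLength(0);
        }
    }

    private void writeChunk(String chunk) {
        // placeholder: write to disk, hand off to an indexer, etc.
    }
}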

________________________________________
From: ruby [rshossain@gmail.com]
Sent: Thursday, August 28, 2014 3:26 PM
To: tika-dev@lucene.apache.org
Subject: Re: TIKA - how to read chunks at a time from a very large file?

If I extend the ContentHandler, is there a way to make sure that I don't
split words?





Re: TIKA - how to read chunks at a time from a very large file?

Posted by ruby <rs...@gmail.com>.
If I extend the ContentHandler, is there a way to make sure that I don't
split words?





Re: TIKA - how to read chunks at a time from a very large file?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 28 Aug 2014, ruby wrote:
> Since the files contain over 5GB of data, the content string here ends
> up holding too much data in memory. I want to avoid this and read a
> chunk at a time instead.

You'll probably need your own custom ContentHandler, which detects when
there's too much data and flushes it / starts a new file / etc.

There's an example of how to do this in the tika-examples package, look at
parseToPlainTextChunks from ContentHandlerExample:
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java

Basically though, you'll want to extend ContentHandlerDecorator (which
takes care of most of the basics for you), then write your own logic to
handle outputting / flushing / chunking as per your needs.
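
From memory, the core of that example looks something like this (the
chunk size is illustrative, and you'd pass the handler into parse() as
usual):

import java.util.ArrayList;
import java.util.List;

import org.apache.tika.sax.ContentHandlerDecorator;

// inside your parsing method:
final int MAXIMUM_TEXT_CHUNK_SIZE = 40 * 1024 * 1024; // illustrative
final List<String> chunks = new ArrayList<String>();
chunks.add("");

ContentHandlerDecorator handler = new ContentHandlerDecorator() {
    @Override
    public void characters(char[] ch, int start, int length) {
        String lastChunk = chunks.get(chunks.size() - 1);
        String thisStr = new String(ch, start, length);
        // start a new chunk once the current one would exceed the limit
        if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
            chunks.add(thisStr);
        } else {
            chunks.set(chunks.size() - 1, lastChunk + thisStr);
        }
    }
};
// then: new AutoDetectParser().parse(stream, handler, metadata, context);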

Nick