Posted to user@poi.apache.org by Norman M <mn...@yahoo.com> on 2012/11/09 23:47:31 UTC

Is POI really using streaming to parse files?

I am using Apache Tika to extract text from PPT/PPTX files.

Tika uses Apache POI to extract the text.

I compared the processing time and memory usage of POI against Aspose (www.aspose.com).

The processing time and memory requirement for Tika (i.e. POI) are almost double those of Aspose.

Is POI really using streaming to parse files? Why does it take so much more memory than Aspose, which I thought reads the whole file into memory?

I found this thread http://lucene.472066.n3.nabble.com/Large-xls-files-always-loaded-into-memory-td646710.html where the Tika founder claims that POI is not streaming input files. That thread is quite old; is it still the case?

Any response will be appreciated.

Thanks,

Re: Is Tika really using streaming to parse files?

Posted by goog cheng <go...@gmail.com>.
I have the same question. I call Tika from Python; the CPU goes to 100% and
then it crashes ...


2012/11/10 Norman M <mn...@yahoo.com>

> I am using Apache Tika to extract text from PPT/PPTX files.
>
> Tika uses Apache POI to extract the text.
>
> I compared the processing time and memory usage of POI against Aspose
> (www.aspose.com).
>
> The processing time and memory requirement for Tika (i.e. POI) are almost
> double those of Aspose.
>
> Is POI really using streaming to parse files? Why does it take so much more
> memory than Aspose, which I thought reads the whole file into memory?
>
> I found this thread
> http://lucene.472066.n3.nabble.com/Large-xls-files-always-loaded-into-memory-td646710.html
> where the Tika founder claims that POI is not streaming input files. That
> thread is quite old; is it still the case?
>
> My goal is to minimize the memory requirement.
>
> Here is my code
>
> ParseContext context = new ParseContext();
> Detector detector = new DefaultDetector();
> Parser parser = new AutoDetectParser(detector);
> context.set(Parser.class, parser);
> Metadata metadata = new Metadata();
>
> File file = new File("temp.ppt");
> URL url = file.toURI().toURL();
> ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
>
> InputStream input = TikaInputStream.get(url, metadata);
> ContentHandler handler = new BodyContentHandler(outputStream);
>
> parser.parse(input, handler, metadata, context);
>
> String extractedText = outputStream.toString();
>
> It looks like the whole extracted text is written to the output stream,
> which may be the reason for the large memory consumption. How can I keep
> memory usage as low as possible?
>
>  Any response will be appreciated.
>
> Thanks,
>

Re: Is Tika really using streaming to parse files?

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 9 Nov 2012, Norman M wrote:
> I am using Apache Tika to extract text from PPT/PPTX files.
>
> Is Poi really using streaming to parse files?

Some bits. XLS file processing is stream-based; for PPT, the whole file
gets processed and then the text parts are located and picked out.

> File file = new File("temp.ppt");
> URL url = file.toURI().toURL();
> ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
>
> InputStream input = TikaInputStream.get(url, metadata);

Is there a reason why you're not passing the file to TikaInputStream, but 
going via the URL instead?


> ContentHandler handler = new BodyContentHandler(outputStream);
> parser.parse(input, handler, metadata, context);
> String extractedText = outputStream.toString();

The text you extract will probably be fairly small, but the code above
means it all has to get buffered first. You might want to look at
processing the SAX events as they come in, instead of buffering
everything, to reduce memory use, especially for very large amounts of text.
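
A minimal sketch of the "process the SAX events as they come in" idea, using only the JDK's SAX classes so that it runs without Tika on the classpath. The class name ForwardingTextHandler and its demo method are hypothetical, and the JDK XML parser stands in for Tika here. With Tika, a handler like this would presumably be passed as the ContentHandler argument of parser.parse(...) in place of a BodyContentHandler wrapped around a ByteArrayOutputStream.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards each chunk of character data to a Writer
// as soon as the parser reports it, instead of collecting it in one buffer.
class ForwardingTextHandler extends DefaultHandler {
    private final Writer out;

    ForwardingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Only the current chunk is touched; nothing is accumulated here.
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    // Demo driver. Assumption: the JDK SAX parser plays the role that
    // Tika's parser would play in the original code.
    static String demo(String xml) {
        StringWriter sink = new StringWriter(); // stand-in for a file or socket writer
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)),
                           new ForwardingTextHandler(sink));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo("<html><body><p>Slide 1</p><p>Slide 2</p></body></html>"));
    }
}
```

In the demo the sink is a StringWriter only so that the result can be printed. To actually save memory, the Writer would need to be something that drains as it goes, such as a FileWriter.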

Nick

Is Tika really using streaming to parse files?

Posted by Norman M <mn...@yahoo.com>.
I am using Apache Tika to extract text from PPT/PPTX files.

Tika uses Apache POI to extract the text.

I compared the processing time and memory usage of POI against Aspose (www.aspose.com).

The processing time and memory requirement for Tika (i.e. POI) are almost double those of Aspose.

Is POI really using streaming to parse files? Why does it take so much more memory than Aspose, which I thought reads the whole file into memory?

I found this thread http://lucene.472066.n3.nabble.com/Large-xls-files-always-loaded-into-memory-td646710.html where the Tika founder claims that POI is not streaming input files. That thread is quite old; is it still the case?

My goal is to minimize the memory requirement.

Here is my code

ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
Metadata metadata = new Metadata();

File file = new File("temp.ppt");
URL url = file.toURI().toURL();
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputStream);

parser.parse(input, handler, metadata, context);

String extractedText = outputStream.toString();

It looks like the whole extracted text is written to the output stream, which may be the reason for the large memory consumption. How can I keep memory usage as low as possible?

Any response will be appreciated.

Thanks,

Re: Is POI really using streaming to parse files?

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 9 Nov 2012, Norman M wrote:
> Is POI really using streaming to parse files? Why does it take so much more
> memory than Aspose, which I thought reads the whole file into memory?

Depends, what bit of POI are you using? And are you passing in a File or 
an InputStream?

If you use NPOIFSFileSystem with a File rather than an InputStream, it'll
do very low memory access to the underlying OLE2 container. If you then
use event-based HSSF processing, it'll do stream-based processing of the
Excel contents and need very little memory.

If you use POIFSFileSystem from a stream, and the HSSF UserModel, it'll
buffer everything in memory and need lots.
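
To illustrate why a File can be cheaper than an InputStream, here is a rough sketch that is not POI code. It assumes the saving comes from random access: with a file, a reader can fetch a single fixed-size block on demand, whereas a stream generally has to be read through and buffered. The class name OnDemandBlockReader, the 512-byte block size, and the demo method are all made up for the illustration and say nothing about how POI lays out or reads OLE2 containers internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical illustration: reads one fixed-size block at a time from a
// file, so memory use is one block regardless of the file size.
class OnDemandBlockReader implements AutoCloseable {
    private final FileChannel channel;
    private final int blockSize;

    OnDemandBlockReader(Path file, int blockSize) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.blockSize = blockSize;
    }

    // Reads only the requested block; returns a shorter array at end of file.
    byte[] readBlock(long index) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(blockSize);
        long start = index * blockSize;
        while (buf.hasRemaining()) {
            // Positional read: does not move the channel's own position.
            int n = channel.read(buf, start + buf.position());
            if (n < 0) {
                break; // reached end of file
            }
        }
        return Arrays.copyOf(buf.array(), buf.position());
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    // Demo: writes three 512-byte blocks, each filled with its own index,
    // then reads back only the last one.
    static byte[] demo() {
        try {
            Path tmp = Files.createTempFile("blocks", ".bin");
            try {
                byte[] data = new byte[3 * 512];
                for (int i = 0; i < data.length; i++) {
                    data[i] = (byte) (i / 512);
                }
                Files.write(tmp, data);
                try (OnDemandBlockReader reader = new OnDemandBlockReader(tmp, 512)) {
                    return reader.readBlock(2);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] block = demo();
        System.out.println(block.length + " bytes read, first byte = " + block[0]);
    }
}
```

The same access pattern is not available from a plain InputStream, which is one plausible reason the stream-based path ends up buffering everything.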

Nick
