Posted to user@tika.apache.org by Alec Swan <al...@gmail.com> on 2012/05/16 23:41:01 UTC

Tika fails to extract text from very large files

Hello,

Our tests indicate that while Tika can extract text from average-sized files,
it fails to extract text from large files of certain types. In our tests Tika
extracted 0 characters from 100 MB PPTX, 60 MB DOCX, and 113 MB PDF files.
However, it extracted the correct text from a 94 MB TXT file.

Is this a limitation of Tika? How can we troubleshoot it?

Thanks,

Alec

Re: Tika fails to extract text from very large files

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 16 May 2012, Alec Swan wrote:
> Tika's parse() method takes an InputStream as a parameter, so why does it
> consume so much memory? Can't it stage the file behind the scenes? Does
> Tika try to load the entire stream into memory all the time?

Not all file formats support stream-based parsing; many can only be 
sensibly parsed in a DOM-like way. For those, the whole file needs to be 
loaded into memory (and processed!) before the parser can work on them. 
PDF, DOCX and friends are among the formats for which this is the case.

Also, some parsers work better with a File, so if you're low on memory try 
using TikaInputStream.get(File); it may make a small difference.
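
For example, a minimal sketch of that approach (the file path is illustrative,
and AutoDetectParser/BodyContentHandler are just one way to drive the parse):

    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class FileBackedParse {
        public static void main(String[] args) throws Exception {
            File file = new File("/path/to/large.docx");  // illustrative path
            // A file-backed TikaInputStream lets parsers that prefer a File
            // work against the file directly instead of buffering the stream.
            try (TikaInputStream stream = TikaInputStream.get(file)) {
                BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
                new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
                System.out.println(handler.toString().length() + " characters extracted");
            }
        }
    }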

Nick

Re: Tika fails to extract text from very large files

Posted by Alec Swan <al...@gmail.com>.
Nick, you were right. We tracked down the code that was swallowing the
exception. After that I gave it 1024 MB of heap space, and it still ran out
of memory while parsing a 60 MB DOCX.

Tika's parse() method takes an InputStream as a parameter, so why does it
consume so much memory? Can't it stage the file behind the scenes? Does
Tika try to load the entire stream into memory all the time?

On Wed, May 16, 2012 at 4:08 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Memory consumption stays under 90 MB, which is less than the max heap
>> size (128 MB). No out-of-memory errors are thrown during the test.
>
>
> There is absolutely no way that you're going to be able to parse a PDF,
> DOC/DOCX or PPT/PPTX of more than about 20 MB in size on a 128 MB heap (and
> even that may be pushing it for some of them). Something is blowing up; I'd
> make sure you're not accidentally eating the exception.
>
> Nick

Re: Tika fails to extract text from very large files

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 16 May 2012, Alec Swan wrote:
> Memory consumption stays under 90 MB, which is less than the max heap
> size (128 MB). No out-of-memory errors are thrown during the test.

There is absolutely no way that you're going to be able to parse a PDF, 
DOC/DOCX or PPT/PPTX of more than about 20 MB in size on a 128 MB heap (and 
even that may be pushing it for some of them). Something is blowing up; I'd 
make sure you're not accidentally eating the exception.
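
As a purely illustrative sketch, the silent zero-character result usually
comes from a catch block that hides the failure; the hypothetical helper
below lets the exception propagate instead of swallowing it:

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractText {
        // Hypothetical helper: parse a stream and surface failures instead of
        // hiding them. A broad "catch (Throwable t) {}" around a call like
        // this would turn any parse failure into a silent empty result.
        static String extractText(InputStream stream) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
            return handler.toString();
        }
    }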

Nick

Re: Tika fails to extract text from very large files

Posted by Alec Swan <al...@gmail.com>.
Memory consumption stays under 90 MB, which is less than the max heap size
(128 MB). No out-of-memory errors are thrown during the test.

On Wed, May 16, 2012 at 3:45 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Our tests indicate that while Tika can extract text from average-sized
>> files, it fails to extract text from large files of certain types. In our
>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX, and 113 MB
>> PDF files. However, it extracted the correct text from a 94 MB TXT file.
>
>
> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
> which can only be parsed by building a DOM-like structure in memory, so they
> need more memory available to them. XLS/XLSX, amongst a few others, can be
> parsed in a largely streaming manner, and so have a lower footprint. (It all
> depends on how the file format is laid out internally.)
>
> Nick

Re: Tika fails to extract text from very large files

Posted by Alec Swan <al...@gmail.com>.
Could you please clarify the "fork parser" and "Tika server" concepts? Do
both of them require spawning and managing external processes that perform
the actual file parsing?

On Thu, May 17, 2012 at 10:12 AM, Nick Burch <ni...@alfresco.com> wrote:
> On Thu, 17 May 2012, Alec Swan wrote:
>>
>> 1. We don't know how to tell when there isn't enough heap space to
>> process a file, so that we can skip the file in that case. Allowing
>> out-of-memory errors to take down our process is not acceptable.
>
>
> In that kind of situation, you should be looking at using something like
> the fork parser or the Tika server.
>
>
>> 2. When we use 1024 MB of heap and try to parse a large PDF file, at some
>> point it starts printing the following error non-stop. In fact, I forgot
>> to kill my process and it ran overnight, printing this every second or so:
>> May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
>> SEVERE: Stop reading corrupt stream
>
>
> That looks like a PDFBox bug; you should try reporting it upstream.
>
> Nick

Re: Tika fails to extract text from very large files

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 17 May 2012, Alec Swan wrote:
> 1. We don't know how to tell when there isn't enough heap space to
> process a file, so that we can skip the file in that case. Allowing
> out-of-memory errors to take down our process is not acceptable.

In that kind of situation, you should be looking at using something like
the fork parser or the Tika server.
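
For illustration, a rough sketch of the fork parser route (heap size and file
path are illustrative); the actual parsing happens in a separate JVM, so an
OutOfMemoryError there doesn't take down the calling process:

    import java.io.File;
    import java.io.InputStream;
    import org.apache.tika.fork.ForkParser;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ForkParseExample {
        public static void main(String[] args) throws Exception {
            ForkParser parser = new ForkParser(
                    ForkParseExample.class.getClassLoader(), new AutoDetectParser());
            parser.setJavaCommand("java -Xmx1024m");  // heap for the forked JVM
            try (InputStream stream =
                         TikaInputStream.get(new File("/path/to/large.pdf"))) {
                BodyContentHandler handler = new BodyContentHandler(-1);
                parser.parse(stream, handler, new Metadata(), new ParseContext());
                System.out.println(handler.toString().length() + " characters extracted");
            } finally {
                parser.close();  // shuts down the forked JVM(s)
            }
        }
    }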

> 2. When we use 1024 MB of heap and try to parse a large PDF file, at some
> point it starts printing the following error non-stop. In fact, I forgot
> to kill my process and it ran overnight, printing this every second or so:
> May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: Stop reading corrupt stream

That looks like a PDFBox bug; you should try reporting it upstream.

Nick

Re: Tika fails to extract text from very large files

Posted by Alec Swan <al...@gmail.com>.
So, we have two problems:

1. We don't know how to tell when there isn't enough heap space to process
a file, so that we can skip the file in that case. Allowing out-of-memory
errors to take down our process is not acceptable.

2. When we use 1024 MB of heap and try to parse a large PDF file, at some
point it starts printing the following error non-stop. In fact, I forgot to
kill my process and it ran overnight, printing this every second or so:
May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream

Thanks,

Alec

On Thu, May 17, 2012 at 2:54 AM, Alex Ott <al...@gmail.com> wrote:
> Processing PPT & DOC files could be implemented in almost constant space
> (if we don't store the whole text in memory, but pass chunks of text to
> the handler)...
>
> P.S. I'm sorry that I can't give more details about it.
>
> On Wed, May 16, 2012 at 11:45 PM, Nick Burch <ni...@alfresco.com> wrote:
>> On Wed, 16 May 2012, Alec Swan wrote:
>>>
>>> Our tests indicate that while Tika can extract text from average-sized
>>> files, it fails to extract text from large files of certain types. In our
>>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX, and 113 MB
>>> PDF files. However, it extracted the correct text from a 94 MB TXT file.
>>
>>
>> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
>> which can only be parsed by building a DOM-like structure in memory, so they
>> need more memory available to them. XLS/XLSX, amongst a few others, can be
>> parsed in a largely streaming manner, and so have a lower footprint. (It all
>> depends on how the file format is laid out internally.)
>>
>> Nick
>
>
>
> --
> With best wishes,                    Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
> Skype: alex.ott

Re: Tika fails to extract text from very large files

Posted by Alex Ott <al...@gmail.com>.
Processing PPT & DOC files could be implemented in almost constant space
(if we don't store the whole text in memory, but pass chunks of text to the
handler)...

P.S. I'm sorry that I can't give more details about it.
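
As an illustrative sketch of the "pass chunks of text to the handler" idea
(output and input paths are made up), the extracted text can be written
straight to disk as it arrives instead of being buffered in memory; note that
this only bounds the memory used for the text, not for the parser's own
document structures:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class StreamTextToFile {
        public static void main(String[] args) throws Exception {
            try (Writer out = new OutputStreamWriter(
                         new FileOutputStream("extracted.txt"), "UTF-8");
                 TikaInputStream stream =
                         TikaInputStream.get(new File("/path/to/large.ppt"))) {
                // BodyContentHandler(Writer) forwards each SAX character chunk
                // to the writer as it arrives, rather than accumulating the
                // whole text in memory.
                BodyContentHandler handler = new BodyContentHandler(out);
                new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
            }
        }
    }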

On Wed, May 16, 2012 at 11:45 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Our tests indicate that while Tika can extract text from average-sized
>> files, it fails to extract text from large files of certain types. In our
>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX, and 113 MB
>> PDF files. However, it extracted the correct text from a 94 MB TXT file.
>
>
> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
> which can only be parsed by building a DOM-like structure in memory, so they
> need more memory available to them. XLS/XLSX, amongst a few others, can be
> parsed in a largely streaming manner, and so have a lower footprint. (It all
> depends on how the file format is laid out internally.)
>
> Nick



-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Re: Tika fails to extract text from very large files

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 16 May 2012, Alec Swan wrote:
> Our tests indicate that while Tika can extract text from average-sized
> files, it fails to extract text from large files of certain types. In our
> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX, and 113 MB
> PDF files. However, it extracted the correct text from a 94 MB TXT file.

Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats 
which can only be parsed by building a DOM-like structure in memory, so 
they need more memory available to them. XLS/XLSX, amongst a few others, 
can be parsed in a largely streaming manner, and so have a lower footprint. 
(It all depends on how the file format is laid out internally.)

Nick