Posted to user@tika.apache.org by Alec Swan <al...@gmail.com> on 2012/05/16 23:41:01 UTC
Tika fails to extract text from very large files
Hello,
Our tests indicate that while Tika can extract text from average-sized files,
it fails to extract text from large files of certain types. In our
tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and 113
MB PDF files. However, it extracted the right text from a 94 MB TXT file.
Is this Tika's limitation? How can we troubleshoot this?
Thanks,
Alec
Re: Tika fails to extract text from very large files
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 16 May 2012, Alec Swan wrote:
> Tika's parse() method takes an InputStream as a parameter, so why
> does it consume so much memory? Can't it stage the file behind the
> scenes? Does Tika try to load the entire stream into memory all the
> time?
Not all file formats support stream-based parsing; many can only be
sensibly parsed in a DOM-like way. For those, the whole file needs to be
loaded into memory (and processed!) before the parser can work on it.
PDF, DOCX and friends are among the formats for which this is the case.
Also, some parsers work better with a File, so if you're low on memory, try
using TikaInputStream.get(File); it may make a small difference.
Nick
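Nick's TikaInputStream.get(File) suggestion can be sketched roughly like this (the class name, the AutoDetectParser choice, and the handler setup are assumptions, not from the thread):

```java
import java.io.File;
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class FileBackedParse {
    public static String parseFile(File file) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables BodyContentHandler's default character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        // A TikaInputStream backed by a File lets file-oriented parsers
        // (PDFBox, POI, ...) read from disk instead of buffering the
        // whole stream in memory themselves
        try (InputStream stream = TikaInputStream.get(file)) {
            parser.parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }
}
```

Note that BodyContentHandler(-1) still accumulates all extracted text in memory; it only removes the character limit, it does not reduce the parser's own footprint.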
Re: Tika fails to extract text from very large files
Posted by Alec Swan <al...@gmail.com>.
Nick, you were right. We tracked down the code that was swallowing the
exception. After that, I gave it 1024 MB of heap space and it still ran
out of memory while parsing a 60 MB DOCX.
Tika's parse() method takes an InputStream as a parameter, so why
does it consume so much memory? Can't it stage the file behind the
scenes? Does Tika try to load the entire stream into memory all the
time?
On Wed, May 16, 2012 at 4:08 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Memory consumption stays under 90 MB, which is less than the max heap size
>> (128 MB). No out-of-memory errors are thrown during the test.
>
>
> There is absolutely no way that you're going to be able to parse a PDF,
> DOC/DOCX or PPT/PPTX of more than about 20 MB in size on a 128 MB heap (and
> even that may be pushing it for some of them). Something is blowing up; I'd
> make sure you're not accidentally eating the exception.
>
> Nick
Re: Tika fails to extract text from very large files
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 16 May 2012, Alec Swan wrote:
> Memory consumption stays under 90 MB, which is less than the max heap size
> (128 MB). No out-of-memory errors are thrown during the test.
There is absolutely no way that you're going to be able to parse a PDF,
DOC/DOCX or PPT/PPTX of more than about 20 MB in size on a 128 MB heap (and
even that may be pushing it for some of them). Something is blowing up; I'd
make sure you're not accidentally eating the exception.
Nick
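"Eating" the exception usually means a catch block that silently returns an empty result; a hedged sketch of surfacing failures instead (the class and method names are assumptions):

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class SafeExtract {
    public static String extract(InputStream stream) throws IOException {
        BodyContentHandler handler = new BodyContentHandler(-1);
        try {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        } catch (TikaException | SAXException e) {
            // Rethrow rather than returning "" -- an empty result here is
            // indistinguishable from "the file genuinely had no text"
            throw new IOException("Extraction failed: " + e.getMessage(), e);
        }
        return handler.toString();
    }
}
```

Also worth remembering: OutOfMemoryError is an Error, not an Exception, so a plain `catch (Exception e)` never sees it and the symptom can look like a silent 0-character result higher up.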
Re: Tika fails to extract text from very large files
Posted by Alec Swan <al...@gmail.com>.
Memory consumption stays under 90 MB, which is less than the max heap size
(128 MB). No out-of-memory errors are thrown during the test.
On Wed, May 16, 2012 at 3:45 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Our tests indicate that while Tika can extract text from average-sized files,
>> it fails to extract text from large files of certain types. In our
>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and 113
>> MB PDF files. However, it extracted the right text from a 94 MB TXT file.
>
>
> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
> which can only be parsed by building a DOM-like structure in memory, so they
> need more memory available to them. XLS/XLSX, amongst a few others, can be
> done in a largely streaming manner, so have a lower footprint. (It all
> depends on how the file format is laid out internally)
>
> Nick
Re: Tika fails to extract text from very large files
Posted by Alec Swan <al...@gmail.com>.
Could you please clarify the "fork parser" and "Tika server" concepts? Do
both of them require spawning and managing external processes which
perform the actual file parsing?
On Thu, May 17, 2012 at 10:12 AM, Nick Burch <ni...@alfresco.com> wrote:
> On Thu, 17 May 2012, Alec Swan wrote:
>>
>> 1. We don't know how to tell if we don't have enough heap space to
>> process the file, so we can skip the file in that case. Allowing
>> out-of-memory errors to take down our process is not acceptable.
>
>
> In that kind of situation, you should be looking at using something like
> the fork parser or the Tika server.
>
>
> 2. When we use 1024 MB of heap and try to parse a large PDF file, at
> some point it starts printing the following error non-stop. In fact I
> forgot to kill my process and it ran overnight, printing this every
> second or so:
>> May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
>> SEVERE: Stop reading corrupt stream
>
>
> That looks like a PDFBox bug; you should try reporting it upstream.
>
> Nick
Re: Tika fails to extract text from very large files
Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 17 May 2012, Alec Swan wrote:
> 1. We don't know how to tell if we don't have enough heap space to
> process the file, so we can skip the file in that case. Allowing
> out-of-memory errors to take down our process is not acceptable.
In that kind of situation, you should be looking at using something like
the fork parser or the Tika server.
> 2. When we use 1024 MB of heap and try to parse a large PDF file, at
> some point it starts printing the following error non-stop. In fact I
> forgot to kill my process and it ran overnight, printing this every
> second or so:
> May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: Stop reading corrupt stream
That looks like a PDFBox bug; you should try reporting it upstream.
Nick
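For reference, the fork parser runs the actual parsing in a separate child JVM, so an OutOfMemoryError there fails that one document instead of taking down the main process. A rough sketch (error handling and configuration kept minimal):

```java
import java.io.InputStream;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkedExtract {
    public static String parse(InputStream stream) throws Exception {
        // The child JVM does the parsing; its heap can be sized
        // independently of this process (see ForkParser.setJavaCommand)
        ForkParser parser = new ForkParser(
                ForkedExtract.class.getClassLoader(), new AutoDetectParser());
        try {
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(stream, handler, new Metadata(), new ParseContext());
            return handler.toString();
        } finally {
            parser.close();
        }
    }
}
```

The Tika server is the other option Nick mentions: a standalone HTTP process you send files to, which isolates crashes the same way but across a network boundary instead of a forked JVM.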
Re: Tika fails to extract text from very large files
Posted by Alec Swan <al...@gmail.com>.
So, we have two problems:
1. We don't know how to tell if we don't have enough heap space to
process the file, so we can skip the file in that case. Allowing
out-of-memory errors to take down our process is not acceptable.
2. When we use 1024 MB of heap and try to parse a large PDF file, at
some point it starts printing the following error non-stop. In fact I
forgot to kill my process and it ran overnight, printing this every
second or so:
May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
Thanks,
Alec
On Thu, May 17, 2012 at 2:54 AM, Alex Ott <al...@gmail.com> wrote:
> Processing PPT & DOC files could be implemented in almost constant
> space (if we don't store the whole text in memory, but pass chunks of text
> to the handler)...
>
> P.S. I'm sorry that I can't give more details about it.
>
> On Wed, May 16, 2012 at 11:45 PM, Nick Burch <ni...@alfresco.com> wrote:
>> On Wed, 16 May 2012, Alec Swan wrote:
>>>
>>> Our tests indicate that while Tika can extract text from average-sized files,
>>> it fails to extract text from large files of certain types. In our
>>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and 113
>>> MB PDF files. However, it extracted the right text from a 94 MB TXT file.
>>
>>
>> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
>> which can only be parsed by building a DOM-like structure in memory, so they
>> need more memory available to them. XLS/XLSX, amongst a few others, can be
>> done in a largely streaming manner, so have a lower footprint. (It all
>> depends on how the file format is laid out internally)
>>
>> Nick
>
>
>
> --
> With best wishes, Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
> Skype: alex.ott
Re: Tika fails to extract text from very large files
Posted by Alex Ott <al...@gmail.com>.
Processing PPT & DOC files could be implemented in almost constant
space (if we don't store the whole text in memory, but pass chunks of text
to the handler)...
P.S. I'm sorry that I can't give more details about it.
On Wed, May 16, 2012 at 11:45 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Our tests indicate that while Tika can extract text from average-sized files,
>> it fails to extract text from large files of certain types. In our
>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and 113
>> MB PDF files. However, it extracted the right text from a 94 MB TXT file.
>
>
> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
> which can only be parsed by building a DOM-like structure in memory, so they
> need more memory available to them. XLS/XLSX, amongst a few others, can be
> done in a largely streaming manner, so have a lower footprint. (It all
> depends on how the file format is laid out internally)
>
> Nick
--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott
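Alex's "pass chunks of text to the handler" point is how Tika's SAX-style output side already works: give BodyContentHandler a Writer and extracted text is flushed out as it is produced, instead of accumulating in a String. A sketch (the Writer target and names are assumptions):

```java
import java.io.InputStream;
import java.io.Writer;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class StreamingExtract {
    // Text is pushed to the Writer chunk by chunk as the parser emits it,
    // so the extracted text never has to sit in memory all at once
    public static void extractTo(InputStream stream, Writer out) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(out);
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
    }
}
```

This bounds the memory used by the extracted text, but not the parser's internal structures: for PDF/DOCX the DOM-like representation Nick describes still has to fit in the heap, which is the part that would need parser-level changes.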
Re: Tika fails to extract text from very large files
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 16 May 2012, Alec Swan wrote:
> Our tests indicate that while Tika can extract text from average-sized files,
> it fails to extract text from large files of certain types. In our
> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and 113
> MB PDF files. However, it extracted the right text from a 94 MB TXT file.
Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
which can only be parsed by building a DOM-like structure in memory, so
they need more memory available to them. XLS/XLSX, amongst a few others,
can be done in a largely streaming manner, so they have a lower footprint.
(It all depends on how the file format is laid out internally.)
Nick