Posted to users@pdfbox.apache.org by Daniel Wilson <wi...@gmail.com> on 2010/03/08 20:49:55 UTC

Re: Problem with PDF to text conversion

>>fix the decoder to properly read all PDFs in the universe

Man, if you can submit the patch that does that ... you deserve to be
knighted!

Failing to raise a truly deadly exception is a problem.  There are cases in
the rendering area, though, where giving up on the part of the document that
we can't handle is deemed preferable to failing to render the whole
document.
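For reference, one possible rework of the quoted FlateFilter loop is sketched below. This is only an illustration, not the committed fix; the class name `FlateDecodeSketch` and the `decode` signature are invented here. The idea is exactly what Andreas asks for: wrap the decompression failures in an IOException for the caller, and do not catch OutOfMemoryError at all.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipException;

// Sketch of a reworked FlateFilter read loop (class and method names are
// made up for illustration): decompression failures are wrapped in an
// IOException and rethrown so the caller learns the stream was corrupt,
// and OutOfMemoryError is not caught, so it propagates as an Error should.
public class FlateDecodeSketch {
    private static final int BUFFER_SIZE = 2048;

    public static byte[] decode(InputStream decompressor, int mayRead) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[BUFFER_SIZE];
        int amountRead;
        try {
            while ((amountRead = decompressor.read(buffer, 0, Math.min(mayRead, BUFFER_SIZE))) != -1) {
                result.write(buffer, 0, amountRead);
            }
        } catch (ZipException exception) {
            // report the corrupt stream to the caller instead of hiding it
            throw new IOException("Corrupt Flate stream", exception);
        } catch (EOFException exception) {
            throw new IOException("Unexpected end of Flate stream", exception);
        }
        return result.toByteArray();
    }
}
```

Whether the caller then skips the stream or aborts is a policy decision, but at least it gets to make one.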

Daniel

On Thu, Feb 18, 2010 at 3:39 AM, Erik Scholtz, ArgonSoft GmbH <
escholtz@argonsoft.de> wrote:

> Andreas,
>
> you are right; catching the exception and not raising it to the caller is a
> problem. I would suggest to file this as a report in JIRA:
>
> https://issues.apache.org/jira/browse/PDFBOX
>
> Greetings,
> Erik
>
>
> aw@abcona.de wrote:
>
>> Hi folks!
>>
>> Sorry, this is my first posting on this mailing list and, well, errrr, I
>> had some interesting experiences with PDFBox today :-\
>>
>>
>> Well, I'm working on an Alfresco project currently, where Alfresco (a
>> document management system) employs Lucene as its full text search engine
>> and PDFBox as the converter from PDF to plain text to feed Lucene.
>> Then we realized that some 5% of our PDF documents yielded only an
>> innocent-looking message in Alfresco's log file:
>>
>> ERROR [pdfbox.filter.FlateFilter] Stop reading corrupt stream
>>
>> See http://forums.alfresco.com/en/viewtopic.php?f=8&t=24033&p=81641 for
>> the full thread.
>>
>> Some digging into the PDFBox source code yielded this piece of code:
>>
>> FlateFilter:128 ff.
>>
>>     try
>>     {
>>         // decoding not needed
>>         while ((amountRead = decompressor.read(buffer, 0,
>>                 Math.min(mayRead, BUFFER_SIZE))) != -1)
>>         {
>>             result.write(buffer, 0, amountRead);
>>         }
>>     }
>>     catch (OutOfMemoryError exception)
>>     {
>>         // if the stream is corrupt an OutOfMemoryError may occur
>>         log.error("Stop reading corrupt stream");
>>     }
>>     catch (ZipException exception)
>>     {
>>         // if the stream is corrupt an OutOfMemoryError may occur
>>         log.error("Stop reading corrupt stream");
>>     }
>>     catch (EOFException exception)
>>     {
>>         // if the stream is corrupt an OutOfMemoryError may occur
>>         log.error("Stop reading corrupt stream");
>>     }
>>
>> which I consider really bad for two reasons:
>>
>> - the failure to properly decode the PDF is hidden from the caller, so we
>> never get a hint that the document was only partially decoded. As a result,
>> we get an incomplete Lucene index!
>>
>> - the OutOfMemoryError should NEVER EVER be caught and discarded this way,
>> as it might leave my application in an unstable state. When my application
>> is out of memory, I'm busted. And at the very least, I'd like to know when
>> I'm busted ;-)
>>
>> Conclusion: if an exception occurs, report it to the caller. And even
>> better, fix the decoder to properly read all PDFs in the universe, but I
>> guess that is the harder part :-)
>>
>>
>> Cheers
>> Andreas
>>