Posted to users@pdfbox.apache.org by Chris Clark <ch...@allenai.org> on 2015/07/23 17:10:12 UTC

A few problematic PDFs

Hi all,

I have been using PDFBox 2.0 to parse a number of scholarly documents,
which has in general been working great. Version 2.0 is definitely a big
step up from 1.8.9. I ran into a couple of PDFs that PDFBox seemed to have
trouble parsing, and I wanted to run them by you to see if they could be
fixed or if I am missing something on my end. They are:

http://vortex.cs.wayne.edu/papers/Limited_precision_weights_preprint.pdf
This PDF gets parsed fine by Preview on OS X, and I can copy the text
out of Preview without a problem. pdftotext also parses this PDF
without a problem. However, when I run the TextExtractor from PDFBox 2.0 on
it, I get a lot of warnings and junk output.


http://www.cs.princeton.edu/~chongw/papers/RanganathWangBleiXing2013.pdf
Here I get an IOException when using PDFBox 2.0 (but not in 1.8.9). I
filed PDFBOX-2845 for this problem, but I realize I should have gone to the
mailing list first.

Best Regards,
Chris

Re: A few problematic PDFs

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 23.07.2015 at 18:08, Tilman Hausherr wrote:
> On 23.07.2015 at 17:10, Chris Clark wrote:
>> Hi all,
>>
>> I have been using PDFBox 2.0 to parse a number of scholarly documents,
>> which has in general been working great. Version 2.0 is definitely a big
>> step up from 1.8.9. I ran into a couple of PDFs that PDFBox seemed to have
>> trouble parsing, and I wanted to run them by you to see if they could be
>> fixed or if I am missing something on my end. They are:
>>
>> http://vortex.cs.wayne.edu/papers/Limited_precision_weights_preprint.pdf
>> This PDF gets parsed fine by Preview on OS X, and I can copy the text
>> out of Preview without a problem. pdftotext also parses this PDF
>> without a problem. However, when I run the TextExtractor from PDFBox 2.0 on
>> it, I get a lot of warnings and junk output.
>
> Adobe Reader can't extract the text either. Maybe OS X Preview is making a guess?
>
>>
>>
>> http://www.cs.princeton.edu/~chongw/papers/RanganathWangBleiXing2013.pdf
>> Here I get an IOException when using PDFBox 2.0 (but not in 1.8.9). I
>> filed PDFBOX-2845 for this problem, but I realize I should have gone to the
>> mailing list first.
>>
>
> That was OK, I saw it... there just hasn't been anyone who has volunteered to
> make a change. I did have a look at that issue at that time... it looks like
> this is a malformed PDF, and the problem looked too complex for me: it involved
> a reference between ordinary PDF objects and compressed PDF object streams. (We
> do handle many malformed PDFs, but not all.)

It looks like this is the same file that is attached to PDFBOX-2845, and it
works with the most recent trunk.

BR
Andreas

> Ask yourself, is this really important to you, i.e. do you have many such files?
> Or is this just one of many files that you tried to see what happens?
>
> Tilman
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>




Re: A few problematic PDFs

Posted by Chris Clark <ch...@allenai.org>.
The PDFs were samples from a larger corpus, but I haven't tested the
entire corpus yet. From what I can tell, IOExceptions are very rare, so
being able to handle these cases is not a big deal as far as I am
concerned. I am not sure how common the text parsing error is. It is a
bit surprising that Adobe can't extract the text but Preview and pdftotext can,
but if that is the case, I am not too worried about getting that PDF right
either. I just wanted to check in case either of these issues was due
to bugs that could be easily resolved.
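
For anyone wanting to estimate how common the junk-output case is across a
corpus, one rough approach is to score each extracted string by the share of
characters that are unlikely to appear in real text. The sketch below is only
an illustration and is not part of PDFBox: the class name JunkTextCheck, the
methods junkRatio and looksLikeJunk, and the 0.3 threshold are all
hypothetical, and it assumes the text has already been extracted into a String
by whatever tool is in use. It would flag control characters, U+FFFD, and
private-use or unassigned code points, but it cannot detect garbage that
consists of ordinary letters mapped to the wrong glyphs.

```java
final class JunkTextCheck {

    /** Returns the fraction of code points that look like extraction debris (0.0 to 1.0). */
    static double junkRatio(String text) {
        if (text.isEmpty()) {
            return 0.0; // nothing to judge; treat as clean
        }
        int total = 0;
        int junk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;

            int type = Character.getType(cp);
            // Control characters other than ordinary whitespace (tab, newline, ...)
            boolean control = Character.isISOControl(cp) && !Character.isWhitespace(cp);
            // U+FFFD is the Unicode replacement character
            boolean replacement = cp == 0xFFFD;
            // Private-use and unassigned code points rarely occur in real prose
            boolean unmapped = type == Character.PRIVATE_USE || type == Character.UNASSIGNED;

            if (control || replacement || unmapped) {
                junk++;
            }
        }
        return (double) junk / total;
    }

    /** Hypothetical cut-off: flags text whose junk share exceeds the threshold. */
    static boolean looksLikeJunk(String text, double threshold) {
        return junkRatio(text) > threshold;
    }

    public static void main(String[] args) {
        // Assumed threshold; it would need tuning against a real corpus.
        double threshold = 0.3;
        String[] samples = { "Limited precision weights", "\uFFFD\uFFFD\uFFFDa" };
        for (String s : samples) {
            System.out.println(junkRatio(s) + " junk? " + looksLikeJunk(s, threshold));
        }
    }
}
```

Counting how many documents in a sample exceed the threshold gives a first
estimate of how widespread the problem is, which can then be spot-checked by
hand.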

Thanks,
Chris


Re: A few problematic PDFs

Posted by Tilman Hausherr <TH...@t-online.de>.
On 23.07.2015 at 17:10, Chris Clark wrote:
> Hi all,
>
> I have been using PDFBox 2.0 to parse a number of scholarly documents,
> which has in general been working great. Version 2.0 is definitely a big
> step up from 1.8.9. I ran into a couple of PDFs that PDFBox seemed to have
> trouble parsing, and I wanted to run them by you to see if they could be
> fixed or if I am missing something on my end. They are:
>
> http://vortex.cs.wayne.edu/papers/Limited_precision_weights_preprint.pdf
> This PDF gets parsed fine by Preview on OS X, and I can copy the text
> out of Preview without a problem. pdftotext also parses this PDF
> without a problem. However, when I run the TextExtractor from PDFBox 2.0 on
> it, I get a lot of warnings and junk output.

Adobe Reader can't extract the text either. Maybe OS X Preview is making
a guess?

>
>
> http://www.cs.princeton.edu/~chongw/papers/RanganathWangBleiXing2013.pdf
> Here I get an IOException when using PDFBox 2.0 (but not in 1.8.9). I
> filed PDFBOX-2845 for this problem, but I realize I should have gone to the
> mailing list first.
>

That was OK, I saw it... there just hasn't been anyone who has
volunteered to make a change. I did have a look at that issue at that
time... it looks like this is a malformed PDF, and the problem looked
too complex for me: it involved a reference between ordinary PDF objects
and compressed PDF object streams. (We do handle many malformed PDFs,
but not all.)

Ask yourself, is this really important to you, i.e. do you have many
such files? Or is this just one of many files that you tried to see what
happens?

Tilman


