Posted to users@pdfbox.apache.org by Dirk Groeneveld <di...@allenai.org> on 2016/08/01 17:59:40 UTC

Infinite loop while parsing

I found a PDF that causes PDFBox to go into an infinite loop. I attached it to this email. The problem is easy to reproduce.

I tried it on version 2.0.0 and 2.0.2.

Cheers!

Dirk

Re: Infinite loop while parsing

Posted by Tilman Hausherr <TH...@t-online.de>.
On 01.08.2016 at 21:28, Dirk Groeneveld wrote:
> The file is public. GDrive is just being difficult. Here is a Dropbox 
> link instead: 
> https://www.dropbox.com/s/8thckx5crdc15ml/bb3ddd9a7de5aa494cd5611128e433ea8791c569.pdf?dl=0
>


No problem, I did get the file. I've opened issue
https://issues.apache.org/jira/browse/PDFBOX-3446

Tilman


> I had a feeling the file might be corrupt. We’re processing over 6M 
> PDFs with this, so we’re bound to find some edge cases.
>
> Dirk
>
> On August 1, 2016 at 11:48:51, Tilman Hausherr (thausherr@t-online.de 
> <ma...@t-online.de>) wrote:
>
>> On 01.08.2016 at 20:20, Dirk Groeneveld wrote:
>> > 
>> https://drive.google.com/a/allenai.org/file/d/0BxI7RAiTuio0a1k2amhoa1kxS1U/view?usp=sharing
>> >
>> > I hope that works?
>>
>> Yes, although it requires authorization. Is the file public or not?
>>
>> >
>> > There are actually two concerns. Clearly it should not go into an 
>> infinite loop, so that’s concern one. But even if it does, it would 
>> be good if the thread was interruptible. It might already be. I have 
>> not tried that yet.
>>
>> It isn't interruptible... Your file is corrupt, it has this:
>>
>> 0000497410 00000 n
>> 0000497457 00000 n
>> 0000497532 00000 n
>> 0000497579 00000 n
>> 0000497654 00000 n
>> 0000497701 00000 ¶ñw%–CÞ—ò.þ=^VPƒ»y2+‰6Aºo;-Ó›^€úrhf-d„lÍ£YYD
>> lƒ}j¶xïÊÞúÊÿ\ü¡ËnP^P–ÜÓ(W=ÊÚò¶enIxGúiº9pÉN‘Á¿¶èˆ> ×À+sJ´ç7à
>> <æ£Ùm/
>>
>> of course it shouldn't loop forever.
>>
>> Tilman
>>
>> >
>> > Cheers!
>> >
>> > On August 1, 2016 at 11:07:09, Tilman Hausherr 
>> (thausherr@t-online.de) wrote:
>> >
>> > On 01.08.2016 at 19:59, Dirk Groeneveld wrote:
>> >> I found a PDF that causes PDFBox to go into an infinite loop. I
>> >> attached it to this email. The problem is easy to reproduce.
>> > PDF Attachments are not allowed, please upload your file somewhere.
>> >
>> > Tilman
>> >
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>


Re: Infinite loop while parsing

Posted by Dirk Groeneveld <di...@allenai.org>.
The file is public. GDrive is just being difficult. Here is a Dropbox link instead: https://www.dropbox.com/s/8thckx5crdc15ml/bb3ddd9a7de5aa494cd5611128e433ea8791c569.pdf?dl=0

I had a feeling the file might be corrupt. We’re processing over 6M PDFs with this, so we’re bound to find some edge cases.

Dirk

On August 1, 2016 at 11:48:51, Tilman Hausherr (thausherr@t-online.de) wrote:

On 01.08.2016 at 20:20, Dirk Groeneveld wrote:  
> https://drive.google.com/a/allenai.org/file/d/0BxI7RAiTuio0a1k2amhoa1kxS1U/view?usp=sharing  
>  
> I hope that works?  

Yes, although it requires authorization. Is the file public or not?  

>  
> There are actually two concerns. Clearly it should not go into an infinite loop, so that’s concern one. But even if it does, it would be good if the thread was interruptible. It might already be. I have not tried that yet.  

It isn't interruptible... Your file is corrupt, it has this:  

0000497410 00000 n  
0000497457 00000 n  
0000497532 00000 n  
0000497579 00000 n  
0000497654 00000 n  
0000497701 00000 ¶ñw%–CÞ—ò.þ=^VPƒ»y2+‰6Aºo;-Ó›^€úrhf-d„lÍ£YYD  
lƒ}j¶xïÊÞúÊÿ\ü¡ËnP^P–ÜÓ(W=ÊÚò¶enIxGúiº9pÉN‘Á¿¶èˆ> ×À+sJ´ç7à  
<æ£Ùm/  

of course it shouldn't loop forever.  

Tilman  

>  
> Cheers!  
>  
> On August 1, 2016 at 11:07:09, Tilman Hausherr (thausherr@t-online.de) wrote:  
>  
> On 01.08.2016 at 19:59, Dirk Groeneveld wrote:  
>> I found a PDF that causes PDFBox to go into an infinite loop. I  
>> attached it to this email. The problem is easy to reproduce.  
> PDF Attachments are not allowed, please upload your file somewhere.  
>  
> Tilman  
>  
>  
>  




Re: Infinite loop while parsing

Posted by Tilman Hausherr <TH...@t-online.de>.
On 01.08.2016 at 20:20, Dirk Groeneveld wrote:
> https://drive.google.com/a/allenai.org/file/d/0BxI7RAiTuio0a1k2amhoa1kxS1U/view?usp=sharing
>
> I hope that works?

Yes, although it requires authorization. Is the file public or not?

>
> There are actually two concerns. Clearly it should not go into an infinite loop, so that’s concern one. But even if it does, it would be good if the thread was interruptible. It might already be. I have not tried that yet.

It isn't interruptible... Your file is corrupt, it has this:

0000497410 00000 n
0000497457 00000 n
0000497532 00000 n
0000497579 00000 n
0000497654 00000 n
0000497701 00000 ¶ñw%–CÞ—ò.þ=^VPƒ»y2+‰6Aºo;-Ó›^€úrhf-d„lÍ£YYD
lƒ}j¶xïÊÞúÊÿ\ü¡ËnP^P–ÜÓ(W=ÊÚò¶enIxGúiº9pÉN‘Á¿¶èˆ> ×À+sJ´ç7à
<æ£Ùm/

of course it shouldn't loop forever.
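
For anyone who wants to pre-screen files: each entry in a classic PDF cross-reference table is a fixed-width record (a 10-digit byte offset, a space, a 5-digit generation number, a space, then `n` or `f`), which is why the last entry above is recognizably broken. The following is only a rough sketch of such a check. The class and method names are made up for illustration, and this is not how PDFBox itself parses or repairs the table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class XrefEntryCheck {
    // 10-digit offset, space, 5-digit generation, space, 'n' (in use) or 'f' (free).
    private static final Pattern ENTRY = Pattern.compile("\\d{10} \\d{5} [nf]");

    // True if the line, ignoring surrounding whitespace, is a well-formed entry.
    static boolean isValidEntry(String line) {
        return ENTRY.matcher(line.trim()).matches();
    }

    // Index of the first malformed entry, or -1 if every entry is well formed.
    static int firstInvalid(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (!isValidEntry(lines.get(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
                "0000497410 00000 n",
                "0000497457 00000 n",
                "0000497532 00000 n",
                "0000497579 00000 n",
                "0000497654 00000 n",
                // Stand-in for the binary garbage shown in the excerpt.
                "0000497701 00000 \u00b6\u00f1w%");
        System.out.println("first invalid entry index: " + firstInvalid(entries));
    }
}
```

A check like this only flags the damage. It does not decide whether the rest of the file can still be recovered.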

Tilman

>
> Cheers!
>
> On August 1, 2016 at 11:07:09, Tilman Hausherr (thausherr@t-online.de) wrote:
>
> On 01.08.2016 at 19:59, Dirk Groeneveld wrote:
>> I found a PDF that causes PDFBox to go into an infinite loop. I
>> attached it to this email. The problem is easy to reproduce.
> PDF Attachments are not allowed, please upload your file somewhere.
>
> Tilman
>
>
>




Re: Infinite loop while parsing

Posted by Dirk Groeneveld <di...@allenai.org>.
https://drive.google.com/a/allenai.org/file/d/0BxI7RAiTuio0a1k2amhoa1kxS1U/view?usp=sharing

I hope that works?

There are actually two concerns. Clearly it should not go into an infinite loop, so that’s concern one. But even if it does, it would be good if the thread was interruptible. It might already be. I have not tried that yet.
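
To make the second concern concrete, here is a rough sketch of bounding the parse time from the caller's side. It assumes the worst case, that the parsing code never reacts to `Thread.interrupt()`. `BoundedParse` and `callWithTimeout` are names invented for this example. In real use the task would wrap something like `PDDocument.load(file)`, but a stand-in loop is used here so the snippet runs without PDFBox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox.
class BoundedParse {
    // Runs the task on a daemon thread and gives up waiting after the timeout.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) only requests an interrupt; code that ignores
            // interrupts keeps running in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that loops forever and swallows interrupts.
        Callable<String> stuckParse = () -> {
            while (true) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException ignored) {
                    // deliberately ignored to simulate non-interruptible code
                }
            }
        };
        try {
            callWithTimeout(stuckParse, 500, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("gave up on this file and moved on");
        }
    }
}
```

The caller gets control back, but the stuck thread is abandoned rather than stopped and keeps using CPU and memory. For a batch of millions of files, running each parse in a separate process that can be killed is probably the more robust choice.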

Cheers!

On August 1, 2016 at 11:07:09, Tilman Hausherr (thausherr@t-online.de) wrote:

On 01.08.2016 at 19:59, Dirk Groeneveld wrote:  
> I found a PDF that causes PDFBox to go into an infinite loop. I  
> attached it to this email. The problem is easy to reproduce.  

PDF Attachments are not allowed, please upload your file somewhere.  

Tilman  



Re: Infinite loop while parsing

Posted by Tilman Hausherr <TH...@t-online.de>.
On 01.08.2016 at 19:59, Dirk Groeneveld wrote:
> I found a PDF that causes PDFBox to go into an infinite loop. I 
> attached it to this email. The problem is easy to reproduce.

PDF Attachments are not allowed, please upload your file somewhere.

Tilman