Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/07/10 13:57:05 UTC

first stack trace report from pdfbox 2.0.0 trunk

All,
  I just posted the first stack trace report from my initial partial batch run against govdocs1 here: https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

Caveats/Notes

The run yesterday did not include the fixes that were made in PDFBOX-2370 or PDFBOX-2862.

I stopped the batch run early, so it covered only ~50k PDFs.

I forgot to turn on access permission checking, so some of the PDFs in this run would normally have been skipped.

I haven't reviewed any of the exceptions. They may be caused by code on the Tika side.

I plan to re-run with the latest trunk on Tuesday. I need to turn back to the actual eval code for a bit. :)


Cheers,

          Tim



RE: first stack trace report from pdfbox 2.0.0 trunk

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you!

I think I'll wait until PDFBOX-2883 is resolved if that's ok.  That looks major.

-----Original Message-----
From: John Hewson [mailto:john@jahewson.com] 
Sent: Tuesday, July 14, 2015 8:34 PM
To: dev@pdfbox.apache.org
Subject: Re: first stack trace report from pdfbox 2.0.0 trunk


> On 14 Jul 2015, at 13:49, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 14.07.2015 at 22:35, John Hewson wrote:
>>> On 14 Jul 2015, at 13:20, Tilman Hausherr <TH...@t-online.de> wrote:
>>> 
>>> On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
>>>> Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you running your own regression testing against govdocs1?
>>> Yes, from time to time for the last few months.
>>> 
>>>> Is it duplicated effort for me to do anything with 2.0.0?
>>> Partly yes. The only difference is that I didn't do any text extraction.
>>> 
>>>> Or, is your point that I should wait until PDFBOX-2842 is completed?
>>> Yes :-)
>> Good news, PDFBOX-2842 is now complete.
> 
> No, the 029423 file is still throwing an exception :-(
> 

Ok, I've just fixed this, hopefully it works.

- John



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: first stack trace report from pdfbox 2.0.0 trunk

Posted by John Hewson <jo...@jahewson.com>.
> On 14 Jul 2015, at 13:49, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 14.07.2015 at 22:35, John Hewson wrote:
>>> On 14 Jul 2015, at 13:20, Tilman Hausherr <TH...@t-online.de> wrote:
>>> 
>>> On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
>>>> Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you running your own regression testing against govdocs1?
>>> Yes, from time to time for the last few months.
>>> 
>>>> Is it duplicated effort for me to do anything with 2.0.0?
>>> Partly yes. The only difference is that I didn't do any text extraction.
>>> 
>>>> Or, is your point that I should wait until PDFBOX-2842 is completed?
>>> Yes :-)
>> Good news, PDFBOX-2842 is now complete.
> 
> No, the 029423 file is still throwing an exception :-(
> 

Ok, I’ve just fixed this, hopefully it works.

— John



Re: first stack trace report from pdfbox 2.0.0 trunk

Posted by Tilman Hausherr <TH...@t-online.de>.
On 14.07.2015 at 22:35, John Hewson wrote:
>> On 14 Jul 2015, at 13:20, Tilman Hausherr <TH...@t-online.de> wrote:
>>
>> On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
>>> Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you running your own regression testing against govdocs1?
>> Yes, from time to time for the last few months.
>>
>>> Is it duplicated effort for me to do anything with 2.0.0?
>> Partly yes. The only difference is that I didn't do any text extraction.
>>
>>> Or, is your point that I should wait until PDFBOX-2842 is completed?
>> Yes :-)
> Good news, PDFBOX-2842 is now complete.

No, the 029423 file is still throwing an exception :-(

Tilman




Re: first stack trace report from pdfbox 2.0.0 trunk

Posted by John Hewson <jo...@jahewson.com>.
> On 14 Jul 2015, at 13:20, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
>> Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you running your own regression testing against govdocs1?
> 
> Yes, from time to time for the last few months.
> 
>> Is it duplicated effort for me to do anything with 2.0.0?
> Partly yes. The only difference is that I didn't do any text extraction.
> 
>> Or, is your point that I should wait until PDFBOX-2842 is completed?
> 
> Yes :-)

Good news, PDFBOX-2842 is now complete.

— John


Re: first stack trace report from pdfbox 2.0.0 trunk

Posted by Tilman Hausherr <TH...@t-online.de>.
On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
> Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you running your own regression testing against govdocs1?

Yes, from time to time for the last few months.

> Is it duplicated effort for me to do anything with 2.0.0?
Partly yes. The only difference is that I didn't do any text extraction.

> Or, is your point that I should wait until PDFBOX-2842 is completed?

Yes :-)

Tilman



RE: first stack trace report from pdfbox 2.0.0 trunk

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you running your own regression testing against govdocs1?  Is it duplicated effort for me to do anything with 2.0.0?  Or, is your point that I should wait until PDFBOX-2842 is completed?
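
For anyone matching the bare file ids used elsewhere in this thread (e.g. 029423) to paths in the report: the three paths above suggest that each PDF sits in a directory named after the first three digits of its six-digit id. A minimal sketch, assuming that layout holds for the whole corpus, which this thread does not confirm; the function name is hypothetical:

```python
def govdocs1_relative_path(file_id: str) -> str:
    """Map a six-digit file id to a relative path such as '029/029423.pdf'.

    Assumption: the directory name is the first three digits of the id,
    inferred only from the three paths quoted in this thread.
    """
    digits = file_id.strip()
    if len(digits) != 6 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected a six-digit id, got {file_id!r}")
    # Leading zeros are significant, so the id is handled as a string.
    return f"{digits[:3]}/{digits}.pdf"


if __name__ == "__main__":
    for file_id in ("781172", "490376", "029423"):
        print(govdocs1_relative_path(file_id))
```

Keeping the id as a string rather than an integer preserves the leading zero in ids like 029423.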

Thank you!

Best,

          Tim
-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, July 14, 2015 12:47 PM
To: dev@pdfbox.apache.org
Subject: Re: first stack trace report from pdfbox 2.0.0 trunk

Hi Tim,

Currently there is at least one known regression, mentioned in 
PDFBOX-2842, it applies to 029423 but also to other files.

Tilman



Re: first stack trace report from pdfbox 2.0.0 trunk

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi Tim,

Currently there is at least one known regression, mentioned in 
PDFBOX-2842, it applies to 029423 but also to other files.

Tilman
