Posted to dev@pdfbox.apache.org by Andreas Lehmkühler <an...@lehmi.de> on 2017/05/02 10:42:02 UTC

2.0.6 release ?

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any objections?

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
> Hi,
>
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any objections?

I'm always "+1" for new releases.

Tilman



Re: 2.0.6 release ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
> Hi,
> 
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any objections?
I've added 2.0.7 as a new version in JIRA.

Andreas

> 
> Andreas
> 


Re: 2.0.6 release ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
> On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
>> Hi,
>>
>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any 
>> objections?
> I'm targeting the 15th or 16th
As discussed on private@ I'm going to cut the release on Friday the 12th.

Andreas

> 
> Andreas
> 
>>
>> Andreas
>>


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Added a page count comparison report under "content/":

http://162.242.228.174/reports/reports_pdfbox_2_0_6c.tar.gz

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Tuesday, May 9, 2017 2:39 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path.  This is the length in bytes of the container file (as opposed to the embedded file).

Thank you!

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman




RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path.  This is the length in bytes of the container file (as opposed to the embedded file).

Thank you!

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman




RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y.  Will do.  Meetings beckon, so it will take a few hours. :(

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman




Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table 
that is about individual files. I think it was there in the past, but I 
may be wrong.

Reason: I try to get small files to keep any "examples" for my 
regression tests.

Tilman




Re: 2.0.6 release ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 20.05.2017 at 16:17, Tilman Hausherr wrote:
> On 12.05.2017 at 15:23, Allison, Timothy B. wrote:
>> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz
>>
>> Looks good to me on a very cursory look.
> 
> IMO there are two files that could be investigated:
> 
> 5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL - exception (was mentioned before)
> 
> APXBMVTEZIJCL7VYUN3KFSXLNETDMIKC - the first page is empty, but wasn't in the 
> previous version.

Please create 2 tickets so that those can't get lost. I forgot to do so for the 
first one :-(

Andreas
> 
> Tilman
> 
> 
> 


Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 12.05.2017 at 15:23, Allison, Timothy B. wrote:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz
>
> Looks good to me on a very cursory look.

IMO there are two files that could be investigated:

5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL - exception (was mentioned before)

APXBMVTEZIJCL7VYUN3KFSXLNETDMIKC - the first page is empty, but wasn't 
in the previous version.

Tilman





RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz

Looks good to me on a very cursory look.



Re: 2.0.6 release ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 11.05.2017 at 14:08, Allison, Timothy B. wrote:
>> It isn't that secret as Tim posted it somewhere in this thread
> 
> :)
> 
> I've added throttling to httpd (I think) so we should be ok, and y, the address is out in the open now.
> 
> Let me know if I should kick off another run.
Yes, one final run please.


Just a friendly reminder, I'm going to cut the 2.0.6 release in about 9 hours 
from now.

Andreas

> 
> Thank you, all!
> 
> 


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> It isn't that secret as Tim posted it somewhere in this thread

:)

I've added throttling to httpd (I think) so we should be ok, and y, the address is out in the open now.

Let me know if I should kick off another run.

Thank you, all!




Re: 2.0.6 release ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 10.05.2017 at 17:12, Tilman Hausherr wrote:
> Thanks for the test... the sum is still negative, but if we'd ignore the 
> truncated files I bet we'd be positive.
> 
> I have downloaded a few of the regressions but won't create issues this time as 
> yesterday's turned out to be duplicates, I'll wait for Andreas next commit and 
> will create issues only if these aren't solved.
I guess the new exceptions aren't related. I've already created an issue for the 
first one, PDFBOX-3788.
I didn't have a chance to look at the second file. I just tested my fix for the 
first one and it still fails.

> @Andreas - ping me if you didn't keep the "secret" URL.
It isn't that secret as Tim posted it somewhere in this thread ...

> 
> Some misc thoughts...
> 
> 039800.pdf: "refinery's" is a different token than refinery. Shouldn't 
> "refinery's" be three tokens? I mention this because refinery is probably in a 
> dictionary.
> 
> Some differences are because of a different treatment of the space in bad fonts. 
> Some were improved, and some now look like this "C I T I E S W I T H O U T D R U 
> G S". There is an open issue about these. It is tricky because if we treat these 
> like 1 word, we'd also lose spaces where we don't want.
> 
> commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z I can't find. I used 
> http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z
> 
> Tilman
> 
> On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
>> Haven't had a chance to look. Reports are here:
>> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
>>


Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Thanks for the test... the sum is still negative, but if we'd ignore the 
truncated files I bet we'd be positive.

I have downloaded a few of the regressions but won't create issues this 
time, as yesterday's turned out to be duplicates. I'll wait for Andreas' 
next commit and will create issues only if these aren't solved.
@Andreas - ping me if you didn't keep the "secret" URL.

Some misc thoughts...

039800.pdf: "refinery's" is a different token than refinery. Shouldn't 
"refinery's" be three tokens? I mention this because refinery is 
probably in a dictionary.

Some differences are because of a different treatment of the space in 
bad fonts. Some were improved, and some now look like this "C I T I E S 
W I T H O U T D R U G S". There is an open issue about these. It is 
tricky because if we treat these like 1 word, we'd also lose spaces 
where we don't want to.
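
To make the trade-off concrete, here is a minimal sketch of a naive post-processing heuristic that collapses runs of single letters separated by single spaces. It is an assumption for illustration only: the class and method names are hypothetical, it works on already extracted strings, and it is not how PDFBox decides where to put spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpacedLetterSketch {
    // Hypothetical heuristic: three or more single letters, each separated
    // from the next by exactly one space, are treated as one run.
    private static final Pattern SPACED_RUN =
            Pattern.compile("\\b\\p{L}(?: \\p{L}\\b){2,}");

    /** Removes the spaces inside every run matched by SPACED_RUN. */
    static String collapseSpacedLetters(String text) {
        Matcher m = SPACED_RUN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String joined = m.group().replace(" ", "");
            m.appendReplacement(out, Matcher.quoteReplacement(joined));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The heading from the report: all spaces go, including the ones
        // that separated the three original words.
        System.out.println(collapseSpacedLetters("C I T I E S W I T H O U T D R U G S"));
        // Ordinary text with a single-letter word is left alone.
        System.out.println(collapseSpacedLetters("in a box"));
    }
}
```

The first call joins the whole heading into a single token, which is exactly the loss of wanted spaces mentioned above; a string-level rule like this cannot tell letter spacing from word spacing.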

commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z I can't find. I used 
http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z

Tilman

On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
> Haven't had a chance to look. Reports are here:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
>


RE: 2.0.6 release ?

Posted by Andreas Lehmkühler <an...@lehmi.de>.
> "Allison, Timothy B." <ta...@mitre.org> wrote on 10 May 2017 at 11:42:
> 
> 
> Haven't had a chance to look. Reports are here:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
Thanks for running the report again.

I had a quick look and there are 2 new exceptions. It seems to be a regression. I'm going to dig deeper later when I'm back home.

Here are 2 sample PDFs, one for each exception:
commoncrawl2/YV/YVFDWHF767TEYTT7IVFSLUIJTDF3YP57
commoncrawl2/5W/5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL

Andreas

> 


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Haven't had a chance to look. Reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz

RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
I won't have results immediately.  :)

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, May 9, 2017 4:13 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 22:03, Allison, Timothy B. wrote:
> UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...
>
> <face_palm/>
>
> Off we go?

Yes! However it's 10pm here, so I won't be able to react to the results immediately.

Tilman

>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, May 9, 2017 3:49 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> You caught me... I haven't checked these yet.
>
> But I did now, with
> MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
> 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
> IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
> but they don't throw an NPE anymore now.
>
> Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?
>
> Tilman
>
> On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
>> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?
>>
>>       /**
>>        * Returns true if the node is a page tree node (i.e. and intermediate).
>>        */
>>       private boolean isPageTreeNode(COSDictionary node )
>>       {
>>           // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>>           // to check for the presence of Kids too
>>           return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>>                  node.containsKey(COSName.KIDS);
>>       }
>>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Tuesday, May 9, 2017 3:20 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.6 release ?
>>
>> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>>> I've fixed all remaining regression tickets (in the end it was 
>>>> exactly 1)
>>> Great!  Thank you!
>>>
>>> Let me know when I should kick off another eval.
>> Yes, please do.
>>
>> Thanks
>>
>> Tilman
>>



Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 09.05.2017 at 22:03, Allison, Timothy B. wrote:
> UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...
>
> <face_palm/>
>
> Off we go?

Yes! However it's 10pm here, so I won't be able to react to the results 
immediately.

Tilman

>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, May 9, 2017 3:49 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> You caught me... I haven't checked these yet.
>
> But I did now, with
> MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
> 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
> IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
> but they don't throw an NPE anymore now.
>
> Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?
>
> Tilman
>
> On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
>> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?
>>
>>       /**
>>        * Returns true if the node is a page tree node (i.e. and intermediate).
>>        */
>>       private boolean isPageTreeNode(COSDictionary node )
>>       {
>>           // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>>           // to check for the presence of Kids too
>>           return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>>                  node.containsKey(COSName.KIDS);
>>       }
>>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Tuesday, May 9, 2017 3:20 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.6 release ?
>>
>> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>>> I've fixed all remaining regression tickets (in the end it was
>>>> exactly 1)
>>> Great!  Thank you!
>>>
>>> Let me know when I should kick off another eval.
>> Yes, please do.
>>
>> Thanks
>>
>> Tilman
>>


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...

<face_palm/>

Off we go?


-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?

Tilman
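
For reference, a null-tolerant variant of the isPageTreeNode check quoted in this thread could look like the sketch below. To keep it self-contained, it uses a plain Map and string keys as hypothetical stand-ins for COSDictionary and the COSName constants. It is not the PDFBox code, and whether the parameter can actually be null is still the open question here.

```java
import java.util.HashMap;
import java.util.Map;

final class PageTreeNodeSketch {
    // Hypothetical stand-ins for COSName.TYPE, COSName.KIDS and COSName.PAGES.
    static final String TYPE = "Type";
    static final String KIDS = "Kids";
    static final String PAGES = "Pages";

    /**
     * Returns true if the node looks like an intermediate page tree node.
     * A null node is reported as "not a page tree node" instead of
     * triggering a NullPointerException.
     */
    static boolean isPageTreeNode(Map<String, Object> node) {
        if (node == null) {
            return false;
        }
        // Same logic as the quoted method: Type is Pages, or Kids is present.
        return PAGES.equals(node.get(TYPE)) || node.containsKey(KIDS);
    }

    public static void main(String[] args) {
        Map<String, Object> intermediate = new HashMap<>();
        intermediate.put(KIDS, new Object());

        Map<String, Object> leaf = new HashMap<>();
        leaf.put(TYPE, "Page");

        System.out.println(isPageTreeNode(null));
        System.out.println(isPageTreeNode(intermediate));
        System.out.println(isPageTreeNode(leaf));
    }
}
```

Returning false only hides the NPE at this point. A caller that then treats the missing node as a leaf page may still fail later, so the guard would need to be checked against the files that triggered the 66 exceptions.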

On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?
>
>      /**
>       * Returns true if the node is a page tree node (i.e. and intermediate).
>       */
>      private boolean isPageTreeNode(COSDictionary node )
>      {
>          // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>          // to check for the presence of Kids too
>          return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>                 node.containsKey(COSName.KIDS);
>      }
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman
>

RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
With lots of empty pages...

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Tuesday, May 9, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

Doh.  AR can't open it.  Sorry.  Chrome appears to be able to open it.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Tuesday, May 9, 2017 3:56 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K

throws NPE and opens without complaint in AR.

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?

Tilman

On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?
>
>      /**
>       * Returns true if the node is a page tree node (i.e. and intermediate).
>       */
>      private boolean isPageTreeNode(COSDictionary node )
>      {
>          // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>          // to check for the presence of Kids too
>          return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>                 node.containsKey(COSName.KIDS);
>      }
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman
>

RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Doh.  AR can't open it.  Sorry.  Chrome appears to be able to open it.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Tuesday, May 9, 2017 3:56 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K

throws NPE and opens without complaint in AR.

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?

Tilman

On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?
>
>      /**
>       * Returns true if the node is a page tree node (i.e. and intermediate).
>       */
>      private boolean isPageTreeNode(COSDictionary node )
>      {
>          // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>          // to check for the presence of Kids too
>          return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>                 node.containsKey(COSName.KIDS);
>      }
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman
>

RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K

throws NPE and opens without complaint in AR.

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?

Tilman

Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?
>
>      /**
>       * Returns true if the node is a page tree node (i.e. and intermediate).
>       */
>      private boolean isPageTreeNode(COSDictionary node )
>      {
>          // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>          // to check for the presence of Kids too
>          return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>                 node.containsKey(COSName.KIDS);
>      }
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org



Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been 
there for months and I forgot to make an issue. But after removing it, 
it still works with the three files... so the question is, can this 
parameter ever be null, or not?

Tilman

Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?
>
>      /**
>       * Returns true if the node is a page tree node (i.e. and intermediate).
>       */
>      private boolean isPageTreeNode(COSDictionary node )
>      {
>          // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>          // to check for the presence of Kids too
>          return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>                 node.containsKey(COSName.KIDS);
>      }
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>>> I've fixed all remaining regression tickets (in the end it was
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)?  Has this been fixed, or would that cause unintended problems?

    /**
     * Returns true if the node is a page tree node (i.e. and intermediate).
     */
    private boolean isPageTreeNode(COSDictionary node )
    {
        // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
        // to check for the presence of Kids too
        return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
               node.containsKey(COSName.KIDS);
    }
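
A minimal sketch of the null guard asked about above. To keep it self-contained it does not use the PDFBox classes: a plain Map stands in for COSDictionary, and the string keys "Type", "Kids" and "Pages" stand in for the COSName constants. Those stand-ins and the class name PageTreeNodeSketch are assumptions for illustration only. Whether a null node can actually reach this method in PDFBox is the open question in this thread, and the sketch does not answer it.

```java
import java.util.Map;

// Illustration only: Map<String, Object> is a stand-in for a PDF dictionary,
// not the PDFBox COSDictionary class.
final class PageTreeNodeSketch
{
    static final String TYPE = "Type";
    static final String KIDS = "Kids";
    static final String PAGES = "Pages";

    // Null-safe variant: a missing node is reported as "not a page tree node"
    // instead of triggering a NullPointerException.
    static boolean isPageTreeNode(Map<String, Object> node)
    {
        if (node == null)
        {
            return false;
        }
        // Same two conditions as the quoted method: Type is Pages, or Kids is present.
        return PAGES.equals(node.get(TYPE)) || node.containsKey(KIDS);
    }
}
```

With the guard, the caller would treat a null node like a leaf and skip it. Whether that hides a real parsing problem would need checking against the failing files.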

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>> I've fixed all remaining regression tickets (in the end it was 
>> exactly 1)
> Great!  Thank you!
>
> Let me know when I should kick off another eval.


Yes, please do.

Thanks

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org



Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>> I've fixed all remaining regression tickets (in the end it was exactly 1)
> Great!  Thank you!
>
> Let me know when I should kick off another eval.


Yes, please do.

Thanks

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
>I've fixed all remaining regression tickets (in the end it was exactly 1)

Great!  Thank you!

Let me know when I should kick off another eval.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: 2.0.6 release ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 09.05.2017 um 19:52 schrieb Tilman Hausherr:
> Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.:
>> Content
>>
>> 1)  To get a _general_ sense of overall content extract, see
>> "content/common_token_comparisons_by_mime.xlsx".  This suggests that we've
>> lost 248k "common words"[1], which out of 2.6 billion isn't much.  However,
>> we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5
>> (Tika 1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to
>> an improvement.
>>
>> 2)  If you want to compare content whether or not there was a parse
>> exception, see "content/content_diffs_with_exceptions.xlsx".
>>
>> 3) If you only want to see content diffs where both extracts did not have an 
>> exception, see "content/content_diffs_ignore_exceptions.xlsx".
>>
>> To make quick sense of the content_diffs_files, sort 
>> "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files 
>> lost the most common tokens.
>>
>> To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, 
>> which compare the number of unique tokens/tokens in common...a low number 
>> means little similarity, while a number close to 1.0 means that the unigrams 
>> are nearly identical.
>>
>>
>>  From a quick look, many of the files with fewer common words are in the
>> "likely_broken" and/or "truncated" subdirectories...  Some exceptions to this
>> rule include the following, but there are more...and overall, there is a fair 
>> amount of loss from 2.0.3.
>>
>> govdocs1/202/202097.pdf
>> govdocs1/358/358043.pdf
>> commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
>> commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56
> 
> Thanks for the test... three of these four have been fixed; this was yet another
> problem with recognizing the end of inline images. All were created by "Leadtools".
> The fourth (202097.pdf) is in issue PDFBOX-3785.
> 
> Most issues are probably related to truncated files. Some of these do not even 
> display with Adobe Reader.
I've fixed all remaining regression tickets (in the end it was exactly 1)

@Tim Thanks for running the comparison
@Tilman Thanks for analyzing

Andreas


> 
> Tilman
> 
> 
> 
>>
>> [1] For this version of tika-eval, I expanded Tilman's initial recommendation 
>> of common words for English a bit.  I took the top 20k most common words (4 
>> characters or more, except for CJK) for a large number of Wikipedia dumps.  I 
>> removed common html markup words (body, form, table) so that failure to strip 
>> html doesn't incorrectly boost scores.
>>
>>   We apply language id and then use the common words for that language.  For 
>> example, for 
>> truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW
>>
>> * PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 
>> tokens from the French list of common words.
>> * PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there 
>> were 320 common words from the English list of common words.
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Monday, May 8, 2017 10:01 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.6 release ?
>>
>> Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
>>> Happy to.  Will kick off now?
>> Yes
>>
>> Tilman
>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>> Sent: Saturday, May 6, 2017 10:02 AM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: 2.0.6 release ?
>>>
>>> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>>>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>>>> Hi,
>>>>>
>>>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>>>>> any objections?
>>>> I'm targeting the 15th or 16th
>>> Tim, could you please run your tests when time allows?
>>>
>>> Thanks
>>>
>>> Tilman
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For
>>> additional commands, e-mail: dev-help@pdfbox.apache.org
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For
>>> additional commands, e-mail: dev-help@pdfbox.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional 
>> commands, e-mail: dev-help@pdfbox.apache.org
>>
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.:
> Content
>
> 1)  To get a _general_ sense of overall content extract, see "content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 248k "common words"[1], which out of 2.6 billion isn't much.  However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an improvement.
>
> 2)  If you want to compare content whether or not there was a parse exception, see "content/content_diffs_with_exceptions.xlsx".
>
> 3) If you only want to see content diffs where both extracts did not have an exception, see "content/content_diffs_ignore_exceptions.xlsx".
>
> To make quick sense of the content_diffs_files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens.
>
> To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common...a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical.
>
>
>  From a quick look, many of the files with fewer common words are in the "likely_broken" and/or "truncated" subdirectories...  Some exceptions to this rule include the following, but there are more...and overall, there is a fair amount of loss from 2.0.3.
>
> govdocs1/202/202097.pdf
> govdocs1/358/358043.pdf
> commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
> commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

Thanks for the test... three of these four have been fixed; this was yet
another problem with recognizing the end of inline images. All were created
by "Leadtools". The fourth (202097.pdf) is in issue PDFBOX-3785.

Most issues are probably related to truncated files. Some of these do 
not even display with Adobe Reader.

Tilman



>
> [1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit.  I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps.  I removed common html markup words (body, form, table) so that failure to strip html doesn't incorrectly boost scores.
>
>   We apply language id and then use the common words for that language.  For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW
>
> * PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words.
> * PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 common words from the English list of common words.
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Monday, May 8, 2017 10:01 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
>> Happy to.  Will kick off now?
> Yes
>
> Tilman
>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Saturday, May 6, 2017 10:02 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.6 release ?
>>
>> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>>> Hi,
>>>>
>>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>>>> any objections?
>>> I'm targeting the 15th or 16th
>> Tim, could you please run your tests when time allows?
>>
>> Thanks
>>
>> Tilman
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For
>> additional commands, e-mail: dev-help@pdfbox.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For
>> additional commands, e-mail: dev-help@pdfbox.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
For the reports comparing 2.0.3 with 2.0.5, see https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14V1_15.zip 

That was a full run against all file types of Tika 1.14 vs 1.15-SNAPSHOT from April 25.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Monday, May 8, 2017 8:43 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

Content

1)  To get a _general_ sense of overall content extract, see "content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 248k "common words"[1], which out of 2.6 billion isn't much.  However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an improvement.

2)  If you want to compare content whether or not there was a parse exception, see "content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where both extracts did not have an exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs_files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens.

To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common...a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical.


From a quick look, many of the files with fewer common words are in the "likely_broken" and/or "truncated" subdirectories...  Some exceptions to this rule include the following, but there are more...and overall, there is a fair amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

[1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit.  I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps.  I removed common html markup words (body, form, table) so that failure to strip html doesn't incorrectly boost scores.

 We apply language id and then use the common words for that language.  For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 common words from the English list of common words.
-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org




RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Content

1)  To get a _general_ sense of overall content extract, see "content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 248k "common words"[1], which out of 2.6 billion isn't much.  However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an improvement.

2)  If you want to compare content whether or not there was a parse exception, see "content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where both extracts did not have an exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs_files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens.

To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common...a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical.
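
For readers who want to reproduce the two similarity scores described above, here is a rough sketch. The formulas are assumptions based on the usual definitions (Dice over unique tokens, overlap over token occurrences), not a copy of the tika-eval code, and the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: assumed formulas, not the tika-eval implementation.
final class TokenSimilaritySketch
{
    // Dice coefficient over unique tokens: 2 * |shared| / (|uniqueA| + |uniqueB|).
    static double dice(List<String> a, List<String> b)
    {
        Set<String> uniqueA = new HashSet<>(a);
        Set<String> uniqueB = new HashSet<>(b);
        if (uniqueA.isEmpty() && uniqueB.isEmpty())
        {
            return 1.0; // two empty extracts are treated as identical (arbitrary choice)
        }
        Set<String> shared = new HashSet<>(uniqueA);
        shared.retainAll(uniqueB);
        return 2.0 * shared.size() / (uniqueA.size() + uniqueB.size());
    }

    // Overlap over token occurrences: 2 * sum(min(countA, countB)) / (sizeA + sizeB).
    static double overlap(List<String> a, List<String> b)
    {
        if (a.isEmpty() && b.isEmpty())
        {
            return 1.0; // same arbitrary choice as in dice()
        }
        Map<String, Integer> countsA = counts(a);
        Map<String, Integer> countsB = counts(b);
        long shared = 0;
        for (Map.Entry<String, Integer> entry : countsA.entrySet())
        {
            Integer inB = countsB.get(entry.getKey());
            if (inB != null)
            {
                shared += Math.min(entry.getValue(), inB);
            }
        }
        return 2.0 * shared / (a.size() + b.size());
    }

    // Counts how often each token occurs.
    private static Map<String, Integer> counts(List<String> tokens)
    {
        Map<String, Integer> result = new HashMap<>();
        for (String token : tokens)
        {
            result.merge(token, 1, Integer::sum);
        }
        return result;
    }
}
```

Both scores range from 0.0 (nothing in common) to 1.0 (same tokens), which matches the reading given above.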


From a quick look, many of the files with fewer common words are in the "likely_broken" and/or "truncated" subdirectories...  Some exceptions to this rule include the following, but there are more...and overall, there is a fair amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

[1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit.  I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps.  I removed common html markup words (body, form, table) so that failure to strip html doesn't incorrectly boost scores.

 We apply language id and then use the common words for that language.  For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 common words from the English list of common words.
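
The common-word counting described in [1] can be sketched roughly as follows. This is a simplified illustration, not the tika-eval code: the language id step is left out (the caller passes in the word list for the detected language), lowercasing before lookup is an assumption, and reading "NUM_COMMON_TOKENS_DIFF_IN_B" as "count in B minus count in A" is also an assumption, chosen because sorting it in ascending order then puts the biggest losses first.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: hypothetical names, simplified logic.
final class CommonTokenSketch
{
    // Counts the tokens that appear in the common-word list of the detected language.
    static int countCommonTokens(List<String> tokens, Set<String> commonWords)
    {
        int count = 0;
        for (String token : tokens)
        {
            // Lowercasing before the lookup is an assumption of this sketch.
            if (commonWords.contains(token.toLowerCase(Locale.ROOT)))
            {
                count++;
            }
        }
        return count;
    }

    // Assumed meaning of the diff column: negative when extract B lost common words.
    static int commonTokensDiffInB(int commonTokensInA, int commonTokensInB)
    {
        return commonTokensInB - commonTokensInA;
    }
}
```

For the example above, 1580 common tokens in A and 320 in B would give a diff of -1260 under this reading, although the two counts come from different word lists (French vs. English), so the number is only a rough signal.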
-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org




RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Results here: http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz

A = 2.0.5
B = 2.0.6-SNAPSHOT from 12 hours ago.

I've only had a chance to look at the exceptions, attachments and metadata so far. 

For the new exceptions (roughly grouped by stacktrace), see "exceptions/new_exceptions_in_B_by_mime_by_stack_trace.xlsx"

For the full stack traces and triggering file paths (prepend http://162.242.228.174/docs to retrieve the source files), see "exceptions/new_excetions_in_B_details.xlsx".

For the fixed exceptions, see "exceptions/fixed_exceptions_in_B_by_mime.xlsx" and *_details.xlsx.

To confirm that the content from the "fixed exceptions" looks language-y, scan through "exceptions/contents_of_fixed_exceptions_in_B.xlsx".

There are a few handfuls of diffs in attachments and metadata, and I'll look into these.

Off to look at the contents...


-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org



Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


RE: 2.0.6 release ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Happy to.  Will kick off now?

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>> Hi,
>>
>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>> any objections?
> I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org



Re: 2.0.6 release ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>> Hi,
>>
>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>> any objections?
> I'm targeting the 15th or 16th 

Tim, could you please run your tests when time allows?

Thanks

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: 2.0.6 release ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
> Hi,
> 
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any objections?
I'm targeting the 15th or 16th

Andreas

> 
> Andreas
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: 2.0.6 release ?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
+1  can work on some tickets over the weekend


> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler <an...@lehmi.de>:
> 
> Hi,
> 
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any objections?
> 
> Andreas
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org