Posted to dev@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2014/12/03 21:04:59 UTC

Re: preflight mass tests

I've now run preflight on half of the govdocs files. Every issue I have
opened on preflight is related to that test. The failure rate
(exceptions other than the "allowed" ValidationExceptions) is down from
1% when I started to 0.05% now. Most of the frequent exceptions (e.g.
the one with NonTerminalField) have been fixed. What's left now are
exceptions related to messy files, and some of the font-related issues.

Tilman

On 03.11.2014 at 22:58, Tilman Hausherr wrote:
> On 03.11.2014 at 19:00, Tilman Hausherr wrote:
>> It is not looking good; there is at least one NPE issue coming.
>
> No more NPEs after solving the two issues I opened today, except for
> PDFBOX-1743.pdf, which is a known problem.
>
> Coming up soon: running preflight on the 231227 PDF files from
> digitalcorpora to see what happens.
>
> Tilman
>


Re: preflight mass tests

Posted by Tilman Hausherr <TH...@t-online.de>.
Here's the code. It assumes that all PDFs sit flat in a single
directory. Libraries needed: preflight-app, jai_imageio,
levigo_jbig2-imageio-1.6.1.jar. I have run it only against the trunk,
not against 1.8, because not all problems have been fixed there.
Tilman

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FilenameFilter;
import java.io.PrintWriter;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.exception.ValidationException;
import org.apache.pdfbox.preflight.parser.PreflightParser;

/**
  *
  * @author Tilman Hausherr
  */
public class PreflightTest
{
     public static void main(String[] args) throws FileNotFoundException
     {
         File dir;
         if (args.length > 0)
         {
             dir = new File(args[0]);
         }
         else
         {
             dir = new File("k:\\dc");
         }

         int total = 0;
         int failed = 0;
         File[] dirList = dir.listFiles(new FilenameFilter()
         {
             @Override
             public boolean accept(File dir, String name)
             {
                 // use this to start at a certain file
                 if (name.compareTo("000000.pdf") <= 0)
                 {
                     return false;
                 }
                 return name.toLowerCase().endsWith(".pdf");
             }
         });
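         // listFiles() returns null if dir does not exist or cannot be read
         if (dirList == null)
         {
             System.err.println(dir + " is not a readable directory");
             return;
         }
         for (File pdf : dirList)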
         {
             ++total;
             System.out.println(pdf.getName());
             // just test that it doesn't crash
             try
             {
                 new File(pdf.getName() + "-exception.txt").delete();
                 PreflightParser parser = new PreflightParser(pdf);
                 parser.parse();
                 try (PreflightDocument preflightDocument = parser.getPreflightDocument())
                 {
                     preflightDocument.validate();
                     preflightDocument.getResult();
                 }
                 parser.clearResources();
             }
             catch (ValidationException e)
             {
                 // "allowed" exception for non-conforming files; not counted as a failure
             }
             catch (Throwable e)
             {
                 ++failed;
                 try (PrintWriter pw = new PrintWriter(new File(pdf.getName() + "-exception.txt")))
                 {
                     e.printStackTrace(pw);
                 }
                 System.out.flush();
                 System.err.flush();
                 System.err.print(pdf.getName() + " preflight fail: ");
                 e.printStackTrace();
                 System.out.flush();
                 System.err.flush();
             }
             System.out.println("total: " + total + ", failed: " + failed
                     + ", percentage failed: " + (((float) failed) / total * 100.0) + "%");
         }

     }

}
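
The directory scan above relies on File.listFiles(), which returns null for an unreadable directory and makes no promise about ordering, so the "start at a certain file" filter is easier to reason about when the listing is sorted. A minimal sketch of just that step, using only java.nio from the standard library; the class name PdfLister and the method listPdfsAfter are hypothetical and are not part of PDFBox or of the original test:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfLister
{
    /**
     * Returns the PDF files directly inside dir (no recursion), sorted,
     * keeping only file names that sort strictly after startAfter.
     */
    static List<Path> listPdfsAfter(Path dir, String startAfter) throws IOException
    {
        // Files.list throws an IOException instead of returning null
        try (Stream<Path> entries = Files.list(dir))
        {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .filter(p -> p.getFileName().toString().compareTo(startAfter) > 0)
                    .sorted() // natural Path order; all entries share the same parent
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : listPdfsAfter(dir, "000000.pdf"))
        {
            System.out.println(pdf.getFileName());
        }
    }
}
```

With a sorted listing, an interrupted run can be resumed by passing the name of the last file that was printed.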


On 09.12.2014 at 17:28, Allison, Timothy B. wrote:
> Tilman,
>    This is fantastic!  If you send me an example of the code you used to call preflight (#parse() or  #parse(Format format)???), I'd like to run it within tika-batch to see what our batch performance is.
>    Ideally, once we can turn our public vm on, it would be fun to run these tests there.
>    
>
>           Best,
>
>                      Tim


RE: preflight mass tests

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Tilman,
  This is fantastic!  If you send me an example of the code you used to call preflight (#parse() or  #parse(Format format)???), I'd like to run it within tika-batch to see what our batch performance is.
  Ideally, once we can turn our public vm on, it would be fun to run these tests there.
  

         Best,

                    Tim

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Friday, December 05, 2014 2:45 PM
To: dev@pdfbox.apache.org
Subject: Re: preflight mass tests

Some numbers... it took 4-5 days

total: 231223, failed: 142, percentage failed: 0.06141257472336292

Of these, one can subtract 33 OutOfMemoryErrors that happened near the
end of the test. Isolated runs went fine.

about the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors

The rest is mostly related to very broken PDF files.

Tilman
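
The breakdown above can be tallied to see how many failures fall into the "very broken" remainder. A small sketch that uses only the counts quoted in this message; the class and method names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FailureBreakdown
{
    /** Failures left over once all named categories are subtracted. */
    static int remainder(int failed, Map<String, Integer> categories)
    {
        int named = 0;
        for (int count : categories.values())
        {
            named += count;
        }
        return failed - named;
    }

    public static void main(String[] args)
    {
        // counts as reported for the run with 142 failures
        Map<String, Integer> categories = new LinkedHashMap<>();
        categories.put("OutOfMemoryError", 33);
        categories.put("isSymbol stack overflow", 18);
        categories.put("getFontMatrix NPE", 9);
        categories.put("root must be of type Pages", 33);

        // 142 - (33 + 18 + 9 + 33) = 49
        System.out.println("remaining failures: " + remainder(142, categories));
    }
}
```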


On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
> Hi Tilman,
>
> that's very good news. I trust a lot of time went into reviewing the test results. Without your and Tim's efforts this achievement wouldn't have been possible.
>
> BR
>
> Maruan


Re: preflight mass tests

Posted by Tilman Hausherr <TH...@t-online.de>.
Since you replied to the list, I'll answer here too:
I don't know, I didn't try to display the "fails".

Tilman

On 09.12.2014 at 10:59, Maruan Sahyoun wrote:
> Hello Tilman,
>
> do you have a rough estimate of what share of the files is, e.g. in Adobe Reader, either not displayed, displayed with a dialog, or displayed incorrectly?
>
> Kind regards
>   
> Maruan Sahyoun
>
> FileAffairs GmbH
> Josef-Schappe-Straße 21
> 40882 Ratingen
>
> Tel: +49 (2102) 89497 88
> Fax: +49 (2102) 89497 91
> sahyoun@fileaffairs.de
> www.fileaffairs.de
>
> Managing Director: Maruan Sahyoun
> Commercial register: AG Düsseldorf, HRB 53837
> VAT ID: DE248275827


Re: preflight mass tests

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hello Tilman,

do you have a rough estimate of what share of the files is, e.g. in Adobe Reader, either not displayed, displayed with a dialog, or displayed incorrectly?

Kind regards
 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827

On 05.12.2014 at 20:45, Tilman Hausherr <TH...@t-online.de> wrote:

> Some numbers... it took 4-5 days
>
> total: 231223, failed: 142, percentage failed: 0.06141257472336292
>
> Of these, one can subtract 33 OutOfMemoryErrors that happened near the end of the test. Isolated runs went fine.
> 
> about the rest:
> 18 are the isSymbol stackoverflow
> 9 are the getFontMatrix NPE
> 33 are the "root must be of type Pages" errors
> 
> The rest is mostly related to very broken PDF files.
> 
> Tilman


Re: preflight mass tests

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

I've nothing to add but +++++1

BR
Andreas Lehmkühler

On 23.01.2015 at 09:14, Maruan Sahyoun wrote:
> Hi Tilman,
>
> let me take the opportunity to say thank you for your efforts around code quality and testing. This kind of work doesn't result in "hey, that's a great new feature", but it is a very important part of development that is often not directly visible and takes time and dedication.
>
> Sincerely yours
> Maruan


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: preflight mass tests

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,

let me take the opportunity to say thank you for your efforts around code quality and testing. This kind of work doesn't result in "hey, that's a great new feature", but it is a very important part of development that is often not directly visible and takes time and dedication.

Sincerely yours
Maruan

On 23.01.2015 at 09:00, Tilman Hausherr <TH...@t-online.de> wrote:

> Hi,
>
> Besides the "very broken files" (which result in errors caused by bad parameters for the PDF operators), there are the out-of-memory errors on huge files. I think that at most 5-10 files are left with problems that can be solved. Once the Isartor improvements are done, I'll start a new test with a bigger memory setting, and I will also open issues for the exceptions that I believe can be fixed.
> 
> Tilman


Re: preflight mass tests

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

Besides the "very broken files" (which result in errors caused by bad
parameters for the PDF operators), there are the out-of-memory errors
on huge files. I think that at most 5-10 files are left with problems
that can be solved. Once the Isartor improvements are done, I'll start
a new test with a bigger memory setting, and I will also open issues
for the exceptions that I believe can be fixed.

Tilman

On 23.01.2015 at 08:54, Maruan Sahyoun wrote:
> Hi Tilman,
>
> that's very positive. Not only is the number of failures down by another 45%, the run time has also been reduced a lot. That might be a hint that some of the internal changes (parsing, closing …) and improvements in code quality are starting to pay off.
>
> For the 79 files - could you be a little more specific about which errors we get? Are these still the ones mentioned in your earlier post?
>
> BR
>
> Maruan


Re: preflight mass tests

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,

that's very positive. Not only is the number of failures down by another 45%, the run time has also been reduced a lot. That might be a hint that some of the internal changes (parsing, closing …) and improvements in code quality are starting to pay off.

For the 79 files - could you be a little more specific about which errors we get? Are these still the ones mentioned in your earlier post?

BR

Maruan
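
The "another 45%" figure can be checked against the two totals reported in this thread (142 failures in the December run, 79 in the January run). A quick sketch, with an illustrative class name:

```java
import java.util.Locale;

public class Reduction
{
    /** Relative drop from before to after, in percent. */
    static double percentDrop(int before, int after)
    {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args)
    {
        // (142 - 79) / 142 is roughly 44.4 %, i.e. "about 45 %"
        System.out.printf(Locale.ROOT, "%.1f%%%n", percentDrop(142, 79));
    }
}
```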

On 23.01.2015 at 08:45, Tilman Hausherr <TH...@t-online.de> wrote:

> total: 231223, failed: 79, percentage failed (exceptions other than the "allowed" ValidationExceptions): 0.03416585677769035%
> 
> This time it took only 2 days instead of 4. Maybe the change with closing made it faster?
> 
> (This was done about a week ago, I forgot to send the posting)
> 
> Tilman


Re: preflight mass tests

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,

no problem - I thought you might have the information. 

BR
Maruan

On 12.02.2015 at 08:53, Tilman Hausherr <TH...@t-online.de> wrote:

> Sorry, I can't tell. It would be too much work to open 73 files manually, scroll through them, and write down which ones display without error.
>
> I'd rather try choosing some file with an exception that sounds like it could be corrected, and then submit that one. However, for now I'd rather concentrate on new issues and the remaining 2.0 issues.
> 
> Tilman


Re: preflight mass tests

Posted by Tilman Hausherr <TH...@t-online.de>.
Sorry, I can't tell. It would be too much work to open 73 files manually,
scroll through them, and write down which ones display without error.

I'd rather try choosing some file with an exception that sounds like it
could be corrected, and then submit that one. However, for now I'd rather
concentrate on new issues and the remaining 2.0 issues.

Tilman

On 12.02.2015 at 07:52, Maruan Sahyoun wrote:
> Great - another 6 files we are now able to process. Of the remaining 73, how many can Adobe Reader process?
>
> Having that testbed really paid off!
>
> Maruan




Re: preflight mass tests

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Great - another 6 files we are now able to process. Of the remaining 73, how many can Adobe Reader process?

Having that testbed really paid off!

Maruan

On 12.02.2015 at 00:21, Tilman Hausherr <TH...@t-online.de> wrote:

> On 23.01.2015 at 08:45, Tilman Hausherr wrote:
>> total: 231223, failed: 79, percentage failed (exceptions other than the "allowed" ValidationExceptions): 0.03416585677769035%
> total: 231223, failed: 73, percentage failed: 0.0315705721732229%


Re: preflight mass tests

Posted by Tilman Hausherr <TH...@t-online.de>.
On 23.01.2015 at 08:45, Tilman Hausherr wrote:
> total: 231223, failed: 79, percentage failed (exceptions other than 
> the "allowed" ValidationExceptions): 0.03416585677769035%
total: 231223, failed: 73, percentage failed: 0.0315705721732229%
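
The test program posted earlier in this thread computes the percentage as a float division multiplied by the double 100.0, which is why the figures carry such long decimal tails. A sketch of the same calculation done in double and rounded for display, using the three totals reported in this thread; the class name FailureRate is illustrative:

```java
import java.util.Locale;

public class FailureRate
{
    /** Percentage of failed files, formatted with four decimals. */
    static String percentFailed(int failed, int total)
    {
        if (total == 0)
        {
            return "n/a"; // avoid dividing by zero on an empty directory
        }
        double percent = 100.0 * failed / total;
        return String.format(Locale.ROOT, "%.4f%%", percent);
    }

    public static void main(String[] args)
    {
        int total = 231223;
        // failure counts of the December, January and February runs
        for (int failed : new int[] { 142, 79, 73 })
        {
            System.out.println("total: " + total + ", failed: " + failed
                    + ", percentage failed: " + percentFailed(failed, total));
        }
    }
}
```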



Re: preflight mass tests

Posted by Tilman Hausherr <TH...@t-online.de>.
total: 231223, failed: 79, percentage failed (exceptions other than the 
"allowed" ValidationExceptions): 0.03416585677769035%

This time it took only 2 days instead of 4. Maybe the change with 
closing made it faster?

(This was done about a week ago, I forgot to send the posting)

Tilman

On 05.12.2014 at 20:45, Tilman Hausherr wrote:
> Some numbers... it took 4-5 days
>
> total: 231223, failed: 142, percentage failed: 0.06141257472336292
>
> Of these, one can subtract 33 OutOfMemoryErrors that happened near
> the end of the test. Isolated runs went fine.
>
> about the rest:
> 18 are the isSymbol stackoverflow
> 9 are the getFontMatrix NPE
> 33 are the "root must be of type Pages" errors
>
> The rest is mostly related to very broken PDF files.
>
> Tilman


Re: preflight mass tests

Posted by John Hewson <jo...@jahewson.com>.
Very impressive!

-- John

> On 5 Dec 2014, at 11:45, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> Some numbers... it took 4-5 days
> 
> total: 231223, failed: 142, percentage failed: 0.06141257472336292
> 
> Of these, one can subtract 33 OutOfMemoryErrors that happened near the end of the test. Isolated runs went fine.
> 
> about the rest:
> 18 are the isSymbol stackoverflow
> 9 are the getFontMatrix NPE
> 33 are the "root must be of type Pages" errors
> 
> The rest is mostly related to very broken PDF files.
> 
> Tilman


Re: preflight mass tests

Posted by Tilman Hausherr <TH...@t-online.de>.
Some numbers... it took 4-5 days

total: 231223, failed: 142, percentage failed: 0.06141257472336292

Of these, one can subtract 33 OutOfMemoryErrors that happened near the 
end of the test. Isolated runs went fine.

about the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors

The rest is mostly related to very broken PDF files.

Tilman
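
The arithmetic behind these figures can be checked with a short sketch. It uses only the counts quoted in this post; the class and method names are made up for illustration and are not part of the test harness.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FailureBreakdown
{
    /** Failed files as a percentage of all files tested. */
    static double percentage(int failed, int total)
    {
        return failed * 100.0 / total;
    }

    /** Sum of the per-category counts. */
    static int sum(Map<String, Integer> counts)
    {
        int result = 0;
        for (int count : counts.values())
        {
            result += count;
        }
        return result;
    }

    public static void main(String[] args)
    {
        int total = 231223;
        int failed = 142;
        int outOfMemory = 33;

        // the categories named in the post
        Map<String, Integer> known = new LinkedHashMap<>();
        known.put("isSymbol stack overflow", 18);
        known.put("getFontMatrix NPE", 9);
        known.put("root must be of type Pages", 33);

        int withoutOom = failed - outOfMemory;        // 142 - 33 = 109
        int uncategorized = withoutOom - sum(known);  // 109 - 60 = 49

        System.out.println("all failures: " + percentage(failed, total) + "%");
        System.out.println("without OutOfMemoryErrors: "
                + percentage(withoutOom, total) + "%");
        System.out.println("not in a named category: " + uncategorized);
    }
}
```

With the OutOfMemoryErrors set aside, 109 failures remain, roughly 0.047% of the files; 60 of those fall into the three named categories, which leaves 49 for the very broken files.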


On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
> Hi Tilman,
>
> That's very good news. I trust a lot of time went into reviewing the test results. Without your and Tim's efforts this achievement wouldn't have been possible.
>
> BR
>
> Maruan


Re: preflight mass tests

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,

That's very good news. I trust a lot of time went into reviewing the test results. Without your and Tim's efforts this achievement wouldn't have been possible.

BR

Maruan

On 03.12.2014 at 21:04, Tilman Hausherr <TH...@t-online.de> wrote:

> I've now run preflight on half of the govdocs files. Every issue I have opened on preflight is related to that test. The failure rate (exceptions other than the "allowed" ValidationExceptions) is down from 1% when I started to 0.05% now. Most of the frequent exceptions (e.g. the one with NonTerminalField) have been fixed. What's left now are exceptions related to messy files, and some of the font-related issues.
> 
> Tilman