Posted to dev@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2014/12/03 21:04:59 UTC
Re: preflight mass tests
I've now run preflight on half of the govdocs files. Every issue I have
opened on preflight is related to that test. The failure rate
(exceptions other than the "allowed" ValidationExceptions) is down from
1% when I started to 0.05% now. Most of the frequent exceptions (e.g.
the one with NonTerminalField) have been fixed. What's left now are
exceptions related to messy files, and some of the font-related issues.
Tilman
On 03.11.2014 at 22:58, Tilman Hausherr wrote:
> On 03.11.2014 at 19:00, Tilman Hausherr wrote:
>> It is not looking good, there is at least one NPE issue coming.
>
> No more NPE after solving the two issues I opened today except
> PDFBOX-1743.pdf which is a known problem.
>
> Coming up soon: run preflight on the 231227 PDF files from
> digitalcorpora to see what happens.
>
> Tilman
>
Re: preflight mass tests
Posted by Tilman Hausherr <TH...@t-online.de>.
Here's the code... it assumes that all PDFs are flat in one single
directory. Libraries needed: preflight-app, jai_imageio,
levigo_jbig2-imageio-1.6.1.jar. I have run it only with the trunk, not
with 1.8, because we didn't fix all problems there.
Tilman
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FilenameFilter;
import java.io.PrintWriter;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.exception.ValidationException;
import org.apache.pdfbox.preflight.parser.PreflightParser;

/**
 * Runs preflight on every PDF in one directory and counts unexpected exceptions.
 *
 * @author Tilman Hausherr
 */
public class PreflightTest
{
    public static void main(String[] args) throws FileNotFoundException
    {
        File dir;
        if (args.length > 0)
        {
            dir = new File(args[0]);
        }
        else
        {
            dir = new File("k:\\dc");
        }
        int total = 0;
        int failed = 0;
        File[] dirList = dir.listFiles(new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                // use this to start at a certain file
                if (name.compareTo("000000.pdf") <= 0)
                {
                    return false;
                }
                return name.toLowerCase().endsWith(".pdf");
            }
        });
        if (dirList == null)
        {
            // listFiles() returns null if dir is not a readable directory
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (File pdf : dirList)
        {
            ++total;
            System.out.println(pdf.getName());
            // just test that it doesn't crash
            try
            {
                // remove the stack trace file left over from an earlier run
                new File(pdf.getName() + "-exception.txt").delete();
                PreflightParser parser = new PreflightParser(pdf);
                parser.parse();
                try (PreflightDocument preflightDocument = parser.getPreflightDocument())
                {
                    preflightDocument.validate();
                    preflightDocument.getResult();
                }
                parser.clearResources();
            }
            catch (ValidationException e)
            {
                // "allowed" exception, not counted as a failure
            }
            catch (Throwable e)
            {
                ++failed;
                // keep the stack trace in <name>-exception.txt in the working directory
                try (PrintWriter pw = new PrintWriter(new File(pdf.getName() + "-exception.txt")))
                {
                    e.printStackTrace(pw);
                }
                System.out.flush();
                System.err.flush();
                System.err.print(pdf.getName() + " preflight fail: ");
                e.printStackTrace();
                System.out.flush();
                System.err.flush();
            }
            System.out.println("total: " + total + ", failed: " + failed
                    + ", percentage failed: " + (((float) failed) / total * 100.0) + "%");
        }
    }
}
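A possible companion to the harness above, not from the original post: since each unexpected exception leaves a "<name>-exception.txt" file in the working directory, a small tally over the first line of each file could give a per-exception breakdown. This is only a sketch. The class name ExceptionTally and both helper methods are made up for illustration, and it assumes the first line of each file is the usual Throwable.toString() output ("class" or "class: message"), which is what printStackTrace() writes.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox or of the original harness.
public class ExceptionTally
{
    /** Returns the exception class name from the first line of a stack trace. */
    public static String exceptionClass(String firstLine)
    {
        int colon = firstLine.indexOf(':');
        return (colon < 0 ? firstLine : firstLine.substring(0, colon)).trim();
    }

    /** Counts the "*-exception.txt" files in dir, grouped by exception class. */
    public static Map<String, Integer> tally(Path dir) throws IOException
    {
        Map<String, Integer> counts = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*-exception.txt"))
        {
            for (Path file : stream)
            {
                // the harness wrote these files with the platform default charset
                List<String> lines = Files.readAllLines(file, Charset.defaultCharset());
                String key = lines.isEmpty() ? "(empty)" : exceptionClass(lines.get(0));
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException
    {
        // default to the working directory, where the harness writes its files
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Integer> entry : tally(dir).entrySet())
        {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

Grouping by class name alone is coarse: two different NPEs land in the same bucket, so a finer breakdown would also have to look at the top stack frame.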
On 09.12.2014 at 17:28, Allison, Timothy B. wrote:
> Tilman,
> This is fantastic! If you send me an example of the code you used to call preflight (#parse() or #parse(Format format)???), I'd like to run it within tika-batch to see what our batch performance is.
> Ideally, once we can turn our public vm on, it would be fun to run these tests there.
>
>
> Best,
>
> Tim
>
RE: preflight mass tests
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Tilman,
This is fantastic! If you send me an example of the code you used to call preflight (#parse() or #parse(Format format)???), I'd like to run it within tika-batch to see what our batch performance is.
Ideally, once we can turn our public vm on, it would be fun to run these tests there.
Best,
Tim
-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Friday, December 05, 2014 2:45 PM
To: dev@pdfbox.apache.org
Subject: Re: preflight mass tests
Some numbers... it took 4-5 days
total: 231223, failed: 142, percentage failed: 0.06141257472336292
Of these, one can subtract 33 OutOfMemoryErrors that happened near the
end of the test. Isolated runs went fine.
about the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors
The rest is mostly related to very broken PDF files.
Tilman
Re: preflight mass tests
Posted by Tilman Hausherr <TH...@t-online.de>.
Since you replied to the list, I'll answer here too:
I don't know, I didn't try to display the "fails".
Tilman
On 09.12.2014 at 10:59, Maruan Sahyoun wrote:
> Hello Tilman,
>
> do you have a rough estimate of what share of the files is either not displayed, displayed with a dialog, or displayed incorrectly, e.g. in Adobe Reader?
>
> Kind regards
>
> Maruan Sahyoun
>
Re: preflight mass tests
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hello Tilman,
do you have a rough estimate of what share of the files is either not displayed, displayed with a dialog, or displayed incorrectly, e.g. in Adobe Reader?
Kind regards
Maruan Sahyoun
FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen
Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de
Managing Director: Maruan Sahyoun
Commercial register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827
On 05.12.2014 at 20:45, Tilman Hausherr <TH...@t-online.de> wrote:
> Some numbers... it took 4-5 days
>
> total: 231223, failed: 142, percentage failed: 0.06141257472336292
>
> Of these, one can subtract 33 OutOfMemoryErrors that happened near the end of the test. Isolated runs went fine.
>
> about the rest:
> 18 are the isSymbol stackoverflow
> 9 are the getFontMatrix NPE
> 33 are the "root must be of type Pages" errors
>
> The rest is mostly related to very broken PDF files.
>
> Tilman
>
>
Re: preflight mass tests
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,
I've nothing to add but +++++1
BR
Andreas Lehmkühler
On 23.01.2015 at 09:14, Maruan Sahyoun wrote:
> Hi Tilman,
>
> let me take the opportunity to say thank you for your efforts around code quality and testing. That doesn't result in "hey that's a great new feature" but is a very important part of the development which is very often not directly visible but takes time and dedication.
>
> Sincerely yours
> Maruan
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
Re: preflight mass tests
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,
let me take the opportunity to say thank you for your efforts around code quality and testing. That doesn't result in "hey that's a great new feature" but is a very important part of the development which is very often not directly visible but takes time and dedication.
Sincerely yours
Maruan
On 23.01.2015 at 09:00, Tilman Hausherr <TH...@t-online.de> wrote:
> Hi,
>
> Besides the "very broken files" (which results in errors in bad parameters for the PDF operators), there are the out of memory exceptions on huge files. I think that there are at most 5-10 files left with problems that can be solved. I'll start a new test when the Isartor improvements are done with a bigger memory setting, and will also open issues on the exceptions that I believe can be fixed.
>
> Tilman
>
Re: preflight mass tests
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
Besides the "very broken files" (which result in errors from bad
parameters for the PDF operators), there are the out-of-memory
exceptions on huge files. I think that at most 5-10 files are left
with problems that can be solved. Once the Isartor improvements are
done, I'll start a new test with a bigger memory setting, and I will
also open issues for the exceptions that I believe can be fixed.
Tilman
On 23.01.2015 at 08:54, Maruan Sahyoun wrote:
> Hi Tilman,
>
> that's very positive. Not only the number of failures is down by another 45% also the time has been reduced a lot. Might be a hint that some of the internal changes (parsing, closing …) and improvements in code quality start to pay off.
>
> For the 79 files - could you be a little more specific which errors we get? Are these still the ones mentioned in your earlier post?
>
> BR
>
> Maruan
>
Re: preflight mass tests
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,
that's very positive. Not only is the number of failures down by another 45%, the time has also been reduced a lot. This might be a hint that some of the internal changes (parsing, closing …) and improvements in code quality are starting to pay off.
For the 79 files, could you be a little more specific about which errors we get? Are these still the ones mentioned in your earlier post?
BR
Maruan
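For reference, the "down by another 45%" figure can be checked against the two run totals quoted in this thread (142 failures in December, 79 in January). A minimal sketch follows. The class and method names are made up for illustration, and the day counts come from the "2 days instead of 4" remark.

```java
// Hypothetical helper for comparing two test runs; not part of PDFBox.
public class RunComparison
{
    /** Relative reduction from before to after, in percent. */
    public static double relativeReductionPercent(double before, double after)
    {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args)
    {
        // failures: 142 (run of 05.12.2014) versus 79 (run of January 2015)
        System.out.printf("failures down by %.1f%%%n", relativeReductionPercent(142, 79));
        // duration: "2 days instead of 4"
        System.out.printf("duration down by %.1f%%%n", relativeReductionPercent(4, 2));
    }
}
```

The failure count drops by 63 out of 142, a bit over 44%, which is in line with the rounded 45%.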
On 23.01.2015 at 08:45, Tilman Hausherr <TH...@t-online.de> wrote:
> total: 231223, failed: 79, percentage failed (exceptions other than the "allowed" ValidationExceptions): 0.03416585677769035%
>
> This time it took only 2 days instead of 4. Maybe the change with closing made it faster?
>
> (This was done about a week ago, I forgot to send the posting)
>
> Tilman
>
Re: preflight mass tests
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,
no problem - I thought you might have the information.
BR
Maruan
On 12.02.2015 at 08:53, Tilman Hausherr <TH...@t-online.de> wrote:
> Sorry, can't tell. It would be too much work to open 73 files manually and scroll through, then write down which ones display without error.
>
> I'd rather try choosing some file with an exception that sounds like it could be corrected, and then submit that one. For now, however, I'd rather concentrate on new issues and the remaining 2.0 issues.
>
> Tilman
>
Re: preflight mass tests
Posted by Tilman Hausherr <TH...@t-online.de>.
Sorry, can't tell. It would be too much work to open 73 files manually
and scroll through, then write down which ones display without error.
I'd rather try choosing some file with an exception that sounds like it
could be corrected, and then submit that one. For now, however, I'd
rather concentrate on new issues and the remaining 2.0 issues.
Tilman
On 12.02.2015 at 07:52, Maruan Sahyoun wrote:
> Great, another 6 files we are now able to process. Of the remaining 73, how many can Adobe Reader process?
>
> Having that testbed really paid off!
>
> Maruan
>
Re: preflight mass tests
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Great, another 6 files we are now able to process. Of the remaining 73, how many can Adobe Reader process?
Having that testbed really paid off!
Maruan
On 12.02.2015 at 00:21, Tilman Hausherr <TH...@t-online.de> wrote:
> On 23.01.2015 at 08:45, Tilman Hausherr wrote:
>> total: 231223, failed: 79, percentage failed (exceptions other than the "allowed" ValidationExceptions): 0.03416585677769035%
> total: 231223, failed: 73, percentage failed: 0.0315705721732229%
>
Re: preflight mass tests
Posted by Tilman Hausherr <TH...@t-online.de>.
On 23.01.2015 at 08:45, Tilman Hausherr wrote:
> total: 231223, failed: 79, percentage failed (exceptions other than
> the "allowed" ValidationExceptions): 0.03416585677769035%
total: 231223, failed: 73, percentage failed: 0.0315705721732229%
Re: preflight mass tests
Posted by Tilman Hausherr <TH...@t-online.de>.
total: 231223, failed: 79, percentage failed (exceptions other than the
"allowed" ValidationExceptions): 0.03416585677769035%
This time it took only 2 days instead of 4. Maybe the change with
closing made it faster?
(This was done about a week ago, I forgot to send the posting)
Tilman
On 05.12.2014 at 20:45, Tilman Hausherr wrote:
> Some numbers... it took 4-5 days
>
> total: 231223, failed: 142, percentage failed: 0.06141257472336292
>
> Of these, one can subtract 33 OutOfMemoryErrors that happened near
> the end of the test. Isolated runs went fine.
>
> about the rest:
> 18 are the isSymbol stackoverflow
> 9 are the getFontMatrix NPE
> 33 are the "root must be of type Pages" errors
>
> The rest is mostly related to very broken PDF files.
>
> Tilman
>
>
Re: preflight mass tests
Posted by John Hewson <jo...@jahewson.com>.
Very impressive!
-- John
> On 5 Dec 2014, at 11:45, Tilman Hausherr <TH...@t-online.de> wrote:
>
> Some numbers... it took 4-5 days
>
> total: 231223, failed: 142, percentage failed: 0.06141257472336292
>
> Of these, one can subtract 33 OutOfMemoryErrors that happened near the end of the test. Isolated runs went fine.
>
> about the rest:
> 18 are the isSymbol stackoverflow
> 9 are the getFontMatrix NPE
> 33 are the "root must be of type Pages" errors
>
> The rest is mostly related to very broken PDF files.
>
> Tilman
>
>
Re: preflight mass tests
Posted by Tilman Hausherr <TH...@t-online.de>.
Some numbers... it took 4-5 days
total: 231223, failed: 142, percentage failed: 0.06141257472336292
Of these, one can subtract 33 OutOfMemoryErrors that happened near the
end of the test. Isolated runs went fine.
about the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors
The rest is mostly related to very broken PDF files.
Tilman
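The arithmetic behind these numbers can be spelled out in a few lines. This is a sketch only: the class and method names are made up, and it reads "the rest" as the failures left after the 33 OutOfMemoryErrors are taken out, which is an interpretation of the post rather than something it states outright.

```java
// Hypothetical helper that redoes the breakdown from the post; not part of PDFBox.
public class FailureBreakdown
{
    /** Failure rate in percent. */
    public static double percentFailed(int failed, int total)
    {
        return 100.0 * failed / total;
    }

    /** What is left of a count after subtracting the given categories. */
    public static int remainder(int count, int... categories)
    {
        int rest = count;
        for (int category : categories)
        {
            rest -= category;
        }
        return rest;
    }

    public static void main(String[] args)
    {
        int total = 231223;   // files tested
        int failed = 142;     // exceptions other than ValidationException

        // 33 OutOfMemoryErrors that did not reproduce in isolated runs
        int withoutOom = remainder(failed, 33);
        // isSymbol stack overflow, getFontMatrix NPE, "root must be of type Pages"
        int uncategorized = remainder(withoutOom, 18, 9, 33);

        System.out.println("overall: " + percentFailed(failed, total) + "%");
        System.out.println("without OutOfMemoryErrors: " + withoutOom
                + " (" + percentFailed(withoutOom, total) + "%)");
        System.out.println("not in a named category: " + uncategorized);
    }
}
```

Under that reading, 109 failures remain once the OutOfMemoryErrors are set aside, and 49 of those fall outside the three named categories.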
On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
> Hi Tilman,
>
> that's very good news. I trust a lot of time went into reviewing the test results. Without your and Tim's efforts this achievement wouldn't have been possible.
>
> BR
>
> Maruan
>
Re: preflight mass tests
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Tilman,
that's very good news. I trust a lot of time went into reviewing the test results. Without your and Tim's efforts this achievement wouldn't have been possible.
BR
Maruan
On 03.12.2014 at 21:04, Tilman Hausherr <TH...@t-online.de> wrote:
> I've now run preflight on half of the govdocs files. Every issue I have opened on preflight is related to that test. The failure rate (exceptions other than the "allowed" ValidationExceptions) is down from 1% when I started to 0.05% now. Most of the frequent exceptions (e.g. the one with NonTerminalField) have been fixed. What's left now are exceptions related to messy files, and some of the font-related issues.
>
> Tilman
>