
Posted to dev@pdfbox.apache.org by John Hewson <jo...@jahewson.com> on 2014/07/04 02:16:17 UTC

Regression Testing

Hi All

I’ve been thinking about regression testing recently and how we can improve
our tests for rendering. There are currently two problems:

1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
(I suspect that AWT fonts are a big part of this, so the problem might get a lot better
soon once we render all fonts ourselves).

2) Most PDF test files we have are not under an Apache-friendly license, so
we can’t put the test files into the trunk SVN.

It seems that some of you have your own collections of test PDF files which you are
running regression tests on: that’s great but it would be much better if we had a
central repository of test files and sample renderings.

I’d like to suggest the following solutions to the above issues:

1) We should choose a “blessed” JDK which will be used to perform the renderings;
this should be whatever is a convenient and sensible default for committers. (My
preference would be for Oracle’s JDK 7 because JDK 6 is deprecated and has known
rendering bugs). We should make sure that Jenkins runs tests using the “blessed”
JDK.

The regression test can then check to see if it is running on the “blessed” JDK and
if not then the tests can be skipped and we can warn the user.

2) We should create a new “regression” branch in SVN which contains only PDF files
for testing and PNG images which contain known-good renderings created using the
“blessed” JDK. This branch would not be part of the source of PDFBox but will still
allow us to version control the test PDFs (it also simplifies the workflow for adding
new test PDFs and new known-good renderings: simply do an “svn add”).

As far as copyright and licensing is concerned we can put any PDF files which are
available publicly on the web into this branch without too much worry.
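
A minimal sketch of the “blessed” JDK check from point 1, not an agreed implementation. The class name BlessedJdk and the vendor and version strings are assumptions for illustration only, since no JDK has been chosen yet. It relies only on the standard java.vendor and java.specification.version system properties.

```java
// Hypothetical helper: decides whether rendering regression tests should run.
final class BlessedJdk {
    // Assumed values for Oracle JDK 7; adjust once a JDK is actually chosen.
    static final String BLESSED_VENDOR = "Oracle Corporation";
    static final String BLESSED_SPEC_VERSION = "1.7";

    // Pure comparison, kept separate so it can be tested on any JVM.
    static boolean isBlessed(String vendor, String specVersion) {
        return BLESSED_VENDOR.equals(vendor) && BLESSED_SPEC_VERSION.equals(specVersion);
    }

    // Checks the JVM that is currently running the tests.
    static boolean runningOnBlessedJdk() {
        return isBlessed(System.getProperty("java.vendor"),
                         System.getProperty("java.specification.version"));
    }

    public static void main(String[] args) {
        if (!runningOnBlessedJdk()) {
            // Skip and warn instead of failing, as proposed above.
            System.err.println("WARNING: rendering regression tests skipped; expected "
                    + BLESSED_VENDOR + " JDK " + BLESSED_SPEC_VERSION + " but found "
                    + System.getProperty("java.vendor") + " "
                    + System.getProperty("java.specification.version"));
            return;
        }
        System.out.println("Running rendering regression tests on the blessed JDK");
    }
}
```

In a JUnit 4 test the same condition could probably be passed to org.junit.Assume.assumeTrue, so that the test is reported as skipped rather than failed.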

What does everybody think?

-- John

Re: PDFDays

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 04.07.2014 19:50, schrieb John Hewson:
> PS. Nice job handling those tough questions at PDFDays, I watched the video.

Nobody expects the Spanish Inquisition:
https://www.youtube.com/watch?v=Tym0MObFpTI

I found what one of the inquisitors was referring to:
http://www.haskellforall.com/2014/04/worst-practices-are-viral-for-wrong.html

Btw, there have been fewer questions on Stack Overflow recently. Either the
newbies are on vacation, or the software is getting better :)

Tilman

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

Hi Maruan

Thanks for your thoughts...

> # Tests #
> In addition to rendering we shall be covering metadata and text extraction as well as PDF/A validation. 

Yes, we could add extracted text and validation results to the “regression” SVN repo also.

> # Testfiles # 
> Recently there were a number of test sets made available which we can use. […]

Excellent.

> In addition we can put additional files into our own repository as you suggested.
> So there is no shortage of test files.

Some people seem to have downloaded many (or all) of the JIRA files, I guess we could add those too.

> TIKA-1300/TIKA-1302 has a discussion around the same topic together with some development for an infrastructure (VM, Jenkins …). IMHO we should join forces with them.

I see that in TIKA-1302 the Tika developers suggest that PDFBox should set up its own regression tests, so I guess that’s our starting point. We should make sure that it’s easy to run just the text extraction regression tests using maven, and also ask them to give us any test files they have.

-- John

PS. Nice job handling those tough questions at PDFDays, I watched the video.

On 3 Jul 2014, at 23:43, Maruan Sahyoun <sa...@fileaffairs.de> wrote:

> Hi John,
> 
> thanks for bringing this up. This is a very important topic which was also discussed at the PDFDays in Germany.
> 
> # Tests #
> In addition to rendering we shall be covering metadata and text extraction as well as PDF/A validation. 
> 
> # Testfiles # 
> Recently there were a number of test sets made available which we can use. http://digitalcorpora.org/corpora/files , https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
> For PDF/A validation there is the Isartor test suite http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions apply there.
> In addition we can put additional files into our own repository as you suggested.
> So there is no shortage of test files.
> 
> TIKA-1300/TIKA-1302 has a discussion around the same topic together with some development for an infrastructure (VM, Jenkins …). IMHO we should join forces with them.
> 
> BR
> 
> Maruan
> 
> 
> Am 04.07.2014 um 02:16 schrieb John Hewson <jo...@jahewson.com>:
> 
>> Hi All
>> 
>> I’ve been thinking about regression testing recently and how we can improve
>> our tests for rendering. There are currently two problems:
>> 
>> 1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
>>   (I suspect that AWT fonts are a big part of this, so the problem might get a lot better
>>   soon once we render all fonts ourselves).
>> 
>> 2) Most PDF test files we have are not under an Apache-friendly license, so
>>   we can’t put the test files into the trunk SVN.
>> 
>> It seems that some of you have your own collections of test PDF files which you are
>> running regression tests on: that’s great but it would be much better if we had a
>> central repository of test files and sample renderings.
>> 
>> I’d like to suggest the following solutions to the above issues:
>> 
>> 1) We should choose a “blessed” JDK which will be used to perform the renderings;
>>   this should be whatever is a convenient and sensible default for committers. (My
>>   preference would be for Oracle’s JDK 7 because JDK 6 is deprecated and has known
>>   rendering bugs). We should make sure that Jenkins runs tests using the “blessed”
>>   JDK.
>> 
>>  The regression test can then check to see if it is running on the “blessed” JDK and
>>  if not then the tests can be skipped and we can warn the user.
>> 
>> 2) We should create a new “regression” branch in SVN which contains only PDF files
>>   for testing and PNG images which contain known-good renderings created using the
>>   “blessed” JDK. This branch would not be part of the source of PDFBox but will still
>>   allow us to version control the test PDFs (it also simplifies the workflow for adding
>>   new test PDFs and new known-good renderings: simply do an “svn add”).
>> 
>>   As far as copyright and licensing is concerned we can put any PDF files which are
>>   available publicly on the web into this branch without too much worry.
>> 
>> What does everybody think?
>> 
>> -- John
>> 
>

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

Hi Tim,

>  My initial plan for TIKA-1302 is very similar to what Tilman outlined, and my understanding/concerns/thoughts were very much in line with what he articulated.  The idea is that there should be a small Apache license-able gold truth set like both projects now have for specific unit tests (patient-based care), but that we should also occasionally take a public-health view and compare the outputs of  different versions of our parsers on a large set of docs to identify new exceptions or large changes in extracted content/metadata. 

I’m not aware of a good supply of Apache license-able PDF files, we have very few such tests currently. For regression tests to be useful we really have to run our tests on a large corpus of real files every time.

>   I'm persuaded by your points about fair use and the importance of "open data."  Before proceeding on TIKA-1302, I'd like to get broader feedback on the way ahead via legal-discuss or maybe jira's Legal.  Do you mind if I quote your arguments?

Please do, though obviously I’m not a lawyer. My reasoning is basically that Google does essentially the same thing that we want to do, and they have plenty of lawyers who presumably know what they’re doing.

>   Also, I was on my way to requesting a vm from infra for TIKA-1302.  Do you see any way that we could share resources so that we're not double-storing files on Apache infrastructure?  There may be easy ways to share some eval code as well.

I was thinking of just storing our test files in an SVN branch. The Tika project should already have read access (write access would be for PDFBox committers only, otherwise our builds will get broken). The tests could run on Jenkins as part of the normal build process. For eval code I was planning to simply have a single parameterized JUnit test which runs in parallel; that way it’s easy to run from an IDE, to debug, and to integrate with Maven. The unit test would look for source files in ../../regression, which would be a directory above the SVN trunk (i.e. a separate repo). It would do a full rendering of each file to a PNG and compare the results. We’ll probably have a text extraction test too: perhaps that’s more like what Tika will need?
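
A rough sketch of how such a test might discover its inputs, using only the JDK. The class name RegressionCases and the "name-page.png" naming convention are assumptions for illustration, and the actual rendering and comparison are left out.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for locating regression inputs and their known-good renderings.
final class RegressionCases {
    // Lists the PDFs directly inside dir, sorted so that runs are reproducible.
    static List<Path> findPdfs(Path dir) throws IOException {
        List<Path> pdfs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.pdf")) {
            for (Path pdf : stream) {
                pdfs.add(pdf);
            }
        }
        Collections.sort(pdfs);
        return pdfs;
    }

    // Assumed naming convention: page 1 of foo.pdf is stored as foo-1.png next to the PDF.
    static Path expectedPngFor(Path pdf, int page) {
        String name = pdf.getFileName().toString();
        String base = name.substring(0, name.length() - ".pdf".length());
        return pdf.resolveSibling(base + "-" + page + ".png");
    }
}
```

The list returned by findPdfs(Paths.get("../../regression")) could then feed the parameters of a parameterized JUnit test, one test case per PDF.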

Thanks

-- John

RE: Regression Testing

Posted by "Allison, Timothy B." <ta...@mitre.org>.

John,

   My initial plan for TIKA-1302 is very similar to what Tilman outlined, and my understanding/concerns/thoughts were very much in line with what he articulated.  The idea is that there should be a small Apache license-able gold truth set like both projects now have for specific unit tests (patient-based care), but that we should also occasionally take a public-health view and compare the outputs of  different versions of our parsers on a large set of docs to identify new exceptions or large changes in extracted content/metadata. 

   I'm persuaded by your points about fair use and the importance of "open data."  Before proceeding on TIKA-1302, I'd like to get broader feedback on the way ahead via legal-discuss or maybe jira's Legal.  Do you mind if I quote your arguments?

   Also, I was on my way to requesting a vm from infra for TIKA-1302.  Do you see any way that we could share resources so that we're not double-storing files on Apache infrastructure?  There may be easy ways to share some eval code as well.

          Best,

                   Tim

-----Original Message-----
From: John Hewson [mailto:john@jahewson.com] 
Sent: Saturday, July 05, 2014 5:01 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing

On 5 Jul 2014, at 13:47, Tilman Hausherr <TH...@t-online.de> wrote:

> Am 05.07.2014 22:12, schrieb John Hewson:
>>>>> Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access them. The public PDFs that Maruan mentions may not have all the problem cases that we solved before. However, I have started working with these files and there are at least 5 recent issues that deal with them.
>>>> The PDFs won't be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the "cmssite" branch of SVN or indeed, are on JIRA. The law doesn't distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we're already doing with JIRA.
>>>> 
>>>> The crucial factor is that we're only storing publicly available PDFs,  because we have the right to do so, just like Google's cache, and like we currently do with JIRA.
>>> Yes but many PDFs we got aren't really "public". If this svn repo is only accessible to committers, and if the publicly available build scripts won't break because of this, then it is OK.
>> Any non-public PDFs will not be permitted in our test suite, just as they shouldn't be on JIRA.
>> 
>>> Note that even if something is "publicly available", it may still be copyrighted. Other risks can be that some people upload PDFs that include personal data. One really good test PDF was apparently a loan application. I remember that the user insisted that 1. it was test data, and 2. that it be removed.
>> All Apache development should be in the open, this is a key ASF principle, having a committers-only test suite is basically a no-no. It's important to understand that "fair use" allows us to use copyrighted works - this is expressly permitted, it's the same legal principle as Google's cache. There is no need to seek permission. This is what we've been doing with JIRA already for years, so we are already doing this - it's fine.
> 
> The problem is that this has all happened before. A few years ago, many files were deleted, see PDFBOX-391.

That issue is about including files in the source code repo as part of the PDFBox distribution, where there is a need to put files under an Apache 2.0 compatible license. What I'm advocating is keeping a separate public repository of test files which are not a part of the PDFBox source, like we currently have on JIRA.

-- John

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

On 6 Jul 2014, at 01:28, Guillaume Bailleul <gb...@gmail.com> wrote:

> About "why are the Isartor tests not run by default?"
> 
> In the early days of preflight in PDFBox, I made it not "by default"
> because some manual steps were needed to make it work, and I was not good
> with Maven at that time. When I changed that by using a download
> plugin for Maven, I did not change the default mode, only so as not to
> break the build, as the preflight code was not so stable.
> 
> I have no objection to changing the default mode. One idea could
> be to move the tests into integration tests, maybe using the failsafe
> plugin. I can work on it.

Great, I’m going to enable these tests by default in the trunk.

Running these tests as unit tests with surefire looks good to me. As there
isn’t a test environment which needs tearing down I’m not sure that we’d
stand to gain from moving to failsafe?

-- John

Re: Regression Testing

Posted by Guillaume Bailleul <gb...@gmail.com>.

About "why are the Isartor tests not run by default?"

In the early days of preflight in PDFBox, I made it not "by default"
because some manual steps were needed to make it work, and I was not good
with Maven at that time. When I changed that by using a download
plugin for Maven, I did not change the default mode, only so as not to
break the build, as the preflight code was not so stable.

I have no objection to changing the default mode. One idea could
be to move the tests into integration tests, maybe using the failsafe
plugin. I can work on it.



On Sat, Jul 5, 2014 at 11:01 PM, John Hewson <jo...@jahewson.com> wrote:
>
> On 5 Jul 2014, at 13:47, Tilman Hausherr <TH...@t-online.de> wrote:
>
>> Am 05.07.2014 22:12, schrieb John Hewson:
>>>>>> Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access them. The public PDFs that Maruan mentions may not have all the problem cases that we solved before. However, I have started working with these files and there are at least 5 recent issues that deal with them.
>>>>> The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.
>>>>>
>>>>> The crucial factor is that we’re only storing publicly available PDFs,  because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.
>>>> Yes but many PDFs we got aren't really "public". If this svn repo is only accessible to committers, and if the publicly available build scripts won't break because of this, then it is OK.
>>> Any non-public PDFs will not be permitted in our test suite, just as they shouldn't be on JIRA.
>>>
>>>> Note that even if something is "publicly available", it may still be copyrighted. Other risks can be that some people upload PDFs that include personal data. One really good test PDF was apparently a loan application. I remember that the user insisted that 1. it was test data, and 2. that it be removed.
>>> All Apache development should be in the open, this is a key ASF principle, having a committers-only test suite is basically a no-no. It's important to understand that "fair use" allows us to use copyrighted works - this is expressly permitted, it's the same legal principle as Google’s cache. There is no need to seek permission. This is what we’ve been doing with JIRA already for years, so we are already doing this - it’s fine.
>>
>> The problem is that this has all happened before. A few years ago, many files were deleted, see PDFBOX-391.
>
> That issue is about including files in the source code repo as part of the PDFBox distribution, where there is a need to put files under an Apache 2.0 compatible license. What I’m advocating is keeping a separate public repository of test files which are not a part of the PDFBox source, like we currently have on JIRA.
>
> -- John

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

On 5 Jul 2014, at 13:47, Tilman Hausherr <TH...@t-online.de> wrote:

> Am 05.07.2014 22:12, schrieb John Hewson:
>>>>> Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access them. The public PDFs that Maruan mentions may not have all the problem cases that we solved before. However, I have started working with these files and there are at least 5 recent issues that deal with them.
>>>> The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.
>>>> 
>>>> The crucial factor is that we’re only storing publicly available PDFs,  because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.
>>> Yes but many PDFs we got aren't really "public". If this svn repo is only accessible to committers, and if the publicly available build scripts won't break because of this, then it is OK.
>> Any non-public PDFs will not be permitted in our test suite, just as they shouldn't be on JIRA.
>> 
>>> Note that even if something is "publicly available", it may still be copyrighted. Other risks can be that some people upload PDFs that include personal data. One really good test PDF was apparently a loan application. I remember that the user insisted that 1. it was test data, and 2. that it be removed.
>> All Apache development should be in the open, this is a key ASF principle, having a committers-only test suite is basically a no-no. It's important to understand that "fair use" allows us to use copyrighted works - this is expressly permitted, it's the same legal principle as Google’s cache. There is no need to seek permission. This is what we’ve been doing with JIRA already for years, so we are already doing this - it’s fine.
> 
> The problem is that this has all happened before. A few years ago, many files were deleted, see PDFBOX-391.

That issue is about including files in the source code repo as part of the PDFBox distribution, where there is a need to put files under an Apache 2.0 compatible license. What I’m advocating is keeping a separate public repository of test files which are not a part of the PDFBox source, like we currently have on JIRA.

-- John

Re: Regression Testing

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 05.07.2014 22:12, schrieb John Hewson:
>>>> Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access them. The public PDFs that Maruan mentions may not have all the problem cases that we solved before. However, I have started working with these files and there are at least 5 recent issues that deal with them.
>>> The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.
>>>
>>> The crucial factor is that we’re only storing publicly available PDFs,  because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.
>> Yes but many PDFs we got aren't really "public". If this svn repo is only accessible to committers, and if the publicly available build scripts won't break because of this, then it is OK.
> Any non-public PDFs will not be permitted in our test suite, just as they shouldn't be on JIRA.
>
>> Note that even if something is "publicly available", it may still be copyrighted. Other risks can be that some people upload PDFs that include personal data. One really good test PDF was apparently a loan application. I remember that the user insisted that 1. it was test data, and 2. that it be removed.
> All Apache development should be in the open, this is a key ASF principle, having a committers-only test suite is basically a no-no. It's important to understand that "fair use" allows us to use copyrighted works - this is expressly permitted, it's the same legal principle as Google’s cache. There is no need to seek permission. This is what we’ve been doing with JIRA already for years, so we are already doing this - it’s fine.

The problem is that this has all happened before. A few years ago, many 
files were deleted, see PDFBOX-391.

Tilman

>
> Naturally, if anybody objects to their PDF being in our test suite, we can always remove it, but it shouldn’t include anything which isn’t already on the public web.
>
> -- John

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

>>> Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access them. The public PDFs that Maruan mentions may not have all the problem cases that we solved before. However, I have started working with these files and there are at least 5 recent issues that deal with them.
>> The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.
>> 
>> The crucial factor is that we’re only storing publicly available PDFs,  because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.
> 
> Yes but many PDFs we got aren't really "public". If this svn repo is only accessible to committers, and if the publicly available build scripts won't break because of this, then it is OK.

Any non-public PDFs will not be permitted in our test suite, just as they shouldn't be on JIRA.

> Note that even if something is "publicly available", it may still be copyrighted. Other risks can be that some people upload PDFs that include personal data. One really good test PDF was apparently a loan application. I remember that the user insisted that 1. it was test data, and 2. that it be removed.

All Apache development should be in the open, this is a key ASF principle, having a committers-only test suite is basically a no-no. It's important to understand that "fair use" allows us to use copyrighted works - this is expressly permitted, it's the same legal principle as Google’s cache. There is no need to seek permission. This is what we’ve been doing with JIRA already for years, so we are already doing this - it’s fine.

Naturally, if anybody objects to their PDF being in our test suite, we can always remove it, but it shouldn’t include anything which isn’t already on the public web.

-- John

Re: Regression Testing

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 04.07.2014 19:39, schrieb John Hewson:
> Hi Tilman
>
> Thanks for your thoughts, I think that your concerns are already covered by my original proposal, I’ll try to explain why and how:
>
>> Of course I agree with the need for regression tests, however it isn't easy: besides the problems of the different JDKs (I use JDK7 Windows 64 bit), there is the problem that some enhancements create slight changes in rendering that are not errors, i.e. both the "before" and the "after" files look OK by themselves. This has happened when we changed the text rendering recently, and has happened again when the clipping was improved. The cause is probably slight changes in color or in boundaries.
> If a rendering has changed then the regression test should fail. When a failure occurs the developer needs to manually inspect the differences (we could generate a visual diff which highlights what changed to make this easier) and if ok then they can replace the known-good PNG with the ones just rendered. Indeed this will be the basic workflow for working with regression tests.

That's exactly what I do now: I generate a visual diff and I make a
decision whether it is relevant or not. If I think not, then I replace
the PNG.
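
For illustration, a visual diff of this kind could be sketched as follows with plain java.awt.image.BufferedImage. This is not the tool anybody in this thread actually uses; the class name VisualDiff and the choice of opaque red as the highlight colour are assumptions.

```java
import java.awt.image.BufferedImage;

// Hypothetical visual diff: marks every pixel that differs between two renderings.
final class VisualDiff {
    static final int HIGHLIGHT_ARGB = 0xFFFF0000; // opaque red

    final BufferedImage image; // expected rendering with changed pixels highlighted
    final int changedPixels;   // number of highlighted pixels

    private VisualDiff(BufferedImage image, int changedPixels) {
        this.image = image;
        this.changedPixels = changedPixels;
    }

    static VisualDiff compare(BufferedImage expected, BufferedImage actual) {
        // Cover the union of both images so that size changes show up as differences.
        int width = Math.max(expected.getWidth(), actual.getWidth());
        int height = Math.max(expected.getHeight(), actual.getHeight());
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        int changed = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                boolean inExpected = x < expected.getWidth() && y < expected.getHeight();
                boolean inActual = x < actual.getWidth() && y < actual.getHeight();
                if (inExpected && inActual && expected.getRGB(x, y) == actual.getRGB(x, y)) {
                    out.setRGB(x, y, expected.getRGB(x, y));
                } else {
                    // Pixel differs, or exists in only one of the two images.
                    out.setRGB(x, y, HIGHLIGHT_ARGB);
                    changed++;
                }
            }
        }
        return new VisualDiff(out, changed);
    }
}
```

A regression test could fail whenever changedPixels is non-zero and write the image field to disk with javax.imageio.ImageIO.write for manual inspection. A tolerance for the slight colour or boundary changes mentioned above would need a per-pixel distance rather than exact equality.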

>
>> Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access them. The public PDFs that Maruan mentions may not have all the problem cases that we solved before. However, I have started working with these files and there are at least 5 recent issues that deal with them.
> The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.
>
> The crucial factor is that we’re only storing publicly available PDFs,  because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.

Yes, but many of the PDFs we have aren't really "public". If this SVN 
repo is only accessible to committers, and if the publicly available 
build scripts won't break because of this, then it is OK.

Note that even if something is "publicly available", it may still be 
copyrighted. Another risk is that some people upload PDFs that 
include personal data. One really good test PDF was apparently a loan 
application. I remember that the user insisted 1. that it was test data, 
and 2. that it be removed.

Tilman

> Additionally, the PDFs need to be version controlled otherwise we won’t be able to reliably recreate previous builds, so storing the files on a web server won’t be practical. Also committers will frequently be updating the renderings as bugs are fixed and we’ll need to version-control the rendered PNG files for the same reason. Finally, having committers-only files doesn’t fit well with the Apache goal of open development and would be unnecessary anyway given that all the PDFs are to be taken from public sources only.
>
> In summary, I’m proposing that we just keep doing what we’re currently doing with JIRA but we move it into its own SVN repo along with some pre-rendered PNGs.
>
>> Re preflight: the default mode should be to have the Isartor tests on. Individuals could still disable them locally, but the central build software should always use them.
> Yes - does anybody know why this isn’t the default?
>
> -- John

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

Interesting, it certainly looks pretty similar.

-- John

On 14 Jul 2014, at 23:15, Tilman Hausherr <TH...@t-online.de> wrote:

> As yet another proof that IT people always solve things in similar ways, see this interesting blog post by one of our "competitors":
> http://blog.idrsolutions.com/2013/06/save-time-test/
> 
> Tilman
> 
> On 04.07.2014 at 23:05, Petr Slabý wrote:
>> Hi,
>> following is a description of what we are doing in our company.
>> 
>> With our software, we run regression tests after each nightly build and sometimes it is a tough fight. If there is a regression, it is not so easy to find which commit caused it, because there are potentially many between the nightly builds. Then, the decision whether the change is wanted and expected is in some cases also difficult (this part might be easier with PDF where there is the "golden standard" rendering in Acrobat). If the change is expected and the new rendering "better" then one has to commit the new reference. This means that the files produced on the nightly build machine must be available somehow - it is almost impossible to produce them locally as the rendering results are slightly different with different versions of java and many other reasons. All this has to be done before the next regression test is run to avoid that new regressions are hidden by earlier ones. Our complete build with all tests runs several hours...
>> 
>> To improve this workflow, we now use the following schema in addition:
>> - there is a smaller set of regression tests which runs relatively fast
>> - these tests are triggered by each commit in formatting and rendering related projects
>> - before running the test itself, the modified project(s) are compiled locally, w/o publishing the result to maven
>> - the reference rendering files are stored in SVN
>> - if a test finds a regression, it immediately stores the new result as a new reference into SVN. This makes sure that a) the test renderings do not get lost and b) that each regression exactly points to the commit that has caused it - the one that triggered the test. The failed test creates a new issue in JIRA with a pointer to SVN to the before and after rendering and a bitmap of the differencies. The issue is then processed. If we find the change to be expected then the issue is simply closed, otherwise we take actions to fix the problem. The only annoying thing about this scheme is that, after commiting the correction, the test runs again and reports a regression because it now compares to the faulty version of the rendering.
>> 
>> Best regards,
>> Petr.
>> 
>> -----Original Message----- From: John Hewson
>> Sent: Friday, July 04, 2014 7:39 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: Regression Testing
>> 
>> Hi Tilman
>> 
>> Thanks for your thoughts, I think that your concerns are already covered by my original proposal, I’ll try to explain why and how:
>> 
>>> Of course I agree with the need for regression tests, however it isn't easy: besides the problems of the different JDKs (I use JDK7 Windows 64 bit), there is the problem that some enhancements create slight changes in rendering that are not errors, i.e. both the "before" and the "after" files look OK by itself. This has happened when we changed the text rendering recently, and has happened again when the clipping was improved. The cause are probably slight changes in color or in boundaries.
>> 
>> If a rendering has changed then the regression test should fail. When a failure occurs the developer needs to manually inspect the differences (we could generate a visual diff which highlights what changed to make this easier) and if ok then they can replace the known-good PNG with the ones just rendered. Indeed this will be the basic workflow for working with regression tests.
>> 
>>> Copyrights is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access then. The public PDFs that Maruan mentions don't possibly have all the Problem cases that we solved before. However I have started working with these files and there are at least 5 recent issues that deals with them.
>> 
>> The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.
>> 
>> The crucial factor is that we’re only storing publicly available PDFs, because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.
>> 
>> Additionally, the PDFs need to be version controlled otherwise we won’t be able to reliably recreate previous builds, so storing the files on a web server won’t be practical. Also committers will frequently be updating the renderings as bugs are fixed and we’ll need to version-control the rendered PNG files for the same reason. Finally, having committers-only files doesn’t fit well with the Apache goal of open development and would be unnecessary anyway given that all the PDFs are to be taken from public sources only.
>> 
>> In summary, I’m proposing that we just keep doing what we’re currently doing with JIRA but we move it into its own SVN repo along with some pre-rendered PNGs.
>> 
>>> Re preflight: the default mode should be to have the Isartor tests on. Individuals could still disable them locally, but the central build software should always use them.
>> 
>> Yes - does anybody know why this isn’t the default?
>> 
>> -- John
>

Re: Regression Testing

Posted by Tilman Hausherr <TH...@t-online.de>.

As yet another proof that IT people always solve things in similar ways, 
see this interesting blog post by one of our "competitors":
http://blog.idrsolutions.com/2013/06/save-time-test/

Tilman

On 04.07.2014 at 23:05, Petr Slabý wrote:
> Hi,
> following is a description of what we are doing in our company.
>
> With our software, we run regression tests after each nightly build 
> and sometimes it is a tough fight. If there is a regression, it is not 
> so easy to find which commit caused it, because there are potentially 
> many between the nightly builds. Then, the decision whether the change 
> is wanted and expected is in some cases also difficult (this part 
> might be easier with PDF where there is the "golden standard" 
> rendering in Acrobat). If the change is expected and the new rendering 
> "better" then one has to commit the new reference. This means that the 
> files produced on the nightly build machine must be available somehow 
> - it is almost impossible to produce them locally as the rendering 
> results are slightly different with different versions of java and 
> many other reasons. All this has to be done before the next regression 
> test is run to avoid that new regressions are hidden by earlier ones. 
> Our complete build with all tests runs several hours...
>
> To improve this workflow, we now use the following schema in addition:
> - there is a smaller set of regression tests which runs relatively fast
> - these tests are triggered by each commit in formatting and rendering 
> related projects
> - before running the test itself, the modified project(s) are compiled 
> locally, w/o publishing the result to maven
> - the reference rendering files are stored in SVN
> - if a test finds a regression, it immediately stores the new result 
> as a new reference into SVN. This makes sure that a) the test 
> renderings do not get lost and b) that each regression exactly points 
> to the commit that has caused it - the one that triggered the test. 
> The failed test creates a new issue in JIRA with a pointer to SVN to 
> the before and after rendering and a bitmap of the differencies. The 
> issue is then processed. If we find the change to be expected then the 
> issue is simply closed, otherwise we take actions to fix the problem. 
> The only annoying thing about this scheme is that, after commiting the 
> correction, the test runs again and reports a regression because it 
> now compares to the faulty version of the rendering.
>
> Best regards,
> Petr.
>
> -----Original Message----- From: John Hewson
> Sent: Friday, July 04, 2014 7:39 PM
> To: dev@pdfbox.apache.org
> Subject: Re: Regression Testing
>
> Hi Tilman
>
> Thanks for your thoughts, I think that your concerns are already 
> covered by my original proposal, I’ll try to explain why and how:
>
>> Of course I agree with the need for regression tests, however it 
>> isn't easy: besides the problems of the different JDKs (I use JDK7 
>> Windows 64 bit), there is the problem that some enhancements create 
>> slight changes in rendering that are not errors, i.e. both the 
>> "before" and the "after" files look OK by itself. This has happened 
>> when we changed the text rendering recently, and has happened again 
>> when the clipping was improved. The cause are probably slight changes 
>> in color or in boundaries.
>
> If a rendering has changed then the regression test should fail. When 
> a failure occurs the developer needs to manually inspect the 
> differences (we could generate a visual diff which highlights what 
> changed to make this easier) and if ok then they can replace the 
> known-good PNG with the ones just rendered. Indeed this will be the 
> basic workflow for working with regression tests.
>
>> Copyrights is a problem: I'm testing mostly with JIRA attachments 
>> that I've downloaded over the years. While uploading such files to 
>> JIRA might count as fair use, I doubt that this would still be true 
>> if they are included in a distribution. Instead, they should be 
>> stored somewhere on Apache servers where only committers and build 
>> software ("Travis", "Jenkins", ...) can access then. The public PDFs 
>> that Maruan mentions don't possibly have all the Problem cases that 
>> we solved before. However I have started working with these files and 
>> there are at least 5 recent issues that deals with them.
>
> The PDFs won’t be in a distribution. They will just happen to be 
> stored in an SVN repo but not our source code repo, in the same way 
> that the website is stored in the “cmssite” branch of SVN or indeed, 
> are on JIRA. The law doesn’t distinguish between JIRA and SVN, both 
> are publicly available via HTTP, so using SVN will simply be a 
> continuation of what we’re already doing with JIRA.
>
> The crucial factor is that we’re only storing publicly available PDFs, 
> because we have the right to do so, just like Google’s cache, and like 
> we currently do with JIRA.
>
> Additionally, the PDFs need to be version controlled otherwise we 
> won’t be able to reliably recreate previous builds, so storing the 
> files on a web server won’t be practical. Also committers will 
> frequently be updating the renderings as bugs are fixed and we’ll need 
> to version-control the rendered PNG files for the same reason. 
> Finally, having committers-only files doesn’t fit well with the Apache 
> goal of open development and would be unnecessary anyway given that 
> all the PDFs are to be taken from public sources only.
>
> In summary, I’m proposing that we just keep doing what we’re currently 
> doing with JIRA but we move it into its own SVN repo along with some 
> pre-rendered PNGs.
>
>> Re preflight: the default mode should be to have the Isartor tests 
>> on. Individuals could still disable them locally, but the central 
>> build software should always use them.
>
> Yes - does anybody know why this isn’t the default?
>
> -- John

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

On 4 Jul 2014, at 14:05, Petr Slabý <sl...@kadel.cz> wrote:

> Hi,
> following is a description of what we are doing in our company.
> 
> With our software, we run regression tests after each nightly build and sometimes it is a tough fight. If there is a regression, it is not so easy to find which commit caused it, because there are potentially many between the nightly builds.

Our Jenkins build should run after each commit, so that will simplify things a bit. Sometimes it doesn’t, but we also have TravisCI, which always does.

> Then, the decision whether the change is wanted and expected is in some cases also difficult (this part might be easier with PDF where there is the "golden standard" rendering in Acrobat).

Yes, Acrobat is the answer here for PDF. Most of the time the decision should be straightforward.

> If the change is expected and the new rendering "better" then one has to commit the new reference. This means that the files produced on the nightly build machine must be available somehow - it is almost impossible to produce them locally as the rendering results are slightly different with different versions of java and many other reasons.

Yes, I really want to get local renderings working if possible. That might include some basic restrictions on which JVMs can be used (my “blessed JVM” proposal) but also introducing some fuzziness into the image comparisons, perhaps allowing a small per-pixel error. I’m hoping that once AWT rendering of fonts is removed we’ll see more consistent rendering across JVMs.
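
The per-pixel error idea could be sketched as below. This is a minimal sketch under stated assumptions, not an existing PDFBox API: the class name `FuzzyImageCompare`, the method `matches` and the `tolerance` parameter are all made up for illustration. Two renderings count as equal when they have the same size and no colour channel of any pixel differs by more than the tolerance.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of PDFBox): compares renderings with a per-channel tolerance. */
final class FuzzyImageCompare {
    private FuzzyImageCompare() {
    }

    /**
     * Returns true if both images have the same dimensions and every channel
     * (alpha, red, green, blue) of every pixel differs by at most {@code tolerance}.
     */
    static boolean matches(BufferedImage expected, BufferedImage actual, int tolerance) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return false;
        }
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                int a = expected.getRGB(x, y);
                int b = actual.getRGB(x, y);
                // Walk over the four 8-bit channels packed into the ARGB int.
                for (int shift = 0; shift <= 24; shift += 8) {
                    int ca = (a >>> shift) & 0xFF;
                    int cb = (b >>> shift) & 0xFF;
                    if (Math.abs(ca - cb) > tolerance) {
                        return false;
                    }
                }
            }
        }
        return true;
    }
}
```

A tolerance of 0 gives an exact comparison. What value would actually hide JVM differences without hiding real regressions is an open question and would have to be found by experiment.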

> All this has to be done before the next regression test is run to avoid that new regressions are hidden by earlier ones. Our complete build with all tests runs several hours…
> 
> To improve this workflow, we now use the following schema in addition:
> - there is a smaller set of regression tests which runs relatively fast
> - these tests are triggered by each commit in formatting and rendering related projects
> - before running the test itself, the modified project(s) are compiled locally, w/o publishing the result to maven
> - the reference rendering files are stored in SVN
> - if a test finds a regression, it immediately stores the new result as a new reference into SVN. This makes sure that a) the test renderings do not get lost and b) that each regression exactly points to the commit that has caused it - the one that triggered the test. The failed test creates a new issue in JIRA with a pointer to SVN to the before and after rendering and a bitmap of the differencies. The issue is then processed. If we find the change to be expected then the issue is simply closed, otherwise we take actions to fix the problem. The only annoying thing about this scheme is that, after commiting the correction, the test runs again and reports a regression because it now compares to the faulty version of the rendering.

That sounds fairly similar to my proposal. I like the aspect of pushing the server build’s PNGs to SVN. If we can’t get robust local rendering to work, then that sounds like a good way to make the images easily available.

Thanks

-- John

Re: Regression Testing

Posted by Petr Slabý <sl...@kadel.cz>.

Hi,
following is a description of what we are doing in our company.

With our software, we run regression tests after each nightly build and 
sometimes it is a tough fight. If there is a regression, it is not so easy 
to find which commit caused it, because there are potentially many between 
the nightly builds. Then, the decision whether the change is wanted and 
expected is in some cases also difficult (this part might be easier with PDF 
where there is the "golden standard" rendering in Acrobat). If the change is 
expected and the new rendering "better" then one has to commit the new 
reference. This means that the files produced on the nightly build machine 
must be available somehow - it is almost impossible to produce them locally 
as the rendering results differ slightly between versions of Java and 
for many other reasons. All this has to be done before the next 
regression test is run, so that new regressions are not hidden by earlier 
ones. Our complete build with all tests runs for several hours...

To improve this workflow, we now use the following scheme in addition:
- there is a smaller set of regression tests which runs relatively fast
- these tests are triggered by each commit in formatting and rendering 
related projects
- before running the test itself, the modified project(s) are compiled 
locally, without publishing the result to Maven
- the reference rendering files are stored in SVN
- if a test finds a regression, it immediately stores the new result as a 
new reference in SVN. This makes sure that a) the test renderings do not 
get lost and b) each regression points exactly to the commit that 
caused it - the one that triggered the test. The failed test creates a new 
issue in JIRA with a pointer to the before and after renderings in SVN and a 
bitmap of the differences. The issue is then processed. If we find the 
change to be expected then the issue is simply closed, otherwise we take 
action to fix the problem. The only annoying thing about this scheme is 
that, after committing the correction, the test runs again and reports a 
regression because it now compares to the faulty version of the rendering.
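
The last step of the scheme (a detected regression immediately becomes the new reference) could be sketched as follows. This is only an illustration under several assumptions: the class `ReferenceUpdater` and its method are hypothetical, a plain byte comparison stands in for a real pixel comparison, and the SVN commit and the JIRA issue are left to the caller, which only gets told whether something changed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

/** Hypothetical helper: promotes a changed rendering to be the new reference. */
final class ReferenceUpdater {
    private ReferenceUpdater() {
    }

    /**
     * Compares a fresh rendering with the stored reference file.
     * If they differ (or no reference exists yet), the old reference is kept
     * as "<name>.before" and the fresh rendering replaces the reference.
     *
     * @return true if the reference was replaced, i.e. a change was detected
     */
    static boolean checkAndUpdate(Path reference, Path rendered) throws IOException {
        boolean hasReference = Files.exists(reference);
        if (hasReference
                && Arrays.equals(Files.readAllBytes(reference), Files.readAllBytes(rendered))) {
            return false; // nothing changed, keep the reference as it is
        }
        if (hasReference) {
            // Keep the previous reference so the "before" state can be inspected.
            Path before = reference.resolveSibling(reference.getFileName() + ".before");
            Files.copy(reference, before, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.copy(rendered, reference, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }
}
```

A caller would commit the updated files and open an issue whenever the method returns true. Because the new rendering becomes the baseline straight away, a later fix is reported as a change as well, which is the annoyance described in the last bullet.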

Best regards,
Petr.

-----Original Message----- 
From: John Hewson
Sent: Friday, July 04, 2014 7:39 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing

Hi Tilman

Thanks for your thoughts, I think that your concerns are already covered by 
my original proposal, I’ll try to explain why and how:

> Of course I agree with the need for regression tests, however it isn't 
> easy: besides the problems of the different JDKs (I use JDK7 Windows 64 
> bit), there is the problem that some enhancements create slight changes in 
> rendering that are not errors, i.e. both the "before" and the "after" 
> files look OK by itself. This has happened when we changed the text 
> rendering recently, and has happened again when the clipping was improved. 
> The cause are probably slight changes in color or in boundaries.

If a rendering has changed then the regression test should fail. When a 
failure occurs the developer needs to manually inspect the differences (we 
could generate a visual diff which highlights what changed to make this 
easier) and if ok then they can replace the known-good PNG with the ones 
just rendered. Indeed this will be the basic workflow for working with 
regression tests.

> Copyrights is a problem: I'm testing mostly with JIRA attachments that 
> I've downloaded over the years. While uploading such files to JIRA might 
> count as fair use, I doubt that this would still be true if they are 
> included in a distribution. Instead, they should be stored somewhere on 
> Apache servers where only committers and build software ("Travis", 
> "Jenkins", ...) can access then. The public PDFs that Maruan mentions 
> don't possibly have all the Problem cases that we solved before. However I 
> have started working with these files and there are at least 5 recent 
> issues that deals with them.

The PDFs won’t be in a distribution. They will just happen to be stored in 
an SVN repo but not our source code repo, in the same way that the website 
is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law 
doesn’t distinguish between JIRA and SVN, both are publicly available via 
HTTP, so using SVN will simply be a continuation of what we’re already doing 
with JIRA.

The crucial factor is that we’re only storing publicly available PDFs, 
because we have the right to do so, just like Google’s cache, and like we 
currently do with JIRA.

Additionally, the PDFs need to be version controlled otherwise we won’t be 
able to reliably recreate previous builds, so storing the files on a web 
server won’t be practical. Also committers will frequently be updating the 
renderings as bugs are fixed and we’ll need to version-control the rendered 
PNG files for the same reason. Finally, having committers-only files doesn’t 
fit well with the Apache goal of open development and would be unnecessary 
anyway given that all the PDFs are to be taken from public sources only.

In summary, I’m proposing that we just keep doing what we’re currently doing 
with JIRA but we move it into its own SVN repo along with some pre-rendered 
PNGs.

> Re preflight: the default mode should be to have the Isartor tests on. 
> Individuals could still disable them locally, but the central build 
> software should always use them.

Yes - does anybody know why this isn’t the default?

-- John

Re: Regression Testing

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

> Hi Tilman
> 
> Thanks for your thoughts, I think that your concerns are already covered by my original proposal, I’ll try to explain why and how:
> 
>> Of course I agree with the need for regression tests, however it isn't easy: besides the problems of the different JDKs (I use JDK7 Windows 64 bit), there is the problem that some enhancements create slight changes in rendering that are not errors, i.e. both the "before" and the "after" files look OK by itself. This has happened when we changed the text rendering recently, and has happened again when the clipping was improved. The cause are probably slight changes in color or in boundaries.
> 
> If a rendering has changed then the regression test should fail. When a failure occurs the developer needs to manually inspect the differences (we could generate a visual diff which highlights what changed to make this easier) and if ok then they can replace the known-good PNG with the ones just rendered. Indeed this will be the basic workflow for working with regression tests.
> 

I think this is the only way to handle that situation. The same applies to text extraction etc.: if an improvement changes the results, the “base” needs to be reset by adding the new image, text etc. as the validation source.

A basic testbed could also run against other JDKs, e.g. without validating against the known-good files, so we pick up potential issues early. That should be easy with Jenkins, and the results would be treated as a hint.
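
How a test run might decide between strict validation and hint-only mode could be sketched like this. The class `BlessedJdk`, the version prefix and the mode names are assumptions made for illustration; the prefix "1.7." simply follows the JDK 7 preference from the original proposal, and nothing like this exists in the build today.

```java
/** Hypothetical helper: decides whether rendering differences should fail the build. */
final class BlessedJdk {
    // Assumption for illustration: known-good renderings were produced on a Java 7 runtime.
    static final String BLESSED_VERSION_PREFIX = "1.7.";

    private BlessedJdk() {
    }

    /** Returns true if the given "java.version" value belongs to the blessed JDK line. */
    static boolean isBlessed(String javaVersion) {
        return javaVersion != null && javaVersion.startsWith(BLESSED_VERSION_PREFIX);
    }

    /**
     * "strict" means differences from the known-good files are failures;
     * "hint" means they are only reported.
     */
    static String mode() {
        return isBlessed(System.getProperty("java.version")) ? "strict" : "hint";
    }
}
```

A regression test could call `BlessedJdk.mode()` once at start-up and, in "hint" mode, log differences instead of failing.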


>> Copyrights is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access then. The public PDFs that Maruan mentions don't possibly have all the Problem cases that we solved before. However I have started working with these files and there are at least 5 recent issues that deals with them.
> 
> The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t distinguish between JIRA and SVN, both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.
> 
> The crucial factor is that we’re only storing publicly available PDFs,  because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.
> 
> Additionally, the PDFs need to be version controlled otherwise we won’t be able to reliably recreate previous builds, so storing the files on a web server won’t be practical. Also committers will frequently be updating the renderings as bugs are fixed and we’ll need to version-control the rendered PNG files for the same reason. Finally, having committers-only files doesn’t fit well with the Apache goal of open development and would be unnecessary anyway given that all the PDFs are to be taken from public sources only.
> 
> In summary, I’m proposing that we just keep doing what we’re currently doing with JIRA but we move it into its own SVN repo along with some pre-rendered PNGs.

In addition, if we put in workarounds to handle nonconforming PDFs, there should be a unit test added to make sure that we don’t break them, e.g. when rewriting the parser.

> 
>> Re preflight: the default mode should be to have the Isartor tests on. Individuals could still disable them locally, but the central build software should always use them.
> 
> Yes - does anybody know why this isn’t the default?
> 

No.

+1 for enabling it by default


> -- John

Re: Regression Testing

Posted by John Hewson <jo...@jahewson.com>.

Hi Tilman

Thanks for your thoughts. I think that your concerns are already covered by my original proposal; I’ll try to explain why and how:

> Of course I agree with the need for regression tests, however it isn't easy: besides the problems of the different JDKs (I use JDK7 Windows 64 bit), there is the problem that some enhancements create slight changes in rendering that are not errors, i.e. both the "before" and the "after" files look OK by itself. This has happened when we changed the text rendering recently, and has happened again when the clipping was improved. The cause are probably slight changes in color or in boundaries.

If a rendering has changed then the regression test should fail. When a failure occurs the developer needs to manually inspect the differences (we could generate a visual diff which highlights what changed to make this easier) and, if they are OK, replace the known-good PNG with the one just rendered. Indeed this will be the basic workflow for working with regression tests.

> Copyrights is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access then. The public PDFs that Maruan mentions don't possibly have all the Problem cases that we solved before. However I have started working with these files and there are at least 5 recent issues that deals with them.

The PDFs won’t be in a distribution. They will just happen to be stored in an SVN repo but not our source code repo, in the same way that the website is stored in the “cmssite” branch of SVN or, indeed, that attachments are stored on JIRA. The law doesn’t distinguish between JIRA and SVN: both are publicly available via HTTP, so using SVN will simply be a continuation of what we’re already doing with JIRA.

The crucial factor is that we’re only storing publicly available PDFs, because we have the right to do so, just like Google’s cache, and like we currently do with JIRA.

Additionally, the PDFs need to be version controlled, otherwise we won’t be able to reliably recreate previous builds, so storing the files on a web server won’t be practical. Also, committers will frequently be updating the renderings as bugs are fixed, and we’ll need to version-control the rendered PNG files for the same reason. Finally, having committers-only files doesn’t fit well with the Apache goal of open development and would be unnecessary anyway given that all the PDFs are to be taken from public sources only.

In summary, I’m proposing that we just keep doing what we’re currently doing with JIRA but we move it into its own SVN repo along with some pre-rendered PNGs.

> Re preflight: the default mode should be to have the Isartor tests on. Individuals could still disable them locally, but the central build software should always use them.

Yes - does anybody know why this isn’t the default?

-- John

Re: Regression Testing

Posted by Tilman Hausherr <TH...@t-online.de>.

Of course I agree with the need for regression tests, however it isn't 
easy: besides the problems of the different JDKs (I use JDK7 Windows 64 
bit), there is the problem that some enhancements create slight changes 
in rendering that are not errors, i.e. both the "before" and the "after" 
files look OK by themselves. This has happened when we changed the text 
rendering recently, and has happened again when the clipping was 
improved. The cause is probably slight changes in color or in boundaries.

Copyright is a problem: I'm testing mostly with JIRA attachments that 
I've downloaded over the years. While uploading such files to JIRA might 
count as fair use, I doubt that this would still be true if they are 
included in a distribution. Instead, they should be stored somewhere on 
Apache servers where only committers and build software ("Travis", 
"Jenkins", ...) can access them. The public PDFs that Maruan mentions 
possibly don't have all the problem cases that we solved before. However 
I have started working with these files and there are at least 5 recent 
issues that deal with them.

I'm using an improved version of the TestPDFToImage class and I will 
commit it within a few days, but I must clean it up first.

Re preflight: the default mode should be to have the Isartor tests on. 
Individuals could still disable them locally, but the central build 
software should always use them.

Tilman


On 04.07.2014 at 08:43, Maruan Sahyoun wrote:
> Hi John,
>
> thanks for bringing this up. This is a very important topic which was also discussed at the PDFDays in Germany.
>
>   # Tests #
> In addition to rendering we shall be covering metadata and text extraction as well as PDF/A validation.
>
> # Testfiles #
> Recently there were a number of test sets made available which we can use. http://digitalcorpora.org/corpora/files , https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
> For PDF/A validation there is the Isartor test suite http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions apply there.
> In addition we can put additional files into our own repository as you suggested.
> So there is no shortage of test files.
>
> TIKA-1300/TIKA-1302 has a discussion around the same topic together with some development for an infrastructure (VM, Jenkins …). IMHO we should join forces with them.
>
> BR
>
> Maruan
>
>
> On 04.07.2014 at 02:16, John Hewson <jo...@jahewson.com> wrote:
>
>> Hi All
>>
>> I’ve been thinking about regression testing recently and how we can improve
>> our tests for rendering. There are currently two problems:
>>
>> 1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
>>     (I suspect that AWT fonts are a big part of this, so the problem might get a lot better
>>     soon once we render all fonts ourselves).
>>
>> 2) Most PDF test files we have are not under an Apache-friendly license, so
>>     we can’t put the test files into the trunk SVN.
>>
>> It seems that some of you have your own collections of test PDF files which you are
>> running regression tests on: that’s great but it would be much better if we had a
>> central repository of test files and sample renderings.
>>
>> I’d like to suggest the following solutions to the above issues:
>>
>> 1) We should choose a “blessed” JDK which will be used to perform the renderings;
>>     this should be whatever is a convenient and sensible default for committers. (My
>>     preference would be for Oracle’s JDK 7 because JDK 6 is deprecated and has known
>>     rendering bugs). We should make sure that Jenkins runs tests using the “blessed”
>>     JDK.
>>
>>    The regression test can then check to see if it is running on the “blessed” JDK and
>>    if not then the tests can be skipped and we can warn the user.
>>
>> 2) We should create a new “regression” branch in SVN which contains only PDF files
>>     for testing and PNG images which contain known-good renderings created using the
>>     “blessed” JDK. This branch would not be part of the source of PDFBox but will still
>>     allow us to version control the test PDFs (it also simplifies the workflow for adding
>>     new test PDFs and new known-good renderings: simply do an "svn add").
>>
>>     As far as copyright and licensing is concerned we can put any PDF files which are
>>     available publicly on the web into this branch without too much worry.
>>
>> What does everybody think?
>>
>> -- John
>>
>

Re: Regression Testing

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi John,

thanks for bringing this up. This is a very important topic which was also discussed at the PDFDays in Germany.

 # Tests #
In addition to rendering we should cover metadata and text extraction as well as PDF/A validation.

# Testfiles # 
Recently there were a number of test sets made available which we can use. http://digitalcorpora.org/corpora/files , https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
For PDF/A validation there is the Isartor test suite http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions apply there.
In addition we can put additional files into our own repository as you suggested.
So there is no shortage of test files.

TIKA-1300/TIKA-1302 has a discussion around the same topic together with some development for an infrastructure (VM, Jenkins …). IMHO we should join forces with them.

BR

Maruan


On 04.07.2014 at 02:16, John Hewson <jo...@jahewson.com> wrote:

> Hi All
> 
> I’ve been thinking about regression testing recently and how we can improve
> our tests for rendering. There are currently two problems:
> 
> 1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
>    (I suspect that AWT fonts are a big part of this, so the problem might get a lot better
>    soon once we render all fonts ourselves).
> 
> 2) Most PDF test files we have are not under an Apache-friendly license, so
>    we can’t put the test files into the trunk SVN.
> 
> It seems that some of you have your own collections of test PDF files which you are
> running regression tests on: that’s great but it would be much better if we had a
> central repository of test files and sample renderings.
> 
> I’d like to suggest the following solutions to the above issues:
> 
> 1) We should choose a “blessed” JDK which will be used to perform the renderings;
>    this should be whatever is a convenient and sensible default for committers. (My
>    preference would be for Oracle’s JDK 7 because JDK 6 is deprecated and has known
>    rendering bugs). We should make sure that Jenkins runs tests using the “blessed”
>    JDK.
> 
>   The regression test can then check to see if it is running on the “blessed” JDK and
>   if not then the tests can be skipped and we can warn the user.
> 
> 2) We should create a new “regression” branch in SVN which contains only PDF files
>    for testing and PNG images which contain known-good renderings created using the
>    “blessed” JDK. This branch would not be part of the source of PDFBox but will still
>    allow us to version control the test PDFs (it also simplifies the workflow for adding
>    new test PDFs and new known-good renderings: simply do an "svn add").
> 
>    As far as copyright and licensing is concerned we can put any PDF files which are
>    available publicly on the web into this branch without too much worry.
> 
> What does everybody think?
> 
> -- John
>
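
The check for the “blessed” JDK proposed in the quoted mail could look 
roughly like the sketch below. The class name and the matching rules are 
assumptions for illustration; it relies only on the standard "java.vendor" 
and "java.version" system properties, and it treats "Oracle JDK 7" as a 
vendor string starting with "Oracle" plus a version starting with "1.7".

```java
public class BlessedJdkCheck {
    // Returns true when the given vendor/version strings look like Oracle JDK 7.
    // Note: OpenJDK 7 builds can also report an Oracle vendor string, so this
    // test alone cannot tell the two apart; a real check may need more properties.
    static boolean isBlessedJdk(String vendor, String version) {
        return vendor != null
                && version != null
                && vendor.startsWith("Oracle")
                && version.startsWith("1.7");
    }

    public static void main(String[] args) {
        String vendor = System.getProperty("java.vendor");
        String version = System.getProperty("java.version");
        if (isBlessedJdk(vendor, version)) {
            System.out.println("Running rendering regression tests.");
        } else {
            // Skip instead of failing, and tell the user why.
            System.out.println("WARNING: rendering tests skipped, JDK is "
                    + vendor + " " + version);
        }
    }
}
```

In a JUnit 4 test the same boolean could be passed to Assume.assumeTrue so 
that the test is reported as skipped rather than failed.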