Posted to dev@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2015/04/14 23:47:14 UTC

Automated testing with public data

Hi all,

this time the progress with the testing for 1.6.0 is rather slow. Most
tests are done now and I believe we are in good shape to build RC3.
Anyway, it would have been better to be at that stage a month ago.

To improve the situation in the future I would like to propose automating
all tests which can be run against publicly available data. These tests
all follow the same pattern: they train a component on a corpus and
afterwards evaluate against it. If the results match the results of the
previous release, we assume the code doesn't contain any regressions. In
some cases we have changes which influence the performance (e.g. bug
fixes); in that case we adjust the expected performance score and
carefully verify that the particular change caused it.

We sometimes have changes which shouldn't influence the performance of a
component but still do, due to mistakes. These are the regressions we need
to identify during testing.

The big issue we have with testing against public data is that we usually
can't include the data in the OpenNLP release because of its license.
Today we just do all the work manually by training on a corpus and
afterwards running the built-in evaluation against the model.

I suggest we write JUnit tests which do this when the user has the right
corpus for the test. Those tests will be disabled by default and can be
run by providing the -Dtest property and the location of the data
directory.

For example:
mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data

The tests will do all the work and fail if the expected results don't match.
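
A minimal sketch of how such a gated test could look (the class name,
corpus layout, and expected score below are made up for illustration;
trainAndEvaluate() stands in for the component-specific OpenNLP train
and evaluate calls):

import static org.junit.Assert.assertEquals;
import static org.junit.Assume.assumeNotNull;
import static org.junit.Assume.assumeTrue;

import java.io.File;
import org.junit.Test;

public class Conll06Test {

    @Test
    public void evaluateAgainstConll06() {
        // Resolved from -DOPENNLP_CORPUS_DIR; without it the test is
        // skipped, not failed, so a plain "mvn test" stays green.
        String corpusRoot = System.getProperty("OPENNLP_CORPUS_DIR");
        assumeNotNull(corpusRoot);

        // Hypothetical layout below the data directory.
        File corpus = new File(corpusRoot, "conll06/train.conll");
        assumeTrue(corpus.isFile());

        // Placeholder for training the component on the corpus and
        // running the built-in evaluator against the resulting model.
        double fMeasure = trainAndEvaluate(corpus);

        // Expected score from the previous release; only adjusted when a
        // change is known and verified to affect performance.
        assertEquals(0.9123, fMeasure, 0.0001);
    }

    private double trainAndEvaluate(File corpus) {
        throw new UnsupportedOperationException("component-specific");
    }
}

Because missing data is an assumption violation rather than an assertion
failure, JUnit reports such a test as skipped, so the tests can stay
enabled in the code base but remain inert for users without the corpus.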

Automating those tests has the great advantage that we can run them much
more frequently during the development phase and hopefully identify bugs
before we even start with the release process.
Additionally we might be able to run these tests on our build server.

Any opinions?

Jörn

Re: Automated testing with public data

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
Yes, I agree, it's about build time. Nevertheless, even for build time
fewer deps is better, especially if it can be done. There is no need to
include this step in every build; a separate goal might work just fine, or
a check based on the presence of files before test execution.

On 30 April 2015 at 19:36, Richard Eckart de Castilho <
richard.eckart@gmail.com> wrote:

> On 30.04.2015, at 16:51, Aliaksandr Autayeu <al...@autayeu.com>
> wrote:
>
> > Well, ant is still an extra dependency, though better than wget.
> > Something like Wagon in Maven?
>
> I guess we are talking about build-time here - so it wouldn't be a test or
> runtime dependency and not even be referred to in actual code.
>
> -- Richard
>

Re: Automated testing with public data

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
On 30.04.2015, at 16:51, Aliaksandr Autayeu <al...@autayeu.com> wrote:

> Well, ant is still an extra dependency, though better than wget. Something
> like Wagon in Maven?

I guess we are talking about build-time here - so it wouldn't be a test or 
runtime dependency and not even be referred to in actual code.

-- Richard

Re: Automated testing with public data

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
Well, ant is still an extra dependency, though better than wget. Something
like Wagon in Maven?

On 30 April 2015 at 11:02, Richard Eckart de Castilho <
richard.eckart@gmail.com> wrote:

> Since OpenNLP is cross-platform/Java-based, something that works
> cross-platform/Java-based might be better than wget.
> I'm using Ant scripts for such tasks.
>
> -- Richard
>
> On 29.04.2015, at 17:11, William Colen <wi...@gmail.com> wrote:
>
> > +1
> >
> > The script would also be great for documentation.
> >
> > 2015-04-29 11:15 GMT-03:00 Joern Kottmann <ko...@gmail.com>:
> >
> >> Or we just make a download script which bootstraps the user's corpus
> >> folder.
> >>
> >> Could be a couple of wget lines or so ...
> >>
> >>
> >> Jörn
>
>

Re: Automated testing with public data

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
Since OpenNLP is cross-platform/Java-based, something that works
cross-platform/Java-based might be better than wget. 
I'm using Ant scripts for such tasks.

-- Richard

On 29.04.2015, at 17:11, William Colen <wi...@gmail.com> wrote:

> +1
> 
> The script would also be great for documentation.
> 
> 2015-04-29 11:15 GMT-03:00 Joern Kottmann <ko...@gmail.com>:
> 
>> Or we just make a download script which bootstraps the user's corpus folder.
>> 
>> Could be a couple of wget lines or so ...
>> 
>> 
>> Jörn


Re: Automated testing with public data

Posted by William Colen <wi...@gmail.com>.
+1

The script would also be great for documentation.

2015-04-29 11:15 GMT-03:00 Joern Kottmann <ko...@gmail.com>:

> Or we just make a download script which bootstraps the user's corpus folder.
>
> Could be a couple of wget lines or so ...
>
>
> Jörn
>
> On Wed, Apr 29, 2015 at 6:17 AM, William Colen <wi...@gmail.com>
> wrote:
>
> > Automating the download would be fine as long as we cache it, as Richard
> > suggested. Maybe it could be done by a script to prepare the environment,
> > and not be part of the unit test itself.
> > Anyway, it would be a good idea to save the data somewhere because we
> > never know if some of the websites will become unavailable in the future.
> >
> >
> > 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <
> > richard.eckart@gmail.com>:
> >
> > > On 15.04.2015, at 10:23, Joern Kottmann <ko...@gmail.com> wrote:
> > >
> > > > By publicly accessible data I mean a corpus you can somehow acquire,
> > > > as opposed to the data you create on your own for a project.
> > > >
> > > > All the corpora we support in the formats package are publicly
> > > > accessible. Maybe some you have to buy and for others you just have
> > > > to sign some agreement.
> > > >
> > > > A very interesting corpus for testing (and training models on) is
> > > > OntoNotes.
> > > >
> > > > Here is a link to the LDC entry:
> > > > https://catalog.ldc.upenn.edu/LDC2011T03
> > > >
> > > > You can get it for free (or for a small distribution fee) but you
> > > > can't just download it.
> > > > It would be great if the ASF could acquire this data set so we can
> > > > share it among the committers.
> > > >
> > > > Is that what you mean by proprietary data?
> > >
> > > Yes, that is what I mean.
> > >
> > > E.g. the TIGER corpus requires clicking through some pages and forms
> > > to reach a download page, but in principle it appears as if the corpus
> > > were simply downloadable by a deep-link URL. The license terms state
> > > that the corpus must not be redistributed.
> > >
> > > Some tools are also publicly accessible and downloadable but not
> > > redistributable. For example, anybody can download TreeTagger and its
> > > models, but only from the original homepage. It is not permitted to
> > > redistribute it, i.e. to publish it to a repository or offer it on an
> > > alternative homepage.
> > >
> > > So there is a (small) class of resources between being redistributable
> > > and proprietary (for a fee), namely being in principle publicly
> > > accessible (for free) but not redistributable.
> > >
> > > Cheers,
> > >
> > > -- Richard
> >
>

Re: Automated testing with public data

Posted by Joern Kottmann <ko...@gmail.com>.
Or we just make a download script which bootstraps the user's corpus folder.

Could be a couple of wget lines or so ...
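
In the cross-platform spirit Richard suggested, the same could be done
with a few lines of Java instead of wget. A minimal sketch (the URL and
target path below are placeholders, not real corpus locations):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class CorpusBootstrap {

    // Placeholder list; the real one would name the freely downloadable
    // corpora. Corpora behind license agreements would still have to be
    // placed into the folder by hand.
    private static final String[][] CORPORA = {
        {"http://example.org/corpora/conll06-train.conll", "conll06/train.conll"},
    };

    public static void main(String[] args) throws Exception {
        Path root = Paths.get(
            System.getProperty("OPENNLP_CORPUS_DIR", "opennlp-data"));
        for (String[] entry : CORPORA) {
            Path target = root.resolve(entry[1]);
            if (Files.exists(target)) {
                continue; // already bootstrapped on an earlier run
            }
            Files.createDirectories(target.getParent());
            try (InputStream in = new URL(entry[0]).openStream()) {
                Files.copy(in, target);
            }
        }
    }
}

Run once before the tests, it fills the corpus folder that the gated
tests expect.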


Jörn

On Wed, Apr 29, 2015 at 6:17 AM, William Colen <wi...@gmail.com>
wrote:

> Automating the download would be fine as long as we cache it, as Richard
> suggested. Maybe it could be done by a script to prepare the environment,
> and not be part of the unit test itself.
> Anyway, it would be a good idea to save the data somewhere because we never
> know if some of the websites will become unavailable in the future.
>
>
> 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <
> richard.eckart@gmail.com>:
>
> > On 15.04.2015, at 10:23, Joern Kottmann <ko...@gmail.com> wrote:
> >
> > > By publicly accessible data I mean a corpus you can somehow acquire,
> > > as opposed to the data you create on your own for a project.
> > >
> > > All the corpora we support in the formats package are publicly
> > > accessible. Maybe some you have to buy and for others you just have to
> > > sign some agreement.
> > >
> > > A very interesting corpus for testing (and training models on) is
> > > OntoNotes.
> > >
> > > Here is a link to the LDC entry:
> > > https://catalog.ldc.upenn.edu/LDC2011T03
> > >
> > > You can get it for free (or for a small distribution fee) but you can't
> > > just download it.
> > > It would be great if the ASF could acquire this data set so we can
> > > share it among the committers.
> > >
> > > Is that what you mean by proprietary data?
> >
> > Yes, that is what I mean.
> >
> > E.g. the TIGER corpus requires clicking through some pages and forms to
> > reach a download page, but in principle it appears as if the corpus were
> > simply downloadable by a deep-link URL. The license terms state that the
> > corpus must not be redistributed.
> >
> > Some tools are also publicly accessible and downloadable but not
> > redistributable. For example, anybody can download TreeTagger and its
> > models, but only from the original homepage. It is not permitted to
> > redistribute it, i.e. to publish it to a repository or offer it on an
> > alternative homepage.
> >
> > So there is a (small) class of resources between being redistributable
> > and proprietary (for a fee), namely being in principle publicly
> > accessible (for free) but not redistributable.
> >
> > Cheers,
> >
> > -- Richard
>

Re: Automated testing with public data

Posted by William Colen <wi...@gmail.com>.
Automating the download would be fine as long as we cache it, as Richard
suggested. Maybe it could be done by a script to prepare the environment,
and not be part of the unit test itself.
Anyway, it would be a good idea to save the data somewhere because we never
know if some of the websites will become unavailable in the future.


2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <
richard.eckart@gmail.com>:

> On 15.04.2015, at 10:23, Joern Kottmann <ko...@gmail.com> wrote:
>
> > By publicly accessible data I mean a corpus you can somehow acquire,
> > as opposed to the data you create on your own for a project.
> >
> > All the corpora we support in the formats package are publicly
> > accessible. Maybe some you have to buy and for others you just have to
> > sign some agreement.
> >
> > A very interesting corpus for testing (and training models on) is
> > OntoNotes.
> >
> > Here is a link to the LDC entry:
> > https://catalog.ldc.upenn.edu/LDC2011T03
> >
> > You can get it for free (or for a small distribution fee) but you can't
> > just download it.
> > It would be great if the ASF could acquire this data set so we can share
> > it among the committers.
> >
> > Is that what you mean by proprietary data?
>
> Yes, that is what I mean.
>
> E.g. the TIGER corpus requires clicking through some pages and forms to
> reach a download page, but in principle it appears as if the corpus were
> simply downloadable by a deep-link URL. The license terms state that the
> corpus must not be redistributed.
>
> Some tools are also publicly accessible and downloadable but not
> redistributable. For example, anybody can download TreeTagger and its
> models, but only from the original homepage. It is not permitted to
> redistribute it, i.e. to publish it to a repository or offer it on an
> alternative homepage.
>
> So there is a (small) class of resources between being redistributable and
> proprietary (for a fee), namely being in principle publicly accessible (for
> free) but not redistributable.
>
> Cheers,
>
> -- Richard

Re: Automated testing with public data

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
On 15.04.2015, at 10:23, Joern Kottmann <ko...@gmail.com> wrote:

> By publicly accessible data I mean a corpus you can somehow acquire,
> as opposed to the data you create on your own for a project.
> 
> All the corpora we support in the formats package are publicly accessible.
> Maybe some you have to buy and for others you just have to sign some
> agreement.
> 
> A very interesting corpus for testing (and training models on) is OntoNotes.
> 
> Here is a link to the LDC entry:
> https://catalog.ldc.upenn.edu/LDC2011T03
> 
> You can get it for free (or for a small distribution fee) but you can't
> just download it.
> It would be great if the ASF could acquire this data set so we can share it
> among the committers.
> 
> Is that what you mean by proprietary data?

Yes, that is what I mean.

E.g. the TIGER corpus requires clicking through some pages and forms to reach a download page, but in principle it appears as if the corpus were simply downloadable by a deep-link URL. The license terms state that the corpus must not be redistributed.

Some tools are also publicly accessible and downloadable but not redistributable. For example, anybody can download TreeTagger and its models, but only from the original homepage. It is not permitted to redistribute it, i.e. to publish it to a repository or offer it on an alternative homepage.

So there is a (small) class of resources between being redistributable and proprietary (for a fee), namely being in principle publicly accessible (for free) but not redistributable.

Cheers,

-- Richard

Re: Automated testing with public data

Posted by Joern Kottmann <ko...@gmail.com>.
By publicly accessible data I mean a corpus you can somehow acquire,
as opposed to the data you create on your own for a project.

All the corpora we support in the formats package are publicly accessible.
Maybe some you have to buy and for others you just have to sign some
agreement.

A very interesting corpus for testing (and training models on) is OntoNotes.

Here is a link to the LDC entry:
https://catalog.ldc.upenn.edu/LDC2011T03

You can get it for free (or for a small distribution fee) but you can't
just download it.
It would be great if the ASF could acquire this data set so we can share it
among the committers.

Is that what you mean by proprietary data?

Jörn

On Wed, Apr 15, 2015 at 10:05 AM, Richard Eckart de Castilho <
richard.eckart@gmail.com> wrote:

> On 15.04.2015, at 09:39, Joern Kottmann <ko...@gmail.com> wrote:
>
> > Some data sets are publicly available but protected by copyright and
> > just can't be redistributed in any way. For this data we could get/buy a
> > license and maybe restrict access to it among the committers.
>
> That's what I'm saying ;) If you automatically download the data to a
> personal
> workstation during tests, you do not redistribute the data.
>
> For Jenkins builds, I just checked the Apache Jenkins and the "Workspace"
> does
> not seem to be publicly accessible. So stuff downloaded during tests there
> is
> also not made publicly available (redistributed) - it is only accessible to
> Apache developers that are logged in.
>
> IMHO only truly proprietary data that is not publicly accessible should be
> a problem, no?
>
> -- Richard

Re: Automated testing with public data

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
On 15.04.2015, at 09:39, Joern Kottmann <ko...@gmail.com> wrote:

> Some data sets are publicly available but protected by copyright and just
> can't be redistributed in any way. For this data we could get/buy a license
> and maybe restrict access to it among the committers.

That's what I'm saying ;) If you automatically download the data to a personal
workstation during tests, you do not redistribute the data.

For Jenkins builds, I just checked the Apache Jenkins and the "Workspace" does
not seem to be publicly accessible. So stuff downloaded during tests there is
also not made publicly available (redistributed) - it is only accessible to
Apache developers that are logged in. 

IMHO only truly proprietary data that is not publicly accessible should be
a problem, no?

-- Richard

Re: Automated testing with public data

Posted by Joern Kottmann <ko...@gmail.com>.
Thanks for sharing. This probably applies to a subset of the data we can
use. And all data we can somehow get into Subversion we should definitely
check in.

Some data sets are publicly available but protected by copyright and just
can't be redistributed in any way. For this data we could get/buy a license
and maybe restrict access to it among the committers.

Jörn


On Tue, Apr 14, 2015 at 11:53 PM, Richard Eckart de Castilho <
richard.eckart@gmail.com> wrote:

> If the unit tests automatically download publicly accessible test data,
> run the tests, and optionally delete the data afterwards, then the
> test data does not have to be redistributed. Instead of deleting,
> it might be even a good idea to cache the data to a) avoid hammering
> the remote source and b) still have a local copy in case the source
> fails.
>
> I believe several cases have been discussed on the legal mailing list
> where non-essential or test-only resources that were not part of the
> release could be under licenses that would not be deemed compatible
> with the Apache license. My understanding is that the release needs
> to be untainted and the downstream users must be able to trust that
> they incur no license restrictions beyond the ASL.
>
> Cheers,
>
> -- Richard
>
> On 14.04.2015, at 23:47, Joern Kottmann <ko...@gmail.com> wrote:
>
> > Hi all,
> >
> > this time the progress with the testing for 1.6.0 is rather slow. Most
> > tests are done now and I believe we are in good shape to build RC3.
> > Anyway, it would have been better to be at that stage a month ago.
> >
> > To improve the situation in the future I would like to propose
> > automating all tests which can be run against publicly available data.
> > These tests all follow the same pattern: they train a component on a
> > corpus and afterwards evaluate against it. If the results match the
> > results of the previous release, we assume the code doesn't contain any
> > regressions. In some cases we have changes which influence the
> > performance (e.g. bug fixes); in that case we adjust the expected
> > performance score and carefully verify that the particular change
> > caused it.
> >
> > We sometimes have changes which shouldn't influence the performance of
> > a component but still do, due to mistakes. These are the regressions we
> > need to identify during testing.
> >
> > The big issue we have with testing against public data is that we
> > usually can't include the data in the OpenNLP release because of its
> > license. Today we just do all the work manually by training on a corpus
> > and afterwards running the built-in evaluation against the model.
> >
> > I suggest we write JUnit tests which do this when the user has the
> > right corpus for the test. Those tests will be disabled by default and
> > can be run by providing the -Dtest property and the location of the data
> > directory.
> >
> > For example:
> > mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data
> >
> > The tests will do all the work and fail if the expected results don't
> > match.
> >
> > Automating those tests has the great advantage that we can run them much
> > more frequently during the development phase and hopefully identify bugs
> > before we even start with the release process.
> > Additionally we might be able to run these tests on our build server.
> >
> > Any opinions?
> >
> > Jörn
>
>

Re: Automated testing with public data

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
If the unit tests automatically download publicly accessible test data,
run the tests, and optionally delete the data afterwards, then the
test data does not have to be redistributed. Instead of deleting,
it might be even a good idea to cache the data to a) avoid hammering
the remote source and b) still have a local copy in case the source
fails.
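
A rough sketch of how a test could wire that in, assuming JUnit 4 (the
URL and file name are made up; with no cached copy and an unreachable
source the tests are skipped instead of failing the build):

import static org.junit.Assume.assumeNoException;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import org.junit.BeforeClass;
import org.junit.Test;

public class CachedCorpusTest {

    private static File corpus;

    @BeforeClass
    public static void fetchCorpus() {
        corpus = new File(
            System.getProperty("OPENNLP_CORPUS_DIR", "opennlp-data"),
            "sample/train.txt"); // hypothetical corpus file
        if (corpus.isFile()) {
            return; // cached copy exists, don't hammer the remote source
        }
        try {
            corpus.getParentFile().mkdirs();
            try (InputStream in =
                new URL("http://example.org/sample-train.txt").openStream()) {
                Files.copy(in, corpus.toPath());
            }
        } catch (IOException e) {
            // No cache and the source is down: skip all tests in this
            // class rather than failing the build.
            assumeNoException(e);
        }
    }

    @Test
    public void evaluate() {
        // train and evaluate against 'corpus' as in the gated tests
    }
}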

I believe several cases have been discussed on the legal mailing list
where non-essential or test-only resources that were not part of the
release could be under licenses that would not be deemed compatible
with the Apache license. My understanding is that the release needs
to be untainted and the downstream users must be able to trust that
they incur no license restrictions beyond the ASL. 

Cheers,

-- Richard

On 14.04.2015, at 23:47, Joern Kottmann <ko...@gmail.com> wrote:

> Hi all,
> 
> this time the progress with the testing for 1.6.0 is rather slow. Most
> tests are done now and I believe we are in good shape to build RC3.
> Anyway, it would have been better to be at that stage a month ago.
> 
> To improve the situation in the future I would like to propose automating
> all tests which can be run against publicly available data. These tests
> all follow the same pattern: they train a component on a corpus and
> afterwards evaluate against it. If the results match the results of the
> previous release, we assume the code doesn't contain any regressions. In
> some cases we have changes which influence the performance (e.g. bug
> fixes); in that case we adjust the expected performance score and
> carefully verify that the particular change caused it.
> 
> We sometimes have changes which shouldn't influence the performance of a
> component but still do, due to mistakes. These are the regressions we need
> to identify during testing.
> 
> The big issue we have with testing against public data is that we usually
> can't include the data in the OpenNLP release because of its license.
> Today we just do all the work manually by training on a corpus and
> afterwards running the built-in evaluation against the model.
> 
> I suggest we write JUnit tests which do this when the user has the right
> corpus for the test. Those tests will be disabled by default and can be
> run by providing the -Dtest property and the location of the data
> directory.
> 
> For example:
> mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data
> 
> The tests will do all the work and fail if the expected results don't match.
> 
> Automating those tests has the great advantage that we can run them much
> more frequently during the development phase and hopefully identify bugs
> before we even start with the release process.
> Additionally we might be able to run these tests on our build server.
> 
> Any opinions?
> 
> Jörn