Posted to legal-discuss@apache.org by Benson Margulies <bi...@gmail.com> on 2010/11/04 17:07:12 UTC

Fair-use data in svn

I write code in some areas where 'real world' textual data is fuel.
It's test cases. It's training corpora. It cannot be replaced by
constructed, test-tube, text that could be created under the AL or
some other 'class A' license.

I'd like to contribute some of that data here at ASF. In some cases,
that would require checking in test case data that consists of (for
example) miscellaneous web pages grabbed with wget. In other cases, it
might consist of larger collections of text derived from such pages.

I would like to discover that this is acceptable, perhaps with some
caveats and requirements for NOTICE.

---------------------------------------------------------------------
To unsubscribe, e-mail: legal-discuss-unsubscribe@apache.org
For additional commands, e-mail: legal-discuss-help@apache.org

RE: Fair-use data in svn

Posted by Lawrence Rosen <lr...@rosenlaw.com>.

> Here's a concrete example. Let's say that the job at hand is to
> extract useful text from webpages. You need to test this on the news
> sites that people want to work with, like CNN. The inventory of
> 'Commons' pages is not representative.

Perhaps you can obtain a special one-time license from CNN?

What's a "boilerpipe"?

I've probably taken this as far as I am qualified. Once I said "you can't make unlicensed copies of anyone's copyrighted HTML files," there isn't much more legal advice I can offer.

/Larry

Re: Fair-use data in svn

Posted by Upayavira <uv...@odoko.co.uk>.

I think this might be a way to go. Ask here for folks that can
put you in touch with (hopefully) representative copyright
holders, and get grants for licenses to old content.

Now, if you plan to include it in SVN, use it for tests, and not
include it in releases, then the license grants need not be
AL2.0.

Upayavira

On Fri, 05 Nov 2010 09:21 +0100, "Santiago Gala"
<sa...@gmail.com> wrote:

  CNN can probably give permission to use a set of 6 month old
  "regular" news for such purpose. If contacted through their PR
  people you could 'pay' with words about how important this is,
  or a joint release about them helping research (but talk with
  press@ before assuming it)


Re: Fair-use data in svn

Posted by Santiago Gala <sa...@gmail.com>.

CNN can probably give permission to use a set of 6 month old "regular" news
for such purpose. If contacted through their PR people you could 'pay' with
words about how important this is, or a joint release about them helping
research (but talk with press@ before assuming it)

On 05/11/2010 01:08, "Benson Margulies" <bi...@gmail.com> wrote:
> Here's a concrete example. Let's say that the job at hand is to
> extract useful text from webpages. You need to test this on the news
> sites that people want to work with, like CNN. The inventory of
> 'Commons' pages is not representative.

Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

On Thu, Nov 4, 2010 at 7:47 PM, Lawrence Rosen <lr...@rosenlaw.com> wrote:
> Benson, how about copying materials that are explicitly marked "Creative Commons"? There must be enough of that stuff on the web to collect into a test case.

Here's a concrete example. Let's say that the job at hand is to
extract useful text from webpages. You need to test this on the news
sites that people want to work with, like CNN. The inventory of
'Commons' pages is not representative.

Another bit of concretude:

Case 1: you have a representative collection of HTML pages, and you
use them to regress data extraction. Tika has avoided this by
depending on a non-ASF component (boilerpipe).

Case 2: you have, oh, 250,000 words of news, and you get people to
annotate them, and use them to train models. Whether there's enough of
the right stuff out there under CC is an open question.

RE: Fair-use data in svn

Posted by Lawrence Rosen <lr...@rosenlaw.com>.

Benson, how about copying materials that are explicitly marked "Creative Commons"? There must be enough of that stuff on the web to collect into a test case.

/Larry

Re: Fair-use data in svn

Posted by Sam Ruby <ru...@intertwingly.net>.

On Thu, Nov 4, 2010 at 5:56 PM, Benson Margulies <bi...@gmail.com> wrote:
>>
>> If these really are "miscellaneous" web pages, why can't you create a test consisting of links to the actual pages? Must you copy the pages themselves?
>
> You can't make a repeatable process that depends on ephemeral content
> -- and this content is always ephemeral -- sitting there when you want
> it.

The internet archives are not ephemeral: http://www.archive.org/web/web.php

Does that help?

- Sam Ruby
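
One way to act on the archive suggestion is to check in a manifest of pinned snapshot addresses instead of the pages themselves. The sketch below is illustrative only: the manifest format, the `PinnedPage` and `parse_manifest` names, and the sample capture time in the usage note are assumptions, not anything proposed in the thread. It relies on the Wayback Machine's public URL scheme, in which a capture is addressed by a 14-digit timestamp followed by the original URL.

```python
from dataclasses import dataclass
from typing import List

# Public Wayback Machine URL prefix; captures live under <prefix>/<timestamp>/<url>.
WAYBACK_PREFIX = "https://web.archive.org/web"


@dataclass(frozen=True)
class PinnedPage:
    """One test input: an original URL plus the capture time it is pinned to."""
    original_url: str
    timestamp: str  # YYYYMMDDhhmmss

    def snapshot_url(self) -> str:
        # Build the address of the pinned capture.
        return f"{WAYBACK_PREFIX}/{self.timestamp}/{self.original_url}"


def parse_manifest(text: str) -> List[PinnedPage]:
    """Parse lines of the form '<timestamp> <url>' (hypothetical format).

    Blank lines and lines starting with '#' are skipped.
    """
    pages = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        timestamp, url = line.split(maxsplit=1)
        # This sketch insists on full 14-digit timestamps so the pin is unambiguous.
        if len(timestamp) != 14 or not timestamp.isdigit():
            raise ValueError(f"bad timestamp: {timestamp!r}")
        pages.append(PinnedPage(url, timestamp))
    return pages
```

A manifest line such as `20101104170712 http://www.cnn.com/` (a made-up capture time) would resolve to a single snapshot address. A test would fetch each `snapshot_url()` at run time, so svn holds only links; whether a capture actually exists for a given timestamp has to be checked against the archive itself.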


Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

> There is no exception in copyright infringement law that allows you to copy other people's copyrighted materials and distribute them on an Apache website, no matter how upstanding the goals, without a license. Ask permission first.

It won't be on an apache web site. It will be in a zip file in svn,
read by (for example) a unit test. That seems a relevant distinction
to me, but YAAL, not me.

>
> If you intend to rely on a fair use defense, don't count on it without analyzing the fair use factors carefully. I'll work with you on that analysis if you can't find a better alternative for generating test data.
>
> If these really are "miscellaneous" web pages, why can't you create a test consisting of links to the actual pages? Must you copy the pages themselves?

You can't make a repeatable process that depends on ephemeral content
-- and this content is always ephemeral -- sitting there when you want
it.

RE: Fair-use data in svn

Posted by Lawrence Rosen <lr...@rosenlaw.com>.

Benson Margulies wrote:
> I would like to discover that this is acceptable, perhaps with some
> caveats and requirements for NOTICE.

There is no exception in copyright infringement law that allows you to copy other people's copyrighted materials and distribute them on an Apache website, no matter how upstanding the goals, without a license. Ask permission first.

If you intend to rely on a fair use defense, don't count on it without analyzing the fair use factors carefully. I'll work with you on that analysis if you can't find a better alternative for generating test data. 

If these really are "miscellaneous" web pages, why can't you create a test consisting of links to the actual pages? Must you copy the pages themselves?

/Larry

Re: Fair-use data in svn

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.

On 05/11/2010 9:23 AM, Daniel Kulp wrote:
> On Friday 05 November 2010 8:39:12 am Sim IJskes wrote:
>> You cannot copy verbatim. But you can create and publish the tools. You
>> can also create an internal representation, say a neural net, or
>> statistics, and provide annotations, as long as it is something new.
>>
>> So if you crawl the net, and build a statistics model of it, you can
>> distribute the statistics model data as your own.
>
> That's kind of what I was thinking.   Doesn't SpamAssassin do something
> similar?   They have a zone/jail someplace that collects a lot of copyrighted
> spam data and runs various analyses on it and such and then commits the
> results of said analysis into the repository.

Yeah, I suppose we do.  We collect ham (which I suppose would be
copyrighted) and spam (which in many cases is illegal itself, so I'm not
sure about copyright protection for that) and then run statistical
analysis on it (rule hits, rule generation, etc.) with rules and scores
generated and published in the repository.

I think our case differs a little more, though, in that people send us 
the data (via email)... we don't go out and collect it.  In any case, 
though, we're not publishing the actual ham and spam mail.

Daryl
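
As a rough illustration of publishing derived results rather than the mail itself, here is a minimal sketch that reduces a private corpus to aggregate token counts. It is not SpamAssassin's mass-check or rule-generation code; the function names and the `min_count` threshold are assumptions made for the example.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Very simple tokenizer: runs of lowercase letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def token_counts(documents: Iterable[str]) -> Counter:
    """Aggregate lowercase token frequencies over every document in the corpus."""
    counts: Counter = Counter()
    for text in documents:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def publishable_model(counts: Counter, min_count: int = 2) -> Dict[str, int]:
    """Keep only tokens seen at least min_count times.

    Dropping rare tokens means strings that occur in a single message are
    less likely to end up in the published output.
    """
    return {tok: n for tok, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Invented sample messages standing in for a private corpus.
    corpus = ["Cheap pills now", "cheap watches NOW", "meeting at noon"]
    print(publishable_model(token_counts(corpus)))
```

Only the dictionary returned by `publishable_model` would be committed; the messages themselves stay on the restricted host.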


Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 11/07/2010 04:13 AM, Benson Margulies wrote:
> So, in my mind, this brings us to the question of how the ASF could
> serve as a collection point for copyrighted corpora. The answer might
> be, "It can't." Dan Kulp raised what to me is the obvious alternative:
> some storage accessible to committers but not the general public.

Sorry, but a copy is a copy. No copy without a license. It doesn't
matter how big the population is that has access to the copy. You may
limit the detection of the infringement. But that's not what we do here.

You need a special license to show a DVD movie at a clubhouse, why would
a newspaper be any different?

Gr. Sim

Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 11/07/2010 04:21 AM, Benson Margulies wrote:
> On Sat, Nov 6, 2010 at 11:17 PM, Joe Schaefer <jo...@yahoo.com> wrote:
>> The SpamAssassin stuff lives in a virtual host provided
>> by Apache.  That is how I would go about acquiring the
>> copyrighted content without redistributing it to anyone
>> other than those with an account on the virtual host.

IANAL, but to me email looks like a different case than copied webpages. 
An email is sent to a recipient. The recipient can store the email but 
not redistribute without license. Emails are also copyright protected.

Have you already checked the fair use criteria for your project, and 
argued to Lawrence how you think they could mitigate copyright 
infringement claims against it?

Gr. Sim


Re: Fair-use data in svn

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.

On 06/11/2010 11:21 PM, Benson Margulies wrote:
> On Sat, Nov 6, 2010 at 11:17 PM, Joe Schaefer<jo...@yahoo.com>  wrote:
>> The SpamAsssassin stuff lives in a virtual host provided
>> by Apache.  That is how I would go about acquiring the
>> copyrighted content without redistributing it to anyone
>> other than those with an account on the virtual host.
>
> How do we decide who gets an account?

In our case it's been mainly PMC members or a couple of committers we 
trust not to screw things up.... which is pretty much the entire project 
(we're pretty small).  There's about a dozen people with access and I 
don't think we've ever turned anyone down for an account (or felt the 
need to).

>>> From: Benson Margulies<bi...@gmail.com>
>>> infringement? The spamassasin example seems apposite, and I wish  that
>>> Daryl would give more details about where the ham is kept and who  has
>>> access to it, and what legal determination went into setting up  the
>>> whole  business.

As Joe noted, some of the ham is on the virtual host provided by Apache. 
The majority of it, though, is kept by the "owners" (well, the recipients) 
of the ham on their own personal hosts.

Our mass-check software (that generates log files of what anti-spam 
rules match against ham/spam messages identified by message ID or 
mailbox filename) is run against the ham/spam corpora either on people's 
personal hosts or on the Apache hosted virtual server every night.  The 
only reason we have some ham/spam on the Apache hosted virtual machine 
is for the people who do not have access to their own CPU cycles for 
doing this analysis every night (it's quite CPU intensive as we run 
SpamAssassin against millions of messages on a much larger ruleset than 
what is published for general use).
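
To make the shape of such a log concrete, here is a minimal sketch in Python. It is not SpamAssassin's actual mass-check code: the rule names, the patterns, and the `mass_check` function are all hypothetical. It only illustrates a log that records which rules hit each message, keyed by message ID, while the message text itself stays on the host.

```python
import re

# Hypothetical rules for illustration only; these are not real SpamAssassin rules.
RULES = {
    "HYPOTHETICAL_FREE_MONEY": re.compile(r"free money", re.IGNORECASE),
    "HYPOTHETICAL_ALL_CAPS_SUBJECT": re.compile(r"^Subject: [A-Z !]+$", re.MULTILINE),
}


def mass_check(corpus):
    """Return log lines of the form '<label> <message-id> <rule,rule,...>'.

    `corpus` is an iterable of (label, message_id, raw_text) tuples, where
    label is "ham" or "spam". Only the label, the ID, and the names of the
    matching rules go into the log; the raw text does not.
    """
    lines = []
    for label, message_id, raw_text in corpus:
        hits = sorted(name for name, pattern in RULES.items() if pattern.search(raw_text))
        lines.append(f"{label} {message_id} {','.join(hits) or '-'}")
    return lines


# Stand-in messages; a real run would read each person's private mailboxes.
corpus = [
    ("spam", "<1@example.invalid>", "Subject: WIN NOW!\n\nGet free money today."),
    ("ham", "<2@example.invalid>", "Subject: Lunch on Friday\n\nSee you at noon."),
]
log = mass_check(corpus)
for line in log:
    print(line)
```

Because each line carries only an identifier and rule names, logs produced on different people's hosts can be combined without anyone handing over their mail.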

Daryl


Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 07-11-10 13:49, Benson Margulies wrote:
>>>
>>> How do we decide who gets an  account?
>>
>> By applying common sense.
>>

Would it be an idea to complete the following:

HYPOTHETICAL

Dear Lawrence,

I've crawled the web, with my product <fill in>, and stored the crawled 
pages on ASF infrastructure.

The BigNewsCorporation.com now accuses me of copyright infringement.

But I claim fair use based on the following arguments:

Substantiality
- I crawled their website, and I only used 1 in every 10 articles I found.

Market value
- I only used articles that were more than six months old, and after 
removing the formatting I split each article at a random boundary and 
reversed the order of the words in its sentences.

Etc. etc. etc.

Maybe this would be something Lawrence could work with.

Gr. Sim


Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

>>
>> How do we decide who gets an  account?
>
> By applying common sense.
>

I'm all in favor of common sense, but I'm trying to make sure that I
understand the legal thinking that went into it before trying to make
a proposal for something analogous.


Re: Fair-use data in svn

Posted by Joe Schaefer <jo...@yahoo.com>.

----- Original Message ----

> From: Benson Margulies <bi...@gmail.com>
> To: legal-discuss@apache.org
> Sent: Sat, November 6, 2010 11:21:29 PM
> Subject: Re: Fair-use data in svn
> 
> On Sat, Nov 6, 2010 at 11:17 PM, Joe Schaefer <jo...@yahoo.com>  wrote:
> > The SpamAsssassin stuff lives in a virtual host provided
> >  by Apache.  That is how I would go about acquiring the
> > copyrighted  content without redistributing it to anyone
> > other than those with an  account on the virtual host.
> >
> 
> How do we decide who gets an  account?

By applying common sense.


Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

On Sat, Nov 6, 2010 at 11:17 PM, Joe Schaefer <jo...@yahoo.com> wrote:
> The SpamAsssassin stuff lives in a virtual host provided
> by Apache.  That is how I would go about acquiring the
> copyrighted content without redistributing it to anyone
> other than those with an account on the virtual host.
>

How do we decide who gets an account?


Re: Fair-use data in svn

Posted by Joe Schaefer <jo...@yahoo.com>.

The SpamAssassin stuff lives in a virtual host provided
by Apache.  That is how I would go about acquiring the
copyrighted content without redistributing it to anyone
other than those with an account on the virtual host.




Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

Larry,

Before I type anything else, I'd better say, "Thank you, I now
appreciate that 'fair use' has nothing much to do with the practical
matter at hand."

The process of building NLP models has three parts: first, collect a
corpus. Second, annotate it. Third, build a model.

My original query here concerns the ability of the ASF to host the
first part -- in the case where the desired corpus is made up of
copyrighted materials for which no special permissions have been
obtained. What I think I've learned from this discussion is that the
usual ASF practice -- all 'source' materials are in the source tree,
available to anyone -- is essentially a publication that is likely to
infringe on copyright.

So, unless the ASF is willing to sanction an alternative process to
checking everything into the public source tree, ASF projects can't do
this entire process. Not because the models, as per your most recent
message, themselves can infringe, but because the publication of the
source materials would. I did want to double-check my belief that a
model derived from text was not, on its face, a derived work that
could infringe -- before I bothered anyone any further about this.

So, in my mind, this brings us to the question of how the ASF could
serve as a collection point for copyrighted corpora. The answer might
be, "It can't." Dan Kulp raised what to me is the obvious alternative:
some storage accessible to committers but not the general public.
Since this is the legal-discuss list, it strikes me as sensible for
this discussion to discover those strategies that are *legally*
reasonable (if any), and leave it to, well, the board, to decide if
any of those are tolerable from the standpoint of the Foundation's
goals. So, if I use a spider to grab a large amount of copyrighted
material, how narrowly do I have to control its distribution to avoid
infringement? The SpamAssassin example seems apposite, and I wish that
Daryl would give more details about where the ham is kept and who has
access to it, and what legal determination went into setting up the
whole business.


RE: Fair-use data in svn

Posted by Lawrence Rosen <lr...@rosenlaw.com>.

Benson,

You ask good questions. But I'd rather answer the real question you asked earlier than hypothetical questions that merely test my knowledge of the edges of copyright law.

You said earlier, in essence, that you intended to copy entire web pages owned by others and that you would store them in an Apache repository somewhere. I see nothing to suggest that isn't a derivative work, at least as a first guess by a somewhat experienced copyright lawyer. Doing that would probably be copyright infringement.

Fair use is a defense to the tort of copyright infringement. So the fair use question only comes up if you admit -- or are found guilty of -- copyright infringement, perhaps because in this case you created a derivative work. At that point, the court will expect the attorneys to argue the fair use factors for your particular infringing use, among which are the substantiality of your copies, the purposes to which you have put them, the effects on the copyright owner's commercial opportunities, etc. If the amalgam of that analysis is deemed "fair use" by the court, you don't have to pay infringement damages for your unauthorized derivative works; otherwise you do.

You have asked below somewhat different questions. If, for example, you calculate the frequencies of all the letters (or words, or concepts) in a copyrighted work, I don't believe that is a derivative work at all. Or if you take a copyrighted work and "run it through a statistical process and then hand out the result," I'd argue in court that that isn't a derivative work at all. Because there is no copyright infringement, the fair use defense won't be needed at all.
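
For readers who want to see how little such a chart contains, here is a minimal sketch in Python. The function name `letter_frequencies` and the sample sentence are assumptions made for illustration; nothing here comes from an actual project.

```python
from collections import Counter


def letter_frequencies(text):
    """Count how often each letter occurs, ignoring case and non-letters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


# A stand-in string; in practice this would be the full text of a work.
sample = "The quick brown fox jumps over the lazy dog."
chart = letter_frequencies(sample)
for letter, count in sorted(chart.items()):
    print(f"{letter} {count}")
```

The chart holds at most one count per letter however long the input is, and many different texts produce the same chart.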

Just like murder: Unless someone is actually killed, you don't have to plead self defense. Unless there is actual copyright infringement, you don't have to plead the fair use defense.

So for your earlier question, first let's decide if you are creating a derivative work. I think you will if you make copies of other people's web sites. Then we have to ask, is your infringing use a "fair use"? My article summarizes the fair use factors. Perhaps you can make a first pass at that multi-factor analysis once you convince yourself that you are (or will be) a copyright infringer by doing what you suggest with web pages?

/Larry




> -----Original Message-----
> From: Benson Margulies [mailto:bimargulies@gmail.com]
> Sent: Saturday, November 06, 2010 7:13 PM
> To: legal-discuss@apache.org
> Subject: Re: Fair-use data in svn
> 
> Larry, if we ignore, for the moment, the issue of 'publishing' via
> checking text into a public svn, there's a question of fair use which
> I don't feel illuminated on after reading your article.
> 
> If I absorb a stack of copyrighted material, and run it through a
> statistical process, and then hand out the result, have I 'used' it at
> all, fairly or otherwise? The constitutional principle and following
> discussion all seems to be discussing 'information in, recognizable
> derivative of information out.' In an extreme case, if I make a chart
> of the frequencies of all the letters in a copyrighted work, and
> publish the resulting chart, what's the situation?
> 

Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

Larry, if we ignore, for the moment, the issue of 'publishing' via
checking text into a public svn, there's a question of fair use which
I don't feel illuminated on after reading your article.

If I absorb a stack of copyrighted material, and run it through a
statistical process, and then hand out the result, have I 'used' it at
all, fairly or otherwise? The constitutional principle and following
discussion all seem to be discussing 'information in, recognizable
derivative of information out.' In an extreme case, if I make a chart
of the frequencies of all the letters in a copyrighted work, and
publish the resulting chart, what's the situation?


RE: Fair-use data in svn

Posted by Lawrence Rosen <lr...@rosenlaw.com>.

Benson and others,

This started off as a question about fair use. If you'd like a short introduction to the fair use doctrine in copyright law, see this old article of mine from Linux Journal:

   http://www.linuxjournal.com/article/6080 

/Larry



Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

On Nov 5, 2010, at 9:24 AM, Daniel Kulp <dk...@apache.org> wrote:

>
>
> On Friday 05 November 2010 8:39:12 am Sim IJskes wrote:
>> On 05-11-10 13:25, Benson Margulies wrote:
>>> Let me be clear on the regime that this discussion is heading for.
>>
>> First, i hope i didn't and don't give you the impression that i'm a
>> lawyer. :-)
>>
>> So anything i say right now needs to be cleared by a lawyer.
>>
>> You cannot copy verbatim. But you can create and publish the tools. You
>> can also create a internal representation, say a neural net, or
>> statistics, and provide annotations, as long as it something new.
>>
>> So if you crawl the net, and build a statistics model of it, you can
>> distribute the staticstics model data as your own.
>
> That's kind of what I was thinking.   Doesn't Spamassassin do something
> similar.   They have a zone/jail someplace that collects a lot of copyrighted
> spam data and runs various analysis on it and such and then commits the
> results of said analysis into the repository.

This is what I was hoping for, but I haven't received much encouragement.


Re: Fair-use data in svn

Posted by Daniel Kulp <dk...@apache.org>.


On Friday 05 November 2010 8:39:12 am Sim IJskes wrote:
> On 05-11-10 13:25, Benson Margulies wrote:
> > Let me be clear on the regime that this discussion is heading for.
> 
> First, i hope i didn't and don't give you the impression that i'm a
> lawyer. :-)
> 
> So anything i say right now needs to be cleared by a lawyer.
> 
> You cannot copy verbatim. But you can create and publish the tools. You
> can also create a internal representation, say a neural net, or
> statistics, and provide annotations, as long as it something new.
> 
> So if you crawl the net, and build a statistics model of it, you can
> distribute the staticstics model data as your own.

That's kind of what I was thinking.  Doesn't SpamAssassin do something 
similar?  They have a zone/jail someplace that collects a lot of copyrighted 
spam data, runs various analyses on it, and then commits the 
results of said analysis into the repository.

Dan



-- 
Daniel Kulp
dkulp@apache.org
http://dankulp.com/blog


Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 05-11-10 13:25, Benson Margulies wrote:
> Let me be clear on the regime that this discussion is heading for.

First, I hope I didn't and don't give you the impression that I'm a 
lawyer. :-)

So anything I say right now needs to be cleared by a lawyer.

You cannot copy verbatim. But you can create and publish the tools. You 
can also create an internal representation, say a neural net, or 
statistics, and provide annotations, as long as it is something new.

So if you crawl the net, and build a statistics model of it, you can 
distribute the statistics model data as your own.

A practical rule might be that it must be impossible to recreate the 
original crawled webpages of the news publishers from the published 
dataset. So you can count words, correlate them, score them, etc. But you 
cannot crawl the net, collect the source material, put it in an archive, 
and say, "here's the data I built the model with".
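
As a rough sketch of the "count words, correlate them" idea, the following Python aggregates word counts and same-document word pairs across a set of pages. The function name `word_statistics` and the two sample documents are hypothetical, and whether a given statistic is coarse enough to satisfy the rule above is a judgment this sketch does not settle.

```python
from collections import Counter
from itertools import combinations


def word_statistics(documents):
    """Return (word_counts, pair_counts) aggregated over all documents.

    pair_counts counts unordered pairs of distinct words that occur in the
    same document. Word order and document boundaries are not kept.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for text in documents:
        words = text.lower().split()
        word_counts.update(words)
        pair_counts.update(combinations(sorted(set(words)), 2))
    return word_counts, pair_counts


# Stand-in documents; a real run would read the privately held crawl.
docs = ["markets fell sharply today", "markets rose today"]
word_counts, pair_counts = word_statistics(docs)
print(word_counts.most_common(2))
print(pair_counts[("markets", "today")])
```

Only the two counters would be published; the input documents stay wherever the crawl was stored.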

IANAL! TINLA!

Gr. Sim




Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

Let me be clear on the regime that this discussion is heading for.

You can collect up a corpus of unencumbered items. You can annotate
them. You can train a model and measure the success of your algorithm.
All good.

What you cannot do is build a model that is any use on real world
data. What the world wants are classifiers (e.g.) that work on actual
CNN news feeds. Training on 'gutenberg' or CC materials won't produce
that. The data is often the hardest part of the problem, far harder
and more costly than the code. Just publishing code that can be used
to train such a thing is very convenient for very large organizations
who can join the LDC (a center at UPenn that acquires and relicenses
corpora) or, more likely, make their own.

Given that my livelihood depends on selling such things, I am perhaps
not heartbroken to discover that the ASF (at least) isn't a viable
home for free competition. On the other hand, perhaps the ASF could
effect a giant change in the landscape here by negotiating some sort
of grant from a variety of web publishers.

The legal principle at work here is very frustrating. I can collect
this stuff. I can use it. I can quietly share it with others via
private communications. But I can't check it into a public SVN, since
that looks like 'publication'. I do wonder whether simply bundling
into a .tar.gz changes anything. The traditional complaint of content
sources is against people who appropriate their content to essentially
compete with them by publishing it where people can easily read
it. Do they really have a cause for complaint if the data is packaged
so that it isn't trivially readable in a web browser?


Re: Fair-use data in svn

Posted by Niall Pemberton <ni...@gmail.com>.

On Thu, Nov 4, 2010 at 4:07 PM, Benson Margulies <bi...@gmail.com> wrote:
> I write code in some areas where 'real world' textual data is fuel.
> It's test cases. It's training corpora. It cannot be replaced by
> constructed, test-tube, text that could be created under the AL or
> some other 'class A' license.
>
> I'd like to contribute some of that data here at ASF. In some cases,
> that would require checking in test case data that consists of (for
> example) miscellaneous web pages grabbed with wget. In other cases, it
> might consist of larger collections of text derived from such pages.
>
> I would like to discover that this is acceptable, perhaps with some
> caveats and requirements for NOTICE.

A similar requirement for Lucene was asked about on this list.
Assuming that went ahead, perhaps they have documents that you could
(re)use for your purpose:

http://markmail.org/message/ysjxojxu3gset5gq

Niall


Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 05-11-10 11:37, Benson Margulies wrote:
> Folks,
>
> What I think we've established here is that a certain category of NLP
> tasks can't really be undertaken at Apache in the usual way. I'm not
> saying that this the end of the world or that it's not worthwhile to
> try to undertake them in some other way.
>
> The NLP research community has 'been there and done that' in terms of
> trying to clear rights to corpora. It's not necessarily impossible in
> all cases, but it's not by any means guaranteed to be possible when
> you need it to be possible.
>
> It's an interesting limit, perhaps, on open source: as a commercial
> enterprise, I use a spider and grab all the visible content of the
> web, with no regard for copyright, and so long as I don't turn around
> and publish that text, I have essentially no legal exposure. I can do
> statistics on it, train models on it, etc. Perhaps a content
> publisher, if they knew that I had used a large amount of their data,
> would take issue and ask me to pay something, and then perhaps we'd
> have a discussion of fair use, or perhaps we'd pay.
>
> For the immediate project I'm working on, I'll just push it to github
> after making my own personal (or corporate) determination of legal
> risk of being accused of unfair use of a bag of web pages, in a
> compressed tar file, is in a public source control repository. For the
> proposed OpenNLP podling, this will put some boundaries on them, but
> they might be happy to only check in code and 'cleared' corpora, and
> leave it to their users to apply the code to more interesting corpora.

You could scrape the URLs of:

http://wiki.creativecommons.org/Books

And classify them manually, and put these into your dataset.

Or limit your crawler to

http://www.gutenberg.org/wiki/Main_Page
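
A minimal sketch of limiting a crawler in this way, assuming a simple
host allowlist written in Python. The names ALLOWED_HOSTS, is_allowed
and filter_urls are hypothetical, and the two hosts are simply taken
from the URLs above. Note that an allowlist only restricts where the
crawler goes; it does not by itself establish the license of any page.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the hosts of the two URLs mentioned above.
ALLOWED_HOSTS = {"wiki.creativecommons.org", "www.gutenberg.org"}


def is_allowed(url: str) -> bool:
    """Return True if the URL's host is on the allowlist."""
    host = urlparse(url).hostname  # lowercased by urlparse; None if absent
    return host is not None and host in ALLOWED_HOSTS


def filter_urls(urls):
    """Keep only the URLs the crawler is permitted to fetch."""
    return [u for u in urls if is_allowed(u)]
```

A crawler would call is_allowed() on every discovered link before
fetching it, so that links leading off the allowlisted hosts are
dropped rather than followed.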

Gr. Sim



Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 05-11-10 14:14, Sim IJskes wrote:
> On 05-11-10 14:10, Benson Margulies wrote:
>> It has to be CNN, *and* Reuters, *and* NYT ... and then we start on
>
> Small note about NYT. If it is behind a user login, you have to be
> extremely careful, but then it is not a public source anymore.
>
> TINLA!
>
> Gr. Sim

Sorry about my English. You have to be precise in reading the license,
because it is not a public source anymore.

Gr. Sim




Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 05-11-10 14:10, Benson Margulies wrote:
> It has to be CNN, *and* Reuters, *and* NYT ... and then we start on

Small note about NYT. If it is behind a user login, you have to be 
extremely careful, but then it is not a public source anymore.

TINLA!

Gr. Sim


Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 05-11-10 14:39, Sim IJskes wrote:
> It would be a perfect opportunity for those companies, who possibly make
> use of ASF software for years, to get in the spotlight for cooperating
> with and contributing to the ASF, and sponsor our efforts in this way.
>
> Something for the publicity department? My checklist in this case would be:
> - verify the software works technically
> - verify the software has benefits for the contributor
> - ask for permission

A real benefit for the news publishers would of course be that, if the
software is primed with their specific dataset, they could find copyright
violations more easily. I will leave it to others to decide whether this
is something Eric Arthur Blair could have come up with.

Gr. Sim



Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 05-11-10 14:20, Benson Margulies wrote:
> The purpose of this email thread was to question if the ASF could find
> a *legal* path to do, collectively, what companies and academics do
> individually currently. If the answer to that is 'no', then looking to
> get permissions from content sources is a logical next step.

It would be a perfect opportunity for those companies, which have
possibly been using ASF software for years, to get in the spotlight for
cooperating with and contributing to the ASF, and to sponsor our efforts
in this way.

Something for the publicity department? My checklist in this case would be:
- verify the software works technically
- verify the software has benefits for the contributor
- ask for permission

But it is quite different from the usual workings of the ASF, because to
me the ASF looks in this case more like a trusted intermediary: it gets
permission for data mining on third-party material, and is the
publisher of the derived results. Nobody would be able to verify the
results without obtaining similar licenses. This would be in contrast
with the transparency of other ASF processes, but I couldn't find a
specific reference to an article on the ASF site where this was mandated.

Gr. Sim


Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

Niall,

Have I asked them if the ASF can have permission? No, I have not. Have
many, many, NLP researchers asked these questions over many, many
years? Yes. Is it worth a try again? Sure.

The purpose of this email thread was to question if the ASF could find
a *legal* path to do, collectively, what companies and academics do
individually currently. If the answer to that is 'no', then looking to
get permissions from content sources is a logical next step.

--benson


Re: Fair-use data in svn

Posted by Niall Pemberton <ni...@gmail.com>.

On Fri, Nov 5, 2010 at 1:10 PM, Benson Margulies <bi...@gmail.com> wrote:
> It has to be CNN, *and* Reuters, *and* NYT ... and then we start on
> languages that aren't English, and then you see how we stay very,
> very, busy at my day job.

Have you tried asking them for permission to do what you want?

Niall


> A model only works on data that you train it on. If you train it on
> Wikinews, you get a classifier (or whatever) for ... Wikinews. Sim has
> grasped the essential: using limited data, you can certainly prove out
> an algorithm. But a school of minnows can't set out to produce an open
> source competitor for, say, OpenCalais, unless they can share real
> data, lots and lots of real data.
>
>
> On Fri, Nov 5, 2010 at 8:43 AM, Ross Gardler <rg...@apache.org> wrote:
>> Does it have to be CNN? if it is News you want how about WikiNews?
>>
>> http://en.wikinews.org/wiki/Main_Page
>>
>> Ross
>>
>> Sent from my mobile device.
>>
>> On 5 Nov 2010, at 06:37, Benson Margulies <bi...@gmail.com> wrote:
>>
>>> Folks,
>>>
>>> What I think we've established here is that a certain category of NLP
>>> tasks can't really be undertaken at Apache in the usual way. I'm not
>>> saying that this the end of the world or that it's not worthwhile to
>>> try to undertake them in some other way.
>>>
>>> The NLP research community has 'been there and done that' in terms of
>>> trying to clear rights to corpora. It's not necessarily impossible in
>>> all cases, but it's not by any means guaranteed to be possible when
>>> you need it to be possible.
>>>
>>> It's an interesting limit, perhaps, on open source: as a commercial
>>> enterprise, I use a spider and grab all the visible content of the
>>> web, with no regard for copyright, and so long as I don't turn around
>>> and publish that text, I have essentially no legal exposure. I can do
>>> statistics on it, train models on it, etc. Perhaps a content
>>> publisher, if they knew that I had used a large amount of their data,
>>> would take issue and ask me to pay something, and then perhaps we'd
>>> have a discussion of fair use, or perhaps we'd pay.
>>>
>>> For the immediate project I'm working on, I'll just push it to github
>>> after making my own personal (or corporate) determination of legal
>>> risk of being accused of unfair use of a bag of web pages, in a
>>> compressed tar file, is in a public source control repository. For the
>>> proposed OpenNLP podling, this will put some boundaries on them, but
>>> they might be happy to only check in code and 'cleared' corpora, and
>>> leave it to their users to apply the code to more interesting corpora.
>>>
>>> --benson
>>>
>>>
>>> On Fri, Nov 5, 2010 at 5:15 AM, Sim IJskes <si...@apache.org> wrote:
>>>> On 11/05/2010 09:56 AM, Jukka Zitting wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes<si...@apache.org>  wrote:
>>>>>>
>>>>>> Wouldn't data publicly accessible in Jira be just another case of
>>>>>> redistribution? And by this falling within the scope of copyright
>>>>>> in many jurisdictions?
>>>>>
>>>>> Sure, but the "purpose and character" of a Jira attachment is much
>>>>> more limited than that of an official Apache release. Plus the need
>>>>> for explicitly documenting the licensing status is much more relaxed.
>>>>> We have lots of non-licensed Jira attachments that (at least to my
>>>>> layman mind) clearly fall within fair use for research purposes.
>>>>
>>>> I'm a layman;
>>>>
>>>> Isn't the distinction here that we are not talking about an original
>>>> contribution, made by the author, but with an artifact that is nothing more
>>>> then an aggregation of public available material? In the jurisdiction i live
>>>> under (The Netherlands), this will expose you to legal actions. If you want
>>>> to know more, look at the 'Knipselkrant-arrest'.
>>>>
>>>> Gr. Sim

Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

It has to be CNN, *and* Reuters, *and* NYT ... and then we start on
languages that aren't English, and then you see how we stay very,
very, busy at my day job.

A model only works on data that you train it on. If you train it on
Wikinews, you get a classifier (or whatever) for ... Wikinews. Sim has
grasped the essential: using limited data, you can certainly prove out
an algorithm. But a school of minnows can't set out to produce an open
source competitor for, say, OpenCalais, unless they can share real
data, lots and lots of real data.


On Fri, Nov 5, 2010 at 8:43 AM, Ross Gardler <rg...@apache.org> wrote:
> Does it have to be CNN? if it is News you want how about WikiNews?
>
> http://en.wikinews.org/wiki/Main_Page
>
> Ross
>
> Sent from my mobile device.
>
> On 5 Nov 2010, at 06:37, Benson Margulies <bi...@gmail.com> wrote:
>
>> Folks,
>>
>> What I think we've established here is that a certain category of NLP
>> tasks can't really be undertaken at Apache in the usual way. I'm not
>> saying that this the end of the world or that it's not worthwhile to
>> try to undertake them in some other way.
>>
>> The NLP research community has 'been there and done that' in terms of
>> trying to clear rights to corpora. It's not necessarily impossible in
>> all cases, but it's not by any means guaranteed to be possible when
>> you need it to be possible.
>>
>> It's an interesting limit, perhaps, on open source: as a commercial
>> enterprise, I use a spider and grab all the visible content of the
>> web, with no regard for copyright, and so long as I don't turn around
>> and publish that text, I have essentially no legal exposure. I can do
>> statistics on it, train models on it, etc. Perhaps a content
>> publisher, if they knew that I had used a large amount of their data,
>> would take issue and ask me to pay something, and then perhaps we'd
>> have a discussion of fair use, or perhaps we'd pay.
>>
>> For the immediate project I'm working on, I'll just push it to github
>> after making my own personal (or corporate) determination of legal
>> risk of being accused of unfair use of a bag of web pages, in a
>> compressed tar file, is in a public source control repository. For the
>> proposed OpenNLP podling, this will put some boundaries on them, but
>> they might be happy to only check in code and 'cleared' corpora, and
>> leave it to their users to apply the code to more interesting corpora.
>>
>> --benson
>>
>>
>> On Fri, Nov 5, 2010 at 5:15 AM, Sim IJskes <si...@apache.org> wrote:
>>> On 11/05/2010 09:56 AM, Jukka Zitting wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes<si...@apache.org>  wrote:
>>>>>
>>>>> Wouldn't data publicly accessible in Jira be just another case of
>>>>> redistribution? And by this falling within the scope of copyright
>>>>> in many jurisdictions?
>>>>
>>>> Sure, but the "purpose and character" of a Jira attachment is much
>>>> more limited than that of an official Apache release. Plus the need
>>>> for explicitly documenting the licensing status is much more relaxed.
>>>> We have lots of non-licensed Jira attachments that (at least to my
>>>> layman mind) clearly fall within fair use for research purposes.
>>>
>>> I'm a layman;
>>>
>>> Isn't the distinction here that we are not talking about an original
>>> contribution, made by the author, but with an artifact that is nothing more
>>> then an aggregation of public available material? In the jurisdiction i live
>>> under (The Netherlands), this will expose you to legal actions. If you want
>>> to know more, look at the 'Knipselkrant-arrest'.
>>>
>>> Gr. Sim

Re: Fair-use data in svn

Posted by Ross Gardler <rg...@apache.org>.

Does it have to be CNN? If it is news you want, how about Wikinews?

http://en.wikinews.org/wiki/Main_Page

Ross

Sent from my mobile device.

On 5 Nov 2010, at 06:37, Benson Margulies <bi...@gmail.com> wrote:

> Folks,
> 
> What I think we've established here is that a certain category of NLP
> tasks can't really be undertaken at Apache in the usual way. I'm not
> saying that this the end of the world or that it's not worthwhile to
> try to undertake them in some other way.
> 
> The NLP research community has 'been there and done that' in terms of
> trying to clear rights to corpora. It's not necessarily impossible in
> all cases, but it's not by any means guaranteed to be possible when
> you need it to be possible.
> 
> It's an interesting limit, perhaps, on open source: as a commercial
> enterprise, I use a spider and grab all the visible content of the
> web, with no regard for copyright, and so long as I don't turn around
> and publish that text, I have essentially no legal exposure. I can do
> statistics on it, train models on it, etc. Perhaps a content
> publisher, if they knew that I had used a large amount of their data,
> would take issue and ask me to pay something, and then perhaps we'd
> have a discussion of fair use, or perhaps we'd pay.
> 
> For the immediate project I'm working on, I'll just push it to github
> after making my own personal (or corporate) determination of legal
> risk of being accused of unfair use of a bag of web pages, in a
> compressed tar file, is in a public source control repository. For the
> proposed OpenNLP podling, this will put some boundaries on them, but
> they might be happy to only check in code and 'cleared' corpora, and
> leave it to their users to apply the code to more interesting corpora.
> 
> --benson
> 
> 
> On Fri, Nov 5, 2010 at 5:15 AM, Sim IJskes <si...@apache.org> wrote:
>> On 11/05/2010 09:56 AM, Jukka Zitting wrote:
>>> 
>>> Hi,
>>> 
>>> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes<si...@apache.org>  wrote:
>>>> 
>>>> Wouldn't data publicly accessible in Jira be just another case of
>>>> redistribution? And by this falling within the scope of copyright
>>>> in many jurisdictions?
>>> 
>>> Sure, but the "purpose and character" of a Jira attachment is much
>>> more limited than that of an official Apache release. Plus the need
>>> for explicitly documenting the licensing status is much more relaxed.
>>> We have lots of non-licensed Jira attachments that (at least to my
>>> layman mind) clearly fall within fair use for research purposes.
>> 
>> I'm a layman;
>> 
>> Isn't the distinction here that we are not talking about an original
>> contribution, made by the author, but with an artifact that is nothing more
>> then an aggregation of public available material? In the jurisdiction i live
>> under (The Netherlands), this will expose you to legal actions. If you want
>> to know more, look at the 'Knipselkrant-arrest'.
>> 
>> Gr. Sim

Re: Fair-use data in svn

Posted by Benson Margulies <bi...@gmail.com>.

Folks,

What I think we've established here is that a certain category of NLP
tasks can't really be undertaken at Apache in the usual way. I'm not
saying that this is the end of the world or that it's not worthwhile to
try to undertake them in some other way.

The NLP research community has 'been there and done that' in terms of
trying to clear rights to corpora. It's not necessarily impossible in
all cases, but it's not by any means guaranteed to be possible when
you need it to be possible.

It's an interesting limit, perhaps, on open source: as a commercial
enterprise, I use a spider and grab all the visible content of the
web, with no regard for copyright, and so long as I don't turn around
and publish that text, I have essentially no legal exposure. I can do
statistics on it, train models on it, etc. Perhaps a content
publisher, if they knew that I had used a large amount of their data,
would take issue and ask me to pay something, and then perhaps we'd
have a discussion of fair use, or perhaps we'd pay.

For the immediate project I'm working on, I'll just push it to github
after making my own personal (or corporate) determination of the legal
risk of being accused of unfair use when a bag of web pages, in a
compressed tar file, sits in a public source control repository. For the
proposed OpenNLP podling, this will put some boundaries on them, but
they might be happy to only check in code and 'cleared' corpora, and
leave it to their users to apply the code to more interesting corpora.

--benson

On Fri, Nov 5, 2010 at 5:15 AM, Sim IJskes <si...@apache.org> wrote:
> On 11/05/2010 09:56 AM, Jukka Zitting wrote:
>>
>> Hi,
>>
>> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes<si...@apache.org>  wrote:
>>>
>>> Wouldn't data publicly accessible in Jira be just another case of
>>> redistribution? And by this falling within the scope of copyright
>>> in many jurisdictions?
>>
>> Sure, but the "purpose and character" of a Jira attachment is much
>> more limited than that of an official Apache release. Plus the need
>> for explicitly documenting the licensing status is much more relaxed.
>> We have lots of non-licensed Jira attachments that (at least to my
>> layman mind) clearly fall within fair use for research purposes.
>
> I'm a layman;
>
> Isn't the distinction here that we are not talking about an original
> contribution, made by the author, but with an artifact that is nothing more
> then an aggregation of public available material? In the jurisdiction i live
> under (The Netherlands), this will expose you to legal actions. If you want
> to know more, look at the 'Knipselkrant-arrest'.
>
> Gr. Sim

Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 11/05/2010 09:56 AM, Jukka Zitting wrote:
> Hi,
>
> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes<si...@apache.org>  wrote:
>> Wouldn't data publicly accessible in Jira be just another case of
>> redistribution? And by this falling within the scope of copyright
>> in many jurisdictions?
>
> Sure, but the "purpose and character" of a Jira attachment is much
> more limited than that of an official Apache release. Plus the need
> for explicitly documenting the licensing status is much more relaxed.
> We have lots of non-licensed Jira attachments that (at least to my
> layman mind) clearly fall within fair use for research purposes.

I'm a layman;

Isn't the distinction here that we are not talking about an original
contribution, made by the author, but about an artifact that is nothing
more than an aggregation of publicly available material? In the
jurisdiction I live under (The Netherlands), this will expose you to
legal action. If you want to know more, look at the 'Knipselkrant-arrest'.

Gr. Sim



Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 05-11-10 09:56, Jukka Zitting wrote:

> Re: explicit permission. I'd only consider going through that trouble
> if we're talking about really substantial amounts of data. If I
> understood correctly, Benson was only looking for things like a few
> representative pages per site, so the "amount and substantiality"
> criteria for fair use would clearly be met and no explicit permission
> should be needed.

I'm new here, so that's why I'm pursuing this a bit.

Isn't the ASF about a simple, almost maintenance-free licensing system?
There was talk about CNN, a company that relies only on copyright law to
protect its business. Shouldn't we steer away from issues that are
just inside or just outside fair use? Fair use has a strong possibility
of being tested in court. We are in the business of creating software,
not jurisprudence. In practical terms, we cannot guard against a user
posting unlicensed products in our Jira, but when a member of the ASF
starts exploring the boundaries of copyright law by posting such a
product, I doubt we can exercise the excuse of good faith.

If you agree, I would propose a slogan:

"Never copy without a license. If you can't get it, forget it!"

Gr. Sim


Re: Fair-use data in svn

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes <si...@apache.org> wrote:
> Wouldn't data publicly accessible in Jira be just another case of
> redistribution? And by this falling within the scope of copyright
> in many jurisdictions?

Sure, but the "purpose and character" of a Jira attachment is much
more limited than that of an official Apache release. Plus the need
for explicitly documenting the licensing status is much more relaxed.
We have lots of non-licensed Jira attachments that (at least to my
layman mind) clearly fall within fair use for research purposes.

Re: explicit permission. I'd only consider going through that trouble
if we're talking about really substantial amounts of data. If I
understood correctly, Benson was only looking for things like a few
representative pages per site, so the "amount and substantiality"
criteria for fair use would clearly be met and no explicit permission
should be needed.

BR,

Jukka Zitting


Re: Fair-use data in svn

Posted by Sim IJskes <si...@apache.org>.

On 11/05/2010 08:36 AM, Jukka Zitting wrote:
> My feeling is that it would be best to avoid including such data in an
> Apache release, if only to avoid the complexities of properly
> justifying and documenting the licensing status of such data.
>
> What you probably could do instead is collect such data and package it
> as a test suite, but only attach it to Jira without the license flag
> set. Then anyone working on the relevant code can still access the
> test data and run the tests locally, but we wouldn't need to include
> the data in a release.

Wouldn't data publicly accessible in Jira be just another case of
redistribution? And would it not thereby fall within the scope of
copyright in many jurisdictions?

Gr. Sim


Re: Fair-use data in svn

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Nov 4, 2010 at 6:07 PM, Benson Margulies <bi...@gmail.com> wrote:
> I'd like to contribute some of that data here at ASF. In some cases,
> that would require checking in test case data that consists of (for
> example) miscellaneous web pages grabbed with wget. In other cases, it
> might consist of larger collections of text derived from such pages.

See http://markmail.org/message/ysjxojxu3gset5gq for an earlier
discussion of a similar case.

My feeling is that it would be best to avoid including such data in an
Apache release, if only to avoid the complexities of properly
justifying and documenting the licensing status of such data.

What you probably could do instead is collect such data and package it
as a test suite, but only attach it to Jira without the license flag
set. Then anyone working on the relevant code can still access the
test data and run the tests locally, but we wouldn't need to include
the data in a release.

BR,

Jukka Zitting
