Posted to user@nutch.apache.org by Berlin Brown <be...@gmail.com> on 2006/03/30 09:13:57 UTC

Legal issues

What are, say, the legal issues of crawling a site like reddit, digg or
slashdot? Assuming that you are just collecting links that users post
through that service and then regathering those links, I can't see an
issue there.

The other extreme would be crawling Google and requerying or something
along those lines.

Re: Legal issues

Posted by TDLN <di...@gmail.com>.
The "official" reason one reads about is that, in case the server that the
page resides on is down or unreachable, the user can still access the search
result. The Google Terms phrase it like this: "Google stores many web pages
in its cache to retrieve for users as a back-up in case the page's server
temporarily fails."

The only cases I remember where content providers objected to caching their
pages were related to cases where the cache provided access to pages from
so-called "member areas". I think the NYT once had a case against Google,
where the latter was caching pages normally only accessible to subscription
members.
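
As an aside, publishers who object to having a page cached usually signal it
with the robots "noarchive" meta tag, which the major engines honour by
indexing the page without exposing a cached copy. A rough sketch, in Python,
of how a small engine could respect the same signal; this is illustrative
only (not how Nutch or Google implement it), and example.com is a placeholder:

import re
import urllib.request

def allows_archiving(html):
    """False if a meta tag mentions robots and noarchive (the page opts out)."""
    for tag in re.findall(r"<meta[^>]*>", html, re.IGNORECASE):
        if re.search(r"robots", tag, re.IGNORECASE) and \
           re.search(r"noarchive", tag, re.IGNORECASE):
            return False
    return True

html = urllib.request.urlopen("http://example.com/").read().decode("utf-8", "replace")
if allows_archiving(html):
    print("fine to offer a cached copy of this page")
else:
    print("page opts out: index it, but do not expose a cached copy")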

We recently had an interesting case here in the Netherlands. A new search
engine (www.zoekallehuizen.nl) crawled the internet sites of housing brokers
to collect information on real estate for sale. The owners of the sites
claimed protection under the European database act, if I remember correctly,
which is one of the strictest (and most questionable) copyright acts out
there. They lost their case, though, in a court decision that was hailed as a
big victory by everyone involved in search engines in Holland.

Rgrds, Thomas








On 3/30/06, Insurance Squared Inc. <gc...@insurancesquared.com> wrote:
>
> I'm not trying to argue legalities, just pointing out that there's an
> undercurrent out there in the community where there's some backlash
> against SE's and crawlers because of the cache. Here's an example; this
> guy: http://incredibill.blogspot.com/  is scraper/bot/crawler crazy.
> And he actively blocks nutch.  *and* that blog is widely read.
> (actually, I think what he does is serve some nonsense phrase that gets
> indexed.  That lets him search in your SE for his nonsense phrase).
>
> It's a good idea to keep the content providers happy.  If we don't, more
> of them can block our crawlers for those engines they feel don't provide
> value to them.  And that's bad.
>
> I'd be curious if anyone has any good reasons for actually showing the
> 'cache'.  I personally don't see any real use for it, other than for
> someone's competitors using it to check to see if they're cloaking.
>
> g.
>
>
>
>
> Nutch Newbie wrote:
>
> >Hmmm.. How about this... The photographer who takes a photo has the
> >copyright over the photo, not the subject of the picture, whether that is
> >you, me or anyone else in the photo. So caching is nothing but taking a
> >picture using another sort of camera called a robot :-) Nothing more,
> >really. If a browser maker decides to show an HTML tag, let's say <H1>,
> >in 300 pixels, will that be a copyright or trademark violation then?
> >
> >What one can do is prevent oneself from being photographed, or stop the
> >robots from visiting one's website :-)
> >
> >On 3/30/06, Insurance Squared Inc. <gc...@insurancesquared.com> wrote:
> >
> >
> >>FWIW, I believe all of what's been stated is the case - and I'd also
> >>assume that since Google/MSN/Yahoo are all doing this that it's been
> >>tested and OK.
> >>
> >>However I know many people complain about the cache.  Some people see it
> >>as a copyright violation - technically correct or not, the cache does
> >>basically duplicate their site and make it available online.  And I've
> >>never seen how to argue against that other than 'legally it's not'.  IMO
> >>it's cutting it pretty close.
> >>
> >>The other issue some have with displaying cache is that it allows people
> >>to pull down websites without ever visiting the website in question.
> >>If I put serious effort into blocking bots and scrapers for example, but
> >>let the SE's in so I can get indexed, then the bots and scrapers can
> >>completely bypass my efforts, visit the SE and pull down the cached
> >>pages there.  They can then do nasty stuff with my content, like copy it
> >>on their site for their own purposes.  Not good, and that's the reason
> >>why I don't show the cache on my SE.
> >>
>
> Google stores many web pages in its cache to retrieve for users as a
> back-up in case the page's server temporarily fails.
>
> >>g.
> >>
> >>
> >>Dan Morrill wrote:
> >>
> >>
> >>
> >>>If I remember it correctly, Google has been sued and won a number of
> >>>times on this issue: you can cache and you can search others' web sites;
> >>>Groklaw has the data on this one. But I know you can search and you can
> >>>cache under fair use and the idea of public access; as long as you are
> >>>not cracking passwords, you honor robots.txt, and they post it on the
> >>>web, it is considered public in that regard.
> >>>
> >>>I am not a lawyer; check Groklaw.
> >>>
> >>>r/d
> >>>
> >>>-----Original Message-----
> >>>From: TDLN [mailto:diamond108@gmail.com]
> >>>Sent: Thursday, March 30, 2006 3:34 AM
> >>>To: nutch-user@lucene.apache.org
> >>>Subject: Re: Legal issues
> >>>
> >>>Google's and Yahoo's Terms of Service provide interesting reading
> regarding
> >>>such legal issues.
> >>>
> >>>http://www.google.com/terms_of_service.html
> >>>http://docs.yahoo.com/info/terms/
> >>>
> >>>Rgrds, Thomas
> >>>
> >>>On 3/30/06, gekkokid <me...@gekkokid.org.uk> wrote:
> >>>
> >>>
> >>>
> >>>
> >>>>Shouldn't be a problem if you're honouring the robots.txt.
> >>>>
> >>>>Legal issues could be stealing copyrighted material? That's if you're
> >>>>reproducing it, but if you're analysing the content and links and
> >>>>keeping to the robots.txt rules I doubt you'll have a problem, unless
> >>>>it's crawling every 10 minutes.
> >>>>
> >>>>Wouldn't grabbing the RSS feed be better?
> >>>>
> >>>>Would http://diggdot.us be a good example of what you're trying to do?
> >>>>Or have I got the wrong idea entirely?
> >>>>
> >>>>Anyone else have any thoughts?
> >>>>
> >>>>_gk
> >>>>
> >>>>----- Original Message -----
> >>>>From: "Berlin Brown" <be...@gmail.com>
> >>>>To: <nu...@lucene.apache.org>
> >>>>Sent: Thursday, March 30, 2006 8:13 AM
> >>>>Subject: Legal issues
> >>>>
> >>>>
> >>>>What are, say, the legal issues of crawling a site like reddit, digg or
> >>>>slashdot? Assuming that you are just collecting links that users post
> >>>>through that service and then regathering those links, I can't see an
> >>>>issue there.
> >>>>
> >>>>The other extreme would be crawling Google and requerying or something
> >>>>along those lines.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
>

Re: Legal issues

Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
I'm not trying to argue legalities, just pointing out that there's an 
undercurrent out there in the community where there's some backlash 
against SE's and crawlers because of the cache. Here's an example; this 
guy: http://incredibill.blogspot.com/  is scraper/bot/crawler crazy.  
And he actively blocks nutch.  *and* that blog is widely read.  
(actually, I think what he does is serve some nonsense phrase that gets 
indexed.  That lets him search in your SE for his nonsense phrase).
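
The "nonsense phrase" trick is easy to reproduce: serve crawlers a page
containing a unique token, then query an engine for that token later; a hit
means the engine (or something scraping it) has indexed your bait page. A
rough sketch, with a made-up token and a placeholder search URL, since every
engine's query interface differs:

import urllib.parse
import urllib.request

TOKEN = "zxqv8841 honeypot phrase"                       # made-up bait phrase
SEARCH_URL = "http://search.example.com/search?query="   # placeholder query URL

req = urllib.request.Request(SEARCH_URL + urllib.parse.quote(TOKEN),
                             headers={"User-Agent": "honeytoken-check"})
body = urllib.request.urlopen(req).read().decode("utf-8", "replace")

if TOKEN in body:
    print("token found: that engine has indexed the bait page")
else:
    print("token not found in the first page of results")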

It's a good idea to keep the content providers happy.  If we don't, more 
of them can block our crawlers for those engines they feel don't provide 
value to them.  And that's bad.

I'd be curious if anyone has any good reasons for actually showing the 
'cache'.  I personally don't see any real use for it, other than for 
someone's competitors using it to check to see if they're cloaking.

g.




Nutch Newbie wrote:

>Hmmm.. How about this... The photographer who takes a photo has the
>copyright over the photo, not the subject of the picture, whether that is
>you, me or anyone else in the photo. So caching is nothing but taking a
>picture using another sort of camera called a robot :-) Nothing more,
>really. If a browser maker decides to show an HTML tag, let's say <H1>,
>in 300 pixels, will that be a copyright or trademark violation then?
>
>What one can do is prevent oneself from being photographed, or stop the
>robots from visiting one's website :-)
>
>On 3/30/06, Insurance Squared Inc. <gc...@insurancesquared.com> wrote:
>  
>
>>FWIW, I believe all of what's been stated is the case - and I'd also
>>assume that since Google/MSN/Yahoo are all doing this that it's been
>>tested and OK.
>>
>>However I know many people complain about the cache.  Some people see it
>>as a copyright violation - technically correct or not, the cache does
>>basically duplicate their site and make it available online.  And I've
>>never seen how to argue against that other than 'legally it's not'.  IMO
>>it's cutting it pretty close.
>>
>>The other issue some have with displaying cache is that it allows people
>>to pull down websites without ever visiting the website in question.
>>If I put serious effort into blocking bots and scrapers for example, but
>>let the SE's in so I can get indexed, then the bots and scrapers can
>>completely bypass my efforts, visit the SE and pull down the cached
>>pages there.  They can then do nasty stuff with my content, like copy it
>>on their site for their own purposes.  Not good, and that's the reason
>>why I don't show the cache on my SE.
>>
>>g.
>>
>>
>>Dan Morrill wrote:
>>
>>    
>>
>>>If I remember it correctly, Google has been sued and won a number of times
>>>on this issue: you can cache and you can search others' web sites; Groklaw
>>>has the data on this one. But I know you can search and you can cache under
>>>fair use and the idea of public access; as long as you are not cracking
>>>passwords, you honor robots.txt, and they post it on the web, it is
>>>considered public in that regard.
>>>
>>>I am not a lawyer; check Groklaw.
>>>
>>>r/d
>>>
>>>-----Original Message-----
>>>From: TDLN [mailto:diamond108@gmail.com]
>>>Sent: Thursday, March 30, 2006 3:34 AM
>>>To: nutch-user@lucene.apache.org
>>>Subject: Re: Legal issues
>>>
>>>Google's and Yahoo's Terms of Service provide interesting reading regarding
>>>such legal issues.
>>>
>>>http://www.google.com/terms_of_service.html
>>>http://docs.yahoo.com/info/terms/
>>>
>>>Rgrds, Thomas
>>>
>>>On 3/30/06, gekkokid <me...@gekkokid.org.uk> wrote:
>>>
>>>
>>>      
>>>
>>>>Shouldn't be a problem if you're honouring the robots.txt.
>>>>
>>>>Legal issues could be stealing copyrighted material? That's if you're
>>>>reproducing it, but if you're analysing the content and links and keeping
>>>>to the robots.txt rules I doubt you'll have a problem, unless it's
>>>>crawling every 10 minutes.
>>>>
>>>>Wouldn't grabbing the RSS feed be better?
>>>>
>>>>Would http://diggdot.us be a good example of what you're trying to do?
>>>>Or have I got the wrong idea entirely?
>>>>
>>>>Anyone else have any thoughts?
>>>>
>>>>_gk
>>>>
>>>>----- Original Message -----
>>>>From: "Berlin Brown" <be...@gmail.com>
>>>>To: <nu...@lucene.apache.org>
>>>>Sent: Thursday, March 30, 2006 8:13 AM
>>>>Subject: Legal issues
>>>>
>>>>
>>>>What are, say, the legal issues of crawling a site like reddit, digg or
>>>>slashdot? Assuming that you are just collecting links that users post
>>>>through that service and then regathering those links, I can't see an
>>>>issue there.
>>>>
>>>>The other extreme would be crawling Google and requerying or something
>>>>along those lines.
>>>>
>>>>
>>>>
>>>>
>>>>        
>>>>
>>>
>>>
>>>      
>>>
>
>  
>

Re: Legal issues

Posted by Nutch Newbie <nu...@gmail.com>.
Hmmm.. How about this... The photographer who takes a photo has the
copyright over the photo, not the subject of the picture, whether that is
you, me or anyone else in the photo. So caching is nothing but taking a
picture using another sort of camera called a robot :-) Nothing more,
really. If a browser maker decides to show an HTML tag, let's say <H1>,
in 300 pixels, will that be a copyright or trademark violation then?

What one can do is prevent oneself from being photographed, or stop the
robots from visiting one's website :-)
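
On the "stop the robots" point, the standard mechanism is a robots.txt file
at the site root. A minimal sketch; the agent token "Nutch" is only an
assumption here, since a Nutch install announces whatever name its operator
sets in the http.agent.name property:

# robots.txt served at http://example.com/robots.txt (placeholder site)
# Turn away a crawler announcing itself as "Nutch"; let everyone else in.
User-agent: Nutch
Disallow: /

User-agent: *
Disallow: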

On 3/30/06, Insurance Squared Inc. <gc...@insurancesquared.com> wrote:
> FWIW, I believe all of what's been stated is the case - and I'd also
> assume that since Google/MSN/Yahoo are all doing this that it's been
> tested and OK.
>
> However I know many people complain about the cache.  Some people see it
> as a copyright violation - technically correct or not, the cache does
> basically duplicate their site and make it available online.  And I've
> never seen how to argue against that other than 'legally it's not'.  IMO
> it's cutting it pretty close.
>
> The other issue some have with displaying cache is that it allows people
> to pull down websites without ever visiting the website in question.
> If I put serious effort into blocking bots and scrapers for example, but
> let the SE's in so I can get indexed, then the bots and scrapers can
> completely bypass my efforts, visit the SE and pull down the cached
> pages there.  They can then do nasty stuff with my content, like copy it
> on their site for their own purposes.  Not good, and that's the reason
> why I don't show the cache on my SE.
>
> g.
>
>
> Dan Morrill wrote:
>
> >If I remember it correctly, Google has been sued and won a number of times
> >on this issue: you can cache and you can search others' web sites; Groklaw
> >has the data on this one. But I know you can search and you can cache under
> >fair use and the idea of public access; as long as you are not cracking
> >passwords, you honor robots.txt, and they post it on the web, it is
> >considered public in that regard.
> >
> >I am not a lawyer; check Groklaw.
> >
> >r/d
> >
> >-----Original Message-----
> >From: TDLN [mailto:diamond108@gmail.com]
> >Sent: Thursday, March 30, 2006 3:34 AM
> >To: nutch-user@lucene.apache.org
> >Subject: Re: Legal issues
> >
> >Google's and Yahoo's Terms of Service provide interesting reading regarding
> >such legal issues.
> >
> >http://www.google.com/terms_of_service.html
> >http://docs.yahoo.com/info/terms/
> >
> >Rgrds, Thomas
> >
> >On 3/30/06, gekkokid <me...@gekkokid.org.uk> wrote:
> >
> >
> >>Shouldn't be a problem if you're honouring the robots.txt.
> >>
> >>Legal issues could be stealing copyrighted material? That's if you're
> >>reproducing it, but if you're analysing the content and links and keeping
> >>to the robots.txt rules I doubt you'll have a problem, unless it's
> >>crawling every 10 minutes.
> >>
> >>Wouldn't grabbing the RSS feed be better?
> >>
> >>Would http://diggdot.us be a good example of what you're trying to do?
> >>Or have I got the wrong idea entirely?
> >>
> >>Anyone else have any thoughts?
> >>
> >>_gk
> >>
> >>----- Original Message -----
> >>From: "Berlin Brown" <be...@gmail.com>
> >>To: <nu...@lucene.apache.org>
> >>Sent: Thursday, March 30, 2006 8:13 AM
> >>Subject: Legal issues
> >>
> >>
> >>What are, say, the legal issues of crawling a site like reddit, digg or
> >>slashdot? Assuming that you are just collecting links that users post
> >>through that service and then regathering those links, I can't see an
> >>issue there.
> >>
> >>The other extreme would be crawling Google and requerying or something
> >>along those lines.
> >>
> >>
> >>
> >>
> >
> >
> >
> >
>

Re: Legal issues

Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
FWIW, I believe all of what's been stated is the case - and I'd also 
assume that since Google/MSN/Yahoo are all doing this that it's been 
tested and OK. 

However I know many people complain about the cache.  Some people see it 
as a copyright violation - technically correct or not, the cache does 
basically duplicate their site and make it available online.  And I've 
never seen how to argue against that other than 'legally it's not'.  IMO 
it's cutting it pretty close. 

The other issue some have with displaying cache is that it allows people 
to pull down websites without ever visiting the website in question.
If I put serious effort into blocking bots and scrapers for example, but 
let the SE's in so I can get indexed, then the bots and scrapers can 
completely bypass my efforts, visit the SE and pull down the cached 
pages there.  They can then do nasty stuff with my content, like copy it 
on their site for their own purposes.  Not good, and that's the reason 
why I don't show the cache on my SE.

g.


Dan Morrill wrote:

>If I remember it correctly, Google has been sued and won a number of times
>on this issue: you can cache and you can search others' web sites; Groklaw
>has the data on this one. But I know you can search and you can cache under
>fair use and the idea of public access; as long as you are not cracking
>passwords, you honor robots.txt, and they post it on the web, it is
>considered public in that regard.
>
>I am not a lawyer; check Groklaw.
>
>r/d
>
>-----Original Message-----
>From: TDLN [mailto:diamond108@gmail.com] 
>Sent: Thursday, March 30, 2006 3:34 AM
>To: nutch-user@lucene.apache.org
>Subject: Re: Legal issues
>
>Google's and Yahoo's Terms of Service provide interesting reading regarding
>such legal issues.
>
>http://www.google.com/terms_of_service.html
>http://docs.yahoo.com/info/terms/
>
>Rgrds, Thomas
>
>On 3/30/06, gekkokid <me...@gekkokid.org.uk> wrote:
>  
>
>>Shouldn't be a problem if you're honouring the robots.txt.
>>
>>Legal issues could be stealing copyrighted material? That's if you're
>>reproducing it, but if you're analysing the content and links and keeping
>>to the robots.txt rules I doubt you'll have a problem, unless it's crawling
>>every 10 minutes.
>>
>>Wouldn't grabbing the RSS feed be better?
>>
>>Would http://diggdot.us be a good example of what you're trying to do?
>>Or have I got the wrong idea entirely?
>>
>>Anyone else have any thoughts?
>>
>>_gk
>>
>>----- Original Message -----
>>From: "Berlin Brown" <be...@gmail.com>
>>To: <nu...@lucene.apache.org>
>>Sent: Thursday, March 30, 2006 8:13 AM
>>Subject: Legal issues
>>
>>
>>What are, say, the legal issues of crawling a site like reddit, digg or
>>slashdot? Assuming that you are just collecting links that users post
>>through that service and then regathering those links, I can't see an
>>issue there.
>>
>>The other extreme would be crawling Google and requerying or something
>>along those lines.
>>
>>
>>    
>>
>
>
>  
>

RE: Legal issues

Posted by Dan Morrill <ra...@baker.edu>.
If I remember it correctly, Google has been sued and won a number of times on
this issue: you can cache and you can search others' web sites; Groklaw has
the data on this one. But I know you can search and you can cache under fair
use and the idea of public access; as long as you are not cracking passwords,
you honor robots.txt, and they post it on the web, it is considered public in
that regard.

I am not a lawyer; check Groklaw.

r/d
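
On the "honor robots.txt" point, a minimal crawler-side check, sketched with
the Python standard library; the site, path and agent name below are only
placeholders:

from urllib import robotparser

AGENT = "MyNutchCrawler"   # use whatever User-agent your crawler actually sends
SITE = "http://example.com"

rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()                  # fetch and parse the site's robots.txt

url = SITE + "/some/page.html"
if rp.can_fetch(AGENT, url):
    print("allowed to fetch", url)
else:
    print("robots.txt asks", AGENT, "not to fetch", url)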

-----Original Message-----
From: TDLN [mailto:diamond108@gmail.com] 
Sent: Thursday, March 30, 2006 3:34 AM
To: nutch-user@lucene.apache.org
Subject: Re: Legal issues

Google's and Yahoo's Terms of Service provide interesting reading regarding
such legal issues.

http://www.google.com/terms_of_service.html
http://docs.yahoo.com/info/terms/

Rgrds, Thomas

On 3/30/06, gekkokid <me...@gekkokid.org.uk> wrote:
>
> Shouldn't be a problem if you're honouring the robots.txt.
>
> Legal issues could be stealing copyrighted material? That's if you're
> reproducing it, but if you're analysing the content and links and keeping
> to the robots.txt rules I doubt you'll have a problem, unless it's crawling
> every 10 minutes.
>
> Wouldn't grabbing the RSS feed be better?
>
> Would http://diggdot.us be a good example of what you're trying to do?
> Or have I got the wrong idea entirely?
>
> Anyone else have any thoughts?
>
> _gk
>
> ----- Original Message -----
> From: "Berlin Brown" <be...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Thursday, March 30, 2006 8:13 AM
> Subject: Legal issues
>
>
> What are, say, the legal issues of crawling a site like reddit, digg or
> slashdot? Assuming that you are just collecting links that users post
> through that service and then regathering those links, I can't see an
> issue there.
>
> The other extreme would be crawling Google and requerying or something
> along those lines.
>
>


Re: Legal issues

Posted by TDLN <di...@gmail.com>.
Google's and Yahoo's Terms of Service provide interesting reading regarding
such legal issues.

http://www.google.com/terms_of_service.html
http://docs.yahoo.com/info/terms/

Rgrds, Thomas

On 3/30/06, gekkokid <me...@gekkokid.org.uk> wrote:
>
> Shouldn't be a problem if you're honouring the robots.txt.
>
> Legal issues could be stealing copyrighted material? That's if you're
> reproducing it, but if you're analysing the content and links and keeping
> to the robots.txt rules I doubt you'll have a problem, unless it's crawling
> every 10 minutes.
>
> Wouldn't grabbing the RSS feed be better?
>
> Would http://diggdot.us be a good example of what you're trying to do?
> Or have I got the wrong idea entirely?
>
> Anyone else have any thoughts?
>
> _gk
>
> ----- Original Message -----
> From: "Berlin Brown" <be...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Thursday, March 30, 2006 8:13 AM
> Subject: Legal issues
>
>
> What are, say, the legal issues of crawling a site like reddit, digg or
> slashdot? Assuming that you are just collecting links that users post
> through that service and then regathering those links, I can't see an
> issue there.
>
> The other extreme would be crawling Google and requerying or something
> along those lines.
>
>

Crawler

Posted by David Webster <tr...@loxinfo.co.th>.
Can someone recommend a good crawler to work with CLucene?


Re: Legal issues

Posted by gekkokid <me...@gekkokid.org.uk>.
Shouldn't be a problem if you're honouring the robots.txt.

Legal issues could be stealing copyrighted material? That's if you're
reproducing it, but if you're analysing the content and links and keeping to
the robots.txt rules I doubt you'll have a problem, unless it's crawling
every 10 minutes.

Wouldn't grabbing the RSS feed be better?

Would http://diggdot.us be a good example of what you're trying to do? Or
have I got the wrong idea entirely?

Anyone else have any thoughts?

_gk
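
On the RSS suggestion: for link-aggregation sites the feed already lists
exactly the user-submitted links, so it is usually politer and easier than
crawling the HTML. A rough sketch with the Python standard library; the feed
URL is a placeholder, and it assumes a plain RSS 2.0 feed (check the site's
terms and robots.txt first, as discussed above):

import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/rss"   # placeholder front-page feed of a digg-like site

with urllib.request.urlopen(FEED_URL) as resp:
    tree = ET.parse(resp)

# RSS 2.0 layout: <rss><channel><item><title/><link/></item>...</channel></rss>
# (an Atom feed would need different element names)
for item in tree.getroot().findall("./channel/item"):
    print(item.findtext("link", default=""), "-", item.findtext("title", default=""))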

----- Original Message ----- 
From: "Berlin Brown" <be...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, March 30, 2006 8:13 AM
Subject: Legal issues


What are, say, the legal issues of crawling a site like reddit, digg or
slashdot? Assuming that you are just collecting links that users post
through that service and then regathering those links, I can't see an
issue there.

The other extreme would be crawling Google and requerying or something
along those lines.