Posted to solr-user@lucene.apache.org by David Choi <ch...@gmail.com> on 2017/06/01 21:31:55 UTC

Solr Web Crawler - Robots.txt

Hello,

   I was wondering if anyone could guide me on how to crawl the web and
ignore robots.txt, since I cannot index some big sites, or point out how
to get around it. I read somewhere about a protocol.plugin.check.robots
setting, but that was for Nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index that site, I'm guessing because of its robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I'm guessing allows it. I was also wondering how to find out the name
of the crawler bin/post uses.
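
For what it's worth, the robots.txt check can be reproduced outside of Solr.
Here is a rough sketch using Python's standard library (the "*" agent is just
a placeholder, since I don't know what user agent bin/post sends):

from urllib.robotparser import RobotFileParser

# fetch and parse the site's robots.txt, then test a URL against it
rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://en.wikipedia.org/"))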

Re: Solr Web Crawler - Robots.txt

Posted by Vivek Pathak <vp...@orgmeta.com>.
I can help. We can chat in a Freenode chatroom in an hour or so.
Let me know where you hang out.

Thanks

Vivek


On 6/1/17 5:45 PM, Dave wrote:
> If you are not capable of even writing your own indexing code, let alone crawler, I would prefer that you just stop now.  No one is going to help you with this request, at least I'd hope not.
>
>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com> wrote:
>>
>> Hello,
>>
>>    I was wondering if anyone could guide me on how to crawl the web and
>> ignore the robots.txt since I can not index some big sites. Or if someone
>> could point how to get around it. I read somewhere about a
>> protocol.plugin.check.robots
>> but that was for nutch.
>>
>> The way I index is
>> bin/post -c gettingstarted https://en.wikipedia.org/
>>
>> but I can't index the site I'm guessing because of the robots.txt.
>> I can index with
>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>
>> which I am guessing allows it. I was also wondering how to find the name of
>> the crawler bin/post uses.


Re: Solr Web Crawler - Robots.txt

Posted by Dave <ha...@gmail.com>.
If you are not capable of even writing your own indexing code, let alone a crawler, I would prefer that you just stop now. No one is going to help you with this request, at least I'd hope not.

> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com> wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.

Re: Solr Web Crawler - Robots.txt

Posted by Jan Høydahl <ja...@cominvent.com>.
bin/post is not a crawler, just a small Java class that collects links from HTML pages using SolrCell. It respects very basic robots.txt rules but is far from implementing the full spec. It is just a local prototyping tool, not meant for production use.
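
Roughly what it does, sketched here in Python rather than the actual Java and
purely for illustration (this is not the real implementation):

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

class LinkCollector(HTMLParser):
    """Collect absolute link targets from anchor tags on one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def fetch_links(page_url, agent="*"):
    # check robots.txt before fetching the page or keeping any links
    # (a real crawler would fetch robots.txt separately for each host)
    robots = RobotFileParser(urljoin(page_url, "/robots.txt"))
    robots.read()
    if not robots.can_fetch(agent, page_url):
        return []
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    collector = LinkCollector(page_url)
    collector.feed(html)
    return [link for link in collector.links if robots.can_fetch(agent, link)]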

Jan Høydahl

> On Mar 1, 2020, at 09:27, Mutuhprasannth <mp...@gmail.com> wrote:
> 
> Have you found out the name of the crawler which is used by Solr bin/post or
> how to ignore robots.txt in Solr post tool
> 
> 
> 
> 

Re: Solr Web Crawler - Robots.txt

Posted by Mutuhprasannth <mp...@gmail.com>.
Have you found out the name of the crawler used by Solr's bin/post, or how to
make the Solr post tool ignore robots.txt?





Re: Solr Web Crawler - Robots.txt

Posted by Charlie Hull <ch...@flax.co.uk>.
On 02/06/2017 00:56, Doug Turnbull wrote:
> Scrapy is fantastic and I use it scrape search results pages for clients to
> take quality snapshots for relevance work

+1 for Scrapy; it was built by a team at Mydeco.com while we were 
building their search backend and has gone from strength to strength since.

Cheers

Charlie
>
> Ignoring robots.txt sometimes legit comes up because a staging site might
> be telling google not to crawl but don't care about a developer crawling
> for internal purposes.
>
> Doug
> On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood <wu...@wunderwood.org>
> wrote:
>
>> Which was exactly what I suggested.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Jun 1, 2017, at 3:31 PM, David Choi <ch...@gmail.com> wrote:
>>>
>>> In the mean time I have found a better solution at the moment is to test
>> on
>>> a site that allows users to crawl their site.
>>>
>>> On Thu, Jun 1, 2017 at 5:26 PM David Choi <ch...@gmail.com>
>> wrote:
>>>
>>>> I think you misunderstand the argument was about stealing content. Sorry
>>>> but I think you need to read what people write before making bold
>>>> statements.
>>>>
>>>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wu...@wunderwood.org>
>>>> wrote:
>>>>
>>>>> Let’s not get snarky right away, especially when you are wrong.
>>>>>
>>>>> Corporations do not generally ignore robots.txt. I worked on a
>> commercial
>>>>> web spider for ten years. Occasionally, our customers did need to
>> bypass
>>>>> portions of robots.txt. That was usually because of a
>> poorly-maintained web
>>>>> server, or because our spider could safely crawl some content that
>> would
>>>>> cause problems for other crawlers.
>>>>>
>>>>> If you want to learn crawling, don’t start by breaking the conventions
>> of
>>>>> good web citizenship. Instead, start with sitemap.xml and crawl the
>>>>> preferred portions of a site.
>>>>>
>>>>> https://www.sitemaps.org/index.html <
>> https://www.sitemaps.org/index.html>
>>>>>
>>>>> If the site blocks you, find a different site to learn on.
>>>>>
>>>>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>>>>> anything big, but I’d start with that for learning.
>>>>>
>>>>> https://scrapy.org/ <https://scrapy.org/>
>>>>>
>>>>> If you want to learn on a site with a lot of content, try ours,
>> chegg.com
>>>>> But if your crawler gets out of hand, crawling too fast, we’ll block
>> it.
>>>>> Any other site will do the same.
>>>>>
>>>>> I would not base the crawler directly on Solr. A crawler needs a
>>>>> dedicated database to record the URLs visited, errors, duplicates,
>> etc. The
>>>>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>>>>> (before Solr existed).
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> wunder@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>
>>>>>
>>>>>> On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com>
>> wrote:
>>>>>>
>>>>>> Oh well I guess its ok if a corporation does it but not someone
>> wanting
>>>>> to
>>>>>> learn more about the field. I actually have written a crawler before
>> as
>>>>>> well as the you know Inverted Index of how solr works but I just
>> thought
>>>>>> its architecture was better suited for scaling.
>>>>>>
>>>>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>> And I mean that in the context of stealing content from sites that
>>>>>>> explicitly declare they don't want to be crawled. Robots.txt is to be
>>>>>>> followed.
>>>>>>>
>>>>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com>
>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I was wondering if anyone could guide me on how to crawl the web and
>>>>>>>> ignore the robots.txt since I can not index some big sites. Or if
>>>>> someone
>>>>>>>> could point how to get around it. I read somewhere about a
>>>>>>>> protocol.plugin.check.robots
>>>>>>>> but that was for nutch.
>>>>>>>>
>>>>>>>> The way I index is
>>>>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>>>>>>>
>>>>>>>> but I can't index the site I'm guessing because of the robots.txt.
>>>>>>>> I can index with
>>>>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>>>>>>>
>>>>>>>> which I am guessing allows it. I was also wondering how to find the
>>>>> name
>>>>>>> of
>>>>>>>> the crawler bin/post uses.
>>>>>>>
>>>>>
>>>>>
>>
>>
>
>
>


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

Re: Solr Web Crawler - Robots.txt

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
Scrapy is fantastic and I use it to scrape search results pages for clients to
take quality snapshots for relevance work.

Ignoring robots.txt sometimes comes up legitimately: a staging site might be
telling Google not to crawl it, but nobody cares about a developer crawling it
for internal purposes.

Doug
On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:31 PM, David Choi <ch...@gmail.com> wrote:
> >
> > In the mean time I have found a better solution at the moment is to test
> on
> > a site that allows users to crawl their site.
> >
> > On Thu, Jun 1, 2017 at 5:26 PM David Choi <ch...@gmail.com>
> wrote:
> >
> >> I think you misunderstand the argument was about stealing content. Sorry
> >> but I think you need to read what people write before making bold
> >> statements.
> >>
> >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wu...@wunderwood.org>
> >> wrote:
> >>
> >>> Let’s not get snarky right away, especially when you are wrong.
> >>>
> >>> Corporations do not generally ignore robots.txt. I worked on a
> commercial
> >>> web spider for ten years. Occasionally, our customers did need to
> bypass
> >>> portions of robots.txt. That was usually because of a
> poorly-maintained web
> >>> server, or because our spider could safely crawl some content that
> would
> >>> cause problems for other crawlers.
> >>>
> >>> If you want to learn crawling, don’t start by breaking the conventions
> of
> >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> >>> preferred portions of a site.
> >>>
> >>> https://www.sitemaps.org/index.html <
> https://www.sitemaps.org/index.html>
> >>>
> >>> If the site blocks you, find a different site to learn on.
> >>>
> >>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> >>> anything big, but I’d start with that for learning.
> >>>
> >>> https://scrapy.org/ <https://scrapy.org/>
> >>>
> >>> If you want to learn on a site with a lot of content, try ours,
> chegg.com
> >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> it.
> >>> Any other site will do the same.
> >>>
> >>> I would not base the crawler directly on Solr. A crawler needs a
> >>> dedicated database to record the URLs visited, errors, duplicates,
> etc. The
> >>> output of the crawl goes to Solr. That is how we did it with Ultraseek
> >>> (before Solr existed).
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wunder@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>
> >>>> On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com>
> wrote:
> >>>>
> >>>> Oh well I guess its ok if a corporation does it but not someone
> wanting
> >>> to
> >>>> learn more about the field. I actually have written a crawler before
> as
> >>>> well as the you know Inverted Index of how solr works but I just
> thought
> >>>> its architecture was better suited for scaling.
> >>>>
> >>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> And I mean that in the context of stealing content from sites that
> >>>>> explicitly declare they don't want to be crawled. Robots.txt is to be
> >>>>> followed.
> >>>>>
> >>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I was wondering if anyone could guide me on how to crawl the web and
> >>>>>> ignore the robots.txt since I can not index some big sites. Or if
> >>> someone
> >>>>>> could point how to get around it. I read somewhere about a
> >>>>>> protocol.plugin.check.robots
> >>>>>> but that was for nutch.
> >>>>>>
> >>>>>> The way I index is
> >>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
> >>>>>>
> >>>>>> but I can't index the site I'm guessing because of the robots.txt.
> >>>>>> I can index with
> >>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>>>>>
> >>>>>> which I am guessing allows it. I was also wondering how to find the
> >>> name
> >>>>> of
> >>>>>> the crawler bin/post uses.
> >>>>>
> >>>
> >>>
>
>

Re: Solr Web Crawler - Robots.txt

Posted by Walter Underwood <wu...@wunderwood.org>.
Nutch was built for that, but it is a pain to use. I’m still sad that I couldn’t get Mike Lynch to open source Ultraseek; it was so easy to use and much more powerful than Nutch.

Ignoring robots.txt is often a bad idea. You may get into a REST API or into a calendar that generates an unending number of valid, different pages. Or the combinatorial explosion of diffs between revisions of a wiki page. Those are really fun.

There are some web servers that put a session ID in the path, so you get an endless set of URLs for the exact same page. We called those a “black hole” because it sucked spiders in and never let them out.
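
A crawler needs guards against those. Here is a rough illustration in Python
of the kind of check I mean (the thresholds are arbitrary, just to show the
idea):

from urllib.parse import urlparse, parse_qs

def looks_like_trap(url, max_depth=8, max_params=6):
    """Heuristic check for URLs that tend to be crawler black holes."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    # very deep paths, or paths that repeat a segment, often come from
    # calendars, wiki diffs, or session IDs embedded in the path
    if len(segments) > max_depth:
        return True
    if len(segments) != len(set(segments)):
        return True
    # an explosion of query parameters is another common black hole
    if len(parse_qs(parts.query)) > max_params:
        return True
    return False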

The comments in the Wikipedia robots.txt are instructive. For example, they allow access to the documentation for the REST API (Allow: /api/rest_v1/?doc) and then disallow the other paths in the API (Disallow: /api).

https://en.wikipedia.org/robots.txt <https://en.wikipedia.org/robots.txt>

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 4:58 PM, Mike Drob <md...@apache.org> wrote:
> 
> Isn't this exactly what Apache Nutch was built for?
> 
> On Thu, Jun 1, 2017 at 6:56 PM, David Choi <ch...@gmail.com> wrote:
> 
>> In any case after digging further I have found where it checks for
>> robots.txt. Thanks!
>> 
>> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wu...@wunderwood.org>
>> wrote:
>> 
>>> Which was exactly what I suggested.
>>> 
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Jun 1, 2017, at 3:31 PM, David Choi <ch...@gmail.com> wrote:
>>>> 
>>>> In the mean time I have found a better solution at the moment is to
>> test
>>> on
>>>> a site that allows users to crawl their site.
>>>> 
>>>> On Thu, Jun 1, 2017 at 5:26 PM David Choi <ch...@gmail.com>
>>> wrote:
>>>> 
>>>>> I think you misunderstand the argument was about stealing content.
>> Sorry
>>>>> but I think you need to read what people write before making bold
>>>>> statements.
>>>>> 
>>>>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <
>> wunder@wunderwood.org>
>>>>> wrote:
>>>>> 
>>>>>> Let’s not get snarky right away, especially when you are wrong.
>>>>>> 
>>>>>> Corporations do not generally ignore robots.txt. I worked on a
>>> commercial
>>>>>> web spider for ten years. Occasionally, our customers did need to
>>> bypass
>>>>>> portions of robots.txt. That was usually because of a
>>> poorly-maintained web
>>>>>> server, or because our spider could safely crawl some content that
>>> would
>>>>>> cause problems for other crawlers.
>>>>>> 
>>>>>> If you want to learn crawling, don’t start by breaking the
>> conventions
>>> of
>>>>>> good web citizenship. Instead, start with sitemap.xml and crawl the
>>>>>> preferred portions of a site.
>>>>>> 
>>>>>> https://www.sitemaps.org/index.html <
>>> https://www.sitemaps.org/index.html>
>>>>>> 
>>>>>> If the site blocks you, find a different site to learn on.
>>>>>> 
>>>>>> I like the looks of “Scrapy”, written in Python. I haven’t used it
>> for
>>>>>> anything big, but I’d start with that for learning.
>>>>>> 
>>>>>> https://scrapy.org/ <https://scrapy.org/>
>>>>>> 
>>>>>> If you want to learn on a site with a lot of content, try ours,
>>> chegg.com
>>>>>> But if your crawler gets out of hand, crawling too fast, we’ll block
>>> it.
>>>>>> Any other site will do the same.
>>>>>> 
>>>>>> I would not base the crawler directly on Solr. A crawler needs a
>>>>>> dedicated database to record the URLs visited, errors, duplicates,
>>> etc. The
>>>>>> output of the crawl goes to Solr. That is how we did it with
>> Ultraseek
>>>>>> (before Solr existed).
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wunder@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>>> 
>>>>>>> On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com>
>>> wrote:
>>>>>>> 
>>>>>>> Oh well I guess its ok if a corporation does it but not someone
>>> wanting
>>>>>> to
>>>>>>> learn more about the field. I actually have written a crawler before
>>> as
>>>>>>> well as the you know Inverted Index of how solr works but I just
>>> thought
>>>>>>> its architecture was better suited for scaling.
>>>>>>> 
>>>>>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> And I mean that in the context of stealing content from sites that
>>>>>>>> explicitly declare they don't want to be crawled. Robots.txt is to
>> be
>>>>>>>> followed.
>>>>>>>> 
>>>>>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> I was wondering if anyone could guide me on how to crawl the web
>> and
>>>>>>>>> ignore the robots.txt since I can not index some big sites. Or if
>>>>>> someone
>>>>>>>>> could point how to get around it. I read somewhere about a
>>>>>>>>> protocol.plugin.check.robots
>>>>>>>>> but that was for nutch.
>>>>>>>>> 
>>>>>>>>> The way I index is
>>>>>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>>>>>>>> 
>>>>>>>>> but I can't index the site I'm guessing because of the robots.txt.
>>>>>>>>> I can index with
>>>>>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>>>>>>>> 
>>>>>>>>> which I am guessing allows it. I was also wondering how to find
>> the
>>>>>> name
>>>>>>>> of
>>>>>>>>> the crawler bin/post uses.
>>>>>>>> 
>>>>>> 
>>>>>> 
>>> 
>>> 
>> 


Re: Solr Web Crawler - Robots.txt

Posted by Mike Drob <md...@apache.org>.
Isn't this exactly what Apache Nutch was built for?

On Thu, Jun 1, 2017 at 6:56 PM, David Choi <ch...@gmail.com> wrote:

> In any case after digging further I have found where it checks for
> robots.txt. Thanks!
>
> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wu...@wunderwood.org>
> wrote:
>
> > Which was exactly what I suggested.
> >
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Jun 1, 2017, at 3:31 PM, David Choi <ch...@gmail.com> wrote:
> > >
> > > In the mean time I have found a better solution at the moment is to
> test
> > on
> > > a site that allows users to crawl their site.
> > >
> > > On Thu, Jun 1, 2017 at 5:26 PM David Choi <ch...@gmail.com>
> > wrote:
> > >
> > >> I think you misunderstand the argument was about stealing content.
> Sorry
> > >> but I think you need to read what people write before making bold
> > >> statements.
> > >>
> > >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <
> wunder@wunderwood.org>
> > >> wrote:
> > >>
> > >>> Let’s not get snarky right away, especially when you are wrong.
> > >>>
> > >>> Corporations do not generally ignore robots.txt. I worked on a
> > commercial
> > >>> web spider for ten years. Occasionally, our customers did need to
> > bypass
> > >>> portions of robots.txt. That was usually because of a
> > poorly-maintained web
> > >>> server, or because our spider could safely crawl some content that
> > would
> > >>> cause problems for other crawlers.
> > >>>
> > >>> If you want to learn crawling, don’t start by breaking the
> conventions
> > of
> > >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> > >>> preferred portions of a site.
> > >>>
> > >>> https://www.sitemaps.org/index.html <
> > https://www.sitemaps.org/index.html>
> > >>>
> > >>> If the site blocks you, find a different site to learn on.
> > >>>
> > >>> I like the looks of “Scrapy”, written in Python. I haven’t used it
> for
> > >>> anything big, but I’d start with that for learning.
> > >>>
> > >>> https://scrapy.org/ <https://scrapy.org/>
> > >>>
> > >>> If you want to learn on a site with a lot of content, try ours,
> > chegg.com
> > >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> > it.
> > >>> Any other site will do the same.
> > >>>
> > >>> I would not base the crawler directly on Solr. A crawler needs a
> > >>> dedicated database to record the URLs visited, errors, duplicates,
> > etc. The
> > >>> output of the crawl goes to Solr. That is how we did it with
> Ultraseek
> > >>> (before Solr existed).
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wunder@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> > >>>
> > >>>> On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com>
> > wrote:
> > >>>>
> > >>>> Oh well I guess its ok if a corporation does it but not someone
> > wanting
> > >>> to
> > >>>> learn more about the field. I actually have written a crawler before
> > as
> > >>>> well as the you know Inverted Index of how solr works but I just
> > thought
> > >>>> its architecture was better suited for scaling.
> > >>>>
> > >>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> And I mean that in the context of stealing content from sites that
> > >>>>> explicitly declare they don't want to be crawled. Robots.txt is to
> be
> > >>>>> followed.
> > >>>>>
> > >>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com>
> > >>> wrote:
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> I was wondering if anyone could guide me on how to crawl the web
> and
> > >>>>>> ignore the robots.txt since I can not index some big sites. Or if
> > >>> someone
> > >>>>>> could point how to get around it. I read somewhere about a
> > >>>>>> protocol.plugin.check.robots
> > >>>>>> but that was for nutch.
> > >>>>>>
> > >>>>>> The way I index is
> > >>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
> > >>>>>>
> > >>>>>> but I can't index the site I'm guessing because of the robots.txt.
> > >>>>>> I can index with
> > >>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
> > >>>>>>
> > >>>>>> which I am guessing allows it. I was also wondering how to find
> the
> > >>> name
> > >>>>> of
> > >>>>>> the crawler bin/post uses.
> > >>>>>
> > >>>
> > >>>
> >
> >
>

Re: Solr Web Crawler - Robots.txt

Posted by David Choi <ch...@gmail.com>.
In any case, after digging further I have found where it checks for
robots.txt. Thanks!

On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:31 PM, David Choi <ch...@gmail.com> wrote:
> >
> > In the mean time I have found a better solution at the moment is to test
> on
> > a site that allows users to crawl their site.
> >
> > On Thu, Jun 1, 2017 at 5:26 PM David Choi <ch...@gmail.com>
> wrote:
> >
> >> I think you misunderstand the argument was about stealing content. Sorry
> >> but I think you need to read what people write before making bold
> >> statements.
> >>
> >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wu...@wunderwood.org>
> >> wrote:
> >>
> >>> Let’s not get snarky right away, especially when you are wrong.
> >>>
> >>> Corporations do not generally ignore robots.txt. I worked on a
> commercial
> >>> web spider for ten years. Occasionally, our customers did need to
> bypass
> >>> portions of robots.txt. That was usually because of a
> poorly-maintained web
> >>> server, or because our spider could safely crawl some content that
> would
> >>> cause problems for other crawlers.
> >>>
> >>> If you want to learn crawling, don’t start by breaking the conventions
> of
> >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> >>> preferred portions of a site.
> >>>
> >>> https://www.sitemaps.org/index.html <
> https://www.sitemaps.org/index.html>
> >>>
> >>> If the site blocks you, find a different site to learn on.
> >>>
> >>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> >>> anything big, but I’d start with that for learning.
> >>>
> >>> https://scrapy.org/ <https://scrapy.org/>
> >>>
> >>> If you want to learn on a site with a lot of content, try ours,
> chegg.com
> >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> it.
> >>> Any other site will do the same.
> >>>
> >>> I would not base the crawler directly on Solr. A crawler needs a
> >>> dedicated database to record the URLs visited, errors, duplicates,
> etc. The
> >>> output of the crawl goes to Solr. That is how we did it with Ultraseek
> >>> (before Solr existed).
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wunder@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>
> >>>> On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com>
> wrote:
> >>>>
> >>>> Oh well I guess its ok if a corporation does it but not someone
> wanting
> >>> to
> >>>> learn more about the field. I actually have written a crawler before
> as
> >>>> well as the you know Inverted Index of how solr works but I just
> thought
> >>>> its architecture was better suited for scaling.
> >>>>
> >>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> And I mean that in the context of stealing content from sites that
> >>>>> explicitly declare they don't want to be crawled. Robots.txt is to be
> >>>>> followed.
> >>>>>
> >>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I was wondering if anyone could guide me on how to crawl the web and
> >>>>>> ignore the robots.txt since I can not index some big sites. Or if
> >>> someone
> >>>>>> could point how to get around it. I read somewhere about a
> >>>>>> protocol.plugin.check.robots
> >>>>>> but that was for nutch.
> >>>>>>
> >>>>>> The way I index is
> >>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
> >>>>>>
> >>>>>> but I can't index the site I'm guessing because of the robots.txt.
> >>>>>> I can index with
> >>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>>>>>
> >>>>>> which I am guessing allows it. I was also wondering how to find the
> >>> name
> >>>>> of
> >>>>>> the crawler bin/post uses.
> >>>>>
> >>>
> >>>
>
>

Re: Solr Web Crawler - Robots.txt

Posted by Walter Underwood <wu...@wunderwood.org>.
Which was exactly what I suggested.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 3:31 PM, David Choi <ch...@gmail.com> wrote:
> 
> In the mean time I have found a better solution at the moment is to test on
> a site that allows users to crawl their site.
> 
> On Thu, Jun 1, 2017 at 5:26 PM David Choi <ch...@gmail.com> wrote:
> 
>> I think you misunderstand the argument was about stealing content. Sorry
>> but I think you need to read what people write before making bold
>> statements.
>> 
>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wu...@wunderwood.org>
>> wrote:
>> 
>>> Let’s not get snarky right away, especially when you are wrong.
>>> 
>>> Corporations do not generally ignore robots.txt. I worked on a commercial
>>> web spider for ten years. Occasionally, our customers did need to bypass
>>> portions of robots.txt. That was usually because of a poorly-maintained web
>>> server, or because our spider could safely crawl some content that would
>>> cause problems for other crawlers.
>>> 
>>> If you want to learn crawling, don’t start by breaking the conventions of
>>> good web citizenship. Instead, start with sitemap.xml and crawl the
>>> preferred portions of a site.
>>> 
>>> https://www.sitemaps.org/index.html <https://www.sitemaps.org/index.html>
>>> 
>>> If the site blocks you, find a different site to learn on.
>>> 
>>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>>> anything big, but I’d start with that for learning.
>>> 
>>> https://scrapy.org/ <https://scrapy.org/>
>>> 
>>> If you want to learn on a site with a lot of content, try ours, chegg.com
>>> But if your crawler gets out of hand, crawling too fast, we’ll block it.
>>> Any other site will do the same.
>>> 
>>> I would not base the crawler directly on Solr. A crawler needs a
>>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>>> (before Solr existed).
>>> 
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com> wrote:
>>>> 
>>>> Oh well I guess its ok if a corporation does it but not someone wanting
>>> to
>>>> learn more about the field. I actually have written a crawler before as
>>>> well as the you know Inverted Index of how solr works but I just thought
>>>> its architecture was better suited for scaling.
>>>> 
>>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
>>> wrote:
>>>> 
>>>>> And I mean that in the context of stealing content from sites that
>>>>> explicitly declare they don't want to be crawled. Robots.txt is to be
>>>>> followed.
>>>>> 
>>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I was wondering if anyone could guide me on how to crawl the web and
>>>>>> ignore the robots.txt since I can not index some big sites. Or if
>>> someone
>>>>>> could point how to get around it. I read somewhere about a
>>>>>> protocol.plugin.check.robots
>>>>>> but that was for nutch.
>>>>>> 
>>>>>> The way I index is
>>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>>>>> 
>>>>>> but I can't index the site I'm guessing because of the robots.txt.
>>>>>> I can index with
>>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>>>>> 
>>>>>> which I am guessing allows it. I was also wondering how to find the
>>> name
>>>>> of
>>>>>> the crawler bin/post uses.
>>>>> 
>>> 
>>> 


Re: Solr Web Crawler - Robots.txt

Posted by David Choi <ch...@gmail.com>.
In the meantime I have found that a better solution for the moment is to test
on a site that allows users to crawl it.

On Thu, Jun 1, 2017 at 5:26 PM David Choi <ch...@gmail.com> wrote:

> I think you misunderstand the argument was about stealing content. Sorry
> but I think you need to read what people write before making bold
> statements.
>
> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wu...@wunderwood.org>
> wrote:
>
>> Let’s not get snarky right away, especially when you are wrong.
>>
>> Corporations do not generally ignore robots.txt. I worked on a commercial
>> web spider for ten years. Occasionally, our customers did need to bypass
>> portions of robots.txt. That was usually because of a poorly-maintained web
>> server, or because our spider could safely crawl some content that would
>> cause problems for other crawlers.
>>
>> If you want to learn crawling, don’t start by breaking the conventions of
>> good web citizenship. Instead, start with sitemap.xml and crawl the
>> preferred portions of a site.
>>
>> https://www.sitemaps.org/index.html <https://www.sitemaps.org/index.html>
>>
>> If the site blocks you, find a different site to learn on.
>>
>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>> anything big, but I’d start with that for learning.
>>
>> https://scrapy.org/ <https://scrapy.org/>
>>
>> If you want to learn on a site with a lot of content, try ours, chegg.com
>> But if your crawler gets out of hand, crawling too fast, we’ll block it.
>> Any other site will do the same.
>>
>> I would not base the crawler directly on Solr. A crawler needs a
>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>> (before Solr existed).
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com> wrote:
>> >
>> > Oh well I guess its ok if a corporation does it but not someone wanting
>> to
>> > learn more about the field. I actually have written a crawler before as
>> > well as the you know Inverted Index of how solr works but I just thought
>> > its architecture was better suited for scaling.
>> >
>> > On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
>> wrote:
>> >
>> >> And I mean that in the context of stealing content from sites that
>> >> explicitly declare they don't want to be crawled. Robots.txt is to be
>> >> followed.
>> >>
>> >>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com>
>> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>>  I was wondering if anyone could guide me on how to crawl the web and
>> >>> ignore the robots.txt since I can not index some big sites. Or if
>> someone
>> >>> could point how to get around it. I read somewhere about a
>> >>> protocol.plugin.check.robots
>> >>> but that was for nutch.
>> >>>
>> >>> The way I index is
>> >>> bin/post -c gettingstarted https://en.wikipedia.org/
>> >>>
>> >>> but I can't index the site I'm guessing because of the robots.txt.
>> >>> I can index with
>> >>> bin/post -c gettingstarted http://lucene.apache.org/solr
>> >>>
>> >>> which I am guessing allows it. I was also wondering how to find the
>> name
>> >> of
>> >>> the crawler bin/post uses.
>> >>
>>
>>

Re: Solr Web Crawler - Robots.txt

Posted by David Choi <ch...@gmail.com>.
I think you misunderstand: the argument was about stealing content. Sorry,
but I think you need to read what people write before making bold
statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> Let’s not get snarky right away, especially when you are wrong.
>
> Corporations do not generally ignore robots.txt. I worked on a commercial
> web spider for ten years. Occasionally, our customers did need to bypass
> portions of robots.txt. That was usually because of a poorly-maintained web
> server, or because our spider could safely crawl some content that would
> cause problems for other crawlers.
>
> If you want to learn crawling, don’t start by breaking the conventions of
> good web citizenship. Instead, start with sitemap.xml and crawl the
> preferred portions of a site.
>
> https://www.sitemaps.org/index.html <https://www.sitemaps.org/index.html>
>
> If the site blocks you, find a different site to learn on.
>
> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> anything big, but I’d start with that for learning.
>
> https://scrapy.org/ <https://scrapy.org/>
>
> If you want to learn on a site with a lot of content, try ours, chegg.com
> But if your crawler gets out of hand, crawling too fast, we’ll block it.
> Any other site will do the same.
>
> I would not base the crawler directly on Solr. A crawler needs a dedicated
> database to record the URLs visited, errors, duplicates, etc. The output of
> the crawl goes to Solr. That is how we did it with Ultraseek (before Solr
> existed).
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com> wrote:
> >
> > Oh well I guess its ok if a corporation does it but not someone wanting
> to
> > learn more about the field. I actually have written a crawler before as
> > well as the you know Inverted Index of how solr works but I just thought
> > its architecture was better suited for scaling.
> >
> > On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com>
> wrote:
> >
> >> And I mean that in the context of stealing content from sites that
> >> explicitly declare they don't want to be crawled. Robots.txt is to be
> >> followed.
> >>
> >>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>>  I was wondering if anyone could guide me on how to crawl the web and
> >>> ignore the robots.txt since I can not index some big sites. Or if
> someone
> >>> could point how to get around it. I read somewhere about a
> >>> protocol.plugin.check.robots
> >>> but that was for nutch.
> >>>
> >>> The way I index is
> >>> bin/post -c gettingstarted https://en.wikipedia.org/
> >>>
> >>> but I can't index the site I'm guessing because of the robots.txt.
> >>> I can index with
> >>> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>>
> >>> which I am guessing allows it. I was also wondering how to find the
> name
> >> of
> >>> the crawler bin/post uses.
> >>
>
>

Re: Solr Web Crawler - Robots.txt

Posted by Walter Underwood <wu...@wunderwood.org>.
Let’s not get snarky right away, especially when you are wrong.

Corporations do not generally ignore robots.txt. I worked on a commercial web spider for ten years. Occasionally, our customers did need to bypass portions of robots.txt. That was usually because of a poorly-maintained web server, or because our spider could safely crawl some content that would cause problems for other crawlers.

If you want to learn crawling, don’t start by breaking the conventions of good web citizenship. Instead, start with sitemap.xml and crawl the preferred portions of a site.

https://www.sitemaps.org/index.html <https://www.sitemaps.org/index.html>

If the site blocks you, find a different site to learn on.

I like the looks of “Scrapy”, written in Python. I haven’t used it for anything big, but I’d start with that for learning.

https://scrapy.org/ <https://scrapy.org/>

If you want to learn on a site with a lot of content, try ours, chegg.com. But if your crawler gets out of hand, crawling too fast, we’ll block it. Any other site will do the same.

I would not base the crawler directly on Solr. A crawler needs a dedicated database to record the URLs visited, errors, duplicates, etc. The output of the crawl goes to Solr. That is how we did it with Ultraseek (before Solr existed).
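
To make the Scrapy suggestion concrete, here is a minimal sketch of a
sitemap-driven spider (untested; the site URL and field names are
placeholders, and getting the items into Solr is left to a separate step):

from scrapy.spiders import SitemapSpider

class DocsSpider(SitemapSpider):
    name = "docs"
    sitemap_urls = ["https://www.example.com/sitemap.xml"]  # placeholder site
    custom_settings = {
        "ROBOTSTXT_OBEY": True,   # stay a good web citizen
        "DOWNLOAD_DELAY": 1.0,    # crawl slowly so you don't get blocked
    }

    def parse(self, response):
        # yield plain dicts; a pipeline or a separate step posts them to Solr
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "body": " ".join(response.css("body ::text").getall()),
        }

Run it with something like "scrapy runspider docs_spider.py -o out.json" and
feed the output to Solr afterwards.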

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 3:01 PM, David Choi <ch...@gmail.com> wrote:
> 
> Oh well I guess its ok if a corporation does it but not someone wanting to
> learn more about the field. I actually have written a crawler before as
> well as the you know Inverted Index of how solr works but I just thought
> its architecture was better suited for scaling.
> 
> On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com> wrote:
> 
>> And I mean that in the context of stealing content from sites that
>> explicitly declare they don't want to be crawled. Robots.txt is to be
>> followed.
>> 
>>> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>>  I was wondering if anyone could guide me on how to crawl the web and
>>> ignore the robots.txt since I can not index some big sites. Or if someone
>>> could point how to get around it. I read somewhere about a
>>> protocol.plugin.check.robots
>>> but that was for nutch.
>>> 
>>> The way I index is
>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>> 
>>> but I can't index the site I'm guessing because of the robots.txt.
>>> I can index with
>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>> 
>>> which I am guessing allows it. I was also wondering how to find the name
>> of
>>> the crawler bin/post uses.
>> 


Re: Solr Web Crawler - Robots.txt

Posted by David Choi <ch...@gmail.com>.
Oh well, I guess it's OK if a corporation does it but not someone wanting to
learn more about the field. I have actually written a crawler before, as well
as an inverted index like the one Solr uses, but I just thought Solr's
architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave <ha...@gmail.com> wrote:

> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
>
> > On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com> wrote:
> >
> > Hello,
> >
> >   I was wondering if anyone could guide me on how to crawl the web and
> > ignore the robots.txt since I can not index some big sites. Or if someone
> > could point how to get around it. I read somewhere about a
> > protocol.plugin.check.robots
> > but that was for nutch.
> >
> > The way I index is
> > bin/post -c gettingstarted https://en.wikipedia.org/
> >
> > but I can't index the site I'm guessing because of the robots.txt.
> > I can index with
> > bin/post -c gettingstarted http://lucene.apache.org/solr
> >
> > which I am guessing allows it. I was also wondering how to find the name
> of
> > the crawler bin/post uses.
>

Re: Solr Web Crawler - Robots.txt

Posted by Dave <ha...@gmail.com>.
And I mean that in the context of stealing content from sites that explicitly declare they don't want to be crawled. Robots.txt is to be followed. 

> On Jun 1, 2017, at 5:31 PM, David Choi <ch...@gmail.com> wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.

Re: Solr Web Crawler - Robots.txt

Posted by Mutuhprasannth <mp...@gmail.com>.
Hi David Choi,

Have you found out the name of the crawler used by Solr's bin/post?


