Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2014/08/02 13:27:09 UTC
Why is that few http sites doesn't get crawled.
Hi,
This may be a naive question. Apologies for that.
I was trying to crawl the Quora Q&A site. The seed file contains these two
URLs:
http://www.quora.com/Data-Visualization/
http://www.quora.com/
But the crawl didn't pick up either of these pages. Why?
When I give "http://nutch.apache.org/" instead, that site gets crawled.
Note that I have not put any restriction in the regex URL filter. The only
rule is "+." (accept everything).
Thanks - David
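For readers checking the same thing: the sketch below illustrates why a lone "+." rule should not be what blocks these URLs. It assumes the usual reading of regex-urlfilter.txt, where each rule is "+" or "-" followed by a regex, the first rule that matches anywhere in the URL decides, and a URL that matches no rule is rejected. The names parse_rules and accepts are hypothetical, and Python's re module only approximates the Java regex engine that Nutch uses.

```python
import re
from typing import List, Tuple

# Each rule is (accept?, compiled pattern). This is an illustration,
# not Nutch's own code.
Rule = Tuple[bool, "re.Pattern[str]"]


def parse_rules(text: str) -> List[Rule]:
    """Parse filter lines of the form '+regex' or '-regex'."""
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: List[Rule], url: str) -> bool:
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    open_filter = parse_rules("+.")
    for url in ("http://www.quora.com/", "http://nutch.apache.org/"):
        print(url, accepts(open_filter, url))
```

Under these assumptions both seed URLs pass the filter, so the cause has to be somewhere else in the fetch cycle.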
Re: Why is that few http sites doesn't get crawled.
Posted by John Lafitte <jl...@brandextract.com>.
It looks like Quora only allows some specific crawlers:
http://www.quora.com/robots.txt
nutch.apache.org doesn't have a robots.txt, so nothing stops Nutch from
fetching it.
On Sat, Aug 2, 2014 at 10:17 AM, Bin Wang <bi...@gmail.com> wrote:
> Hi David,
>
> Maybe the page requests are disallowed by robots.txt, which Nutch
> obeys by default? Can you check?
>
> Bin
Re: Why is that few http sites doesn't get crawled.
Posted by Bin Wang <bi...@gmail.com>.
Hi David,
Maybe the page requests are disallowed by robots.txt, which Nutch
obeys by default? Can you check?
Bin
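One way to run that check outside Nutch is with Python's standard urllib.robotparser. This is only a rough sketch: the robots.txt text below is a made-up example of an allowlist-style file (one named crawler allowed, everyone else blocked), not a copy of Quora's, and the agent name "MyNutchCrawler" is a placeholder for whatever http.agent.name you have configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist-style robots.txt. NOT the real Quora file.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://www.quora.com/Data-Visualization/"
    for agent in ("Googlebot", "MyNutchCrawler"):
        print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

To test against a live site, RobotFileParser also provides set_url() and read() to download the file instead of parsing a string. If your agent name comes back as not allowed, a crawler that obeys robots.txt will skip those pages.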