Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2014/08/02 13:27:09 UTC

Why is it that a few http sites don't get crawled?

Hi,

   This may be a naive question. Apologies for that.

I was trying to crawl the Quora Q&A site. The seed file contains these two
URLs:
http://www.quora.com/Data-Visualization/
http://www.quora.com/

But the crawl didn't pick up either of these pages. Why?

When I give "http://nutch.apache.org/" instead, that site gets crawled.

Note that I have not put any restriction in the regex URL filter. It is +.

Thanks - David

Re: Why is it that a few http sites don't get crawled?

Posted by John Lafitte <jl...@brandextract.com>.

It looks like Quora only allows some specific crawlers:
http://www.quora.com/robots.txt

nutch.apache.org doesn't have a robots.txt, so nothing there is disallowed.
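
To see how an allowlist-style robots.txt shuts out an unlisted crawler, here
is a minimal sketch using Python's standard urllib.robotparser. The sample
rules and the agent name "MyNutchCrawler" are hypothetical and are not a copy
of Quora's actual file. Nutch has its own robots.txt handling, so treat this
only as an approximation of what it decides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the allowlist style described above:
# one named crawler is allowed everywhere, everyone else is blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    seeds = [
        "http://www.quora.com/Data-Visualization/",
        "http://www.quora.com/",
    ]
    # "MyNutchCrawler" stands in for whatever agent name the crawl uses.
    for agent in ("Googlebot", "MyNutchCrawler"):
        for url in seeds:
            print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

To check a live site instead, RobotFileParser also offers set_url() and
read() to download the file before calling can_fetch().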


Re: Why is it that a few http sites don't get crawled?

Posted by Bin Wang <bi...@gmail.com>.

Hi David,

Maybe the page requests have been disallowed by robots.txt, which Nutch
obeys by default? Can you check?

Bin