Posted to user@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/08/18 01:40:56 UTC

Sitemap detection bug?

Hi,

I've been unable to detect the sitemap for
https://www.mscdirect.com/robots.txt. I did some searching, and I think
it might be due to the line-spacing format of their robots.txt. I tried
user-agent=Googlebot, but that didn't help either. Could someone
reproduce the problem?

Thanks!

Michael


Re: Sitemap detection bug?

Posted by Michael Chen <yi...@u.northwestern.edu>.
Hi Sebastian,

Sorry, I forgot to reply to the list.

I remember enabling debug logging once before and finding that the parsing of robots.txt stops after it finds the entry relevant to the crawler ID. Is the sitemap information shown there too?

It would also be great if someone could test it on 2.x, which should be very quick. I'm positive that there is something specific to MSCDirect that's blocking the sitemap extraction; other sites work.
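
For reference, here is a rough standalone sketch (not the Nutch code path) that feeds the
robots.txt to crawler-commons' SimpleRobotRulesParser and prints the extracted sitemaps for
a couple of example agent names; it assumes the 0.5-style parseContent() / getSitemaps() API:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class SitemapCheck {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "https://www.mscdirect.com/robots.txt";

        // Fetch the raw robots.txt bytes
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = new URL(robotsUrl).openStream()) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        byte[] content = buf.toByteArray();

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // The agent names here are just examples, to rule out the crawler ID as the cause
        for (String agent : new String[] { "nutch", "googlebot" }) {
            BaseRobotRules rules =
                parser.parseContent(robotsUrl, content, "text/plain", agent);
            System.out.println(agent + " -> sitemaps: " + rules.getSitemaps());
        }
    }
}

If the sitemap URL shows up for both agent names, the robots.txt formatting is probably not
the issue and the problem is more likely on the Nutch 2.x side.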

Thanks!
Michael

> On Aug 18, 2017, at 12:41, Sebastian Nagel <wa...@googlemail.com> wrote:
> 
> Hi Michael,
> 
> yes, I tried the mentioned sitemap with crawler-commons. The sitemap URL was detected in the
> robots.txt file. It needs some more debugging. The problem for me: I don't know 2.x from running
> any production crawler, so it will take longer for me to get into it.
> 
> But would you mind moving all discussions to user@nutch? It's important
> to keep them public, as a form of documentation.
> 
> Thanks,
> Sebastian
> 
> 
>> On 08/18/2017 08:10 PM, Michael Chen wrote:
>> Could you check it for mscdirect.com? Some documentation on sitemaps suggests that there should be a
>> blank line before the sitemap entries, which MSCDirect doesn't have. It might also have something to do
>> with the crawler ID?
>> 
>> Please let me know if I can provide you with any additional information.
>> 
>> Thank you!
>> 
>> Michael
>> 
>> 
>>> On 08/18/2017 06:16 AM, Sebastian Nagel wrote:
>>> Hi Michael,
>>> 
>>> I've checked crawler-commons, which is used for robots.txt parsing (the recent version and also the 0.5
>>> release used by Nutch 2.x). It seems to work, but it needs a closer look to find where the problem is.
>>> 
>>> Best,
>>> Sebastian
>>> 
>>>> On 08/18/2017 03:40 AM, Michael Chen wrote:
>>>> Hi,
>>>> 
>>>> I've been unable to detect the sitemap for https://www.mscdirect.com/robots.txt. I did some
>>>> searching, and I think it might be due to the line-spacing format of their robots.txt. I tried
>>>> user-agent=Googlebot, but that didn't help either. Could someone reproduce the problem?
>>>> 
>>>> Thanks!
>>>> 
>>>> Michael
>>>> 
>> 
>