Posted to user@nutch.apache.org by Raj Chidara <ra...@ddismart.com> on 2022/11/23 10:07:02 UTC

Few websites not crawling

I am not able to crawl these websites. They do not have a robots.txt file. Can anyone suggest a solution for this?

https://www.cmde.org.cn/ 

https://www.bfarm.de/EN/Home/_node.html


Thanks and Regards

Raj Chidara

Re[2]: Few websites not crawling

Posted by Raj Chidara <ra...@ddismart.com>.

Hello Markus
  I am receiving HTTP status 202 for the Chinese site and HTTP status 403 for the German site, and the crawl stops without fetching a single URL.


Thanks and Regards

Raj Chidara

Mobile: +91-7680929509

----- Original Message -----
From: Markus Jelsma (markus.jelsma@openindex.io)
Date: 23-11-2022 16:52
To: user@nutch.apache.org
Subject: Re: Few websites not crawling

Hello,

The German site is crawlable, but it produces awful URLs with a
;jsessionid=<> parameter attached. The Chinese site is all JavaScript, so it
requires the HtmlUnit or Selenium protocol plugin to work at all, and there is
no guarantee that it will.

Regards,
Markus


Re: Few websites not crawling

Posted by Markus Jelsma <ma...@openindex.io>.

Hello,

The German site is crawlable, but it produces awful URLs with a
;jsessionid=<> parameter attached. The Chinese site is all JavaScript, so it
requires the HtmlUnit or Selenium protocol plugin to work at all, and there is
no guarantee that it will.
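
As a rough sketch of how such session IDs could be stripped before URLs enter the crawl, a rule along these lines can go into conf/regex-normalize.xml, which is read by the urlnormalizer-regex plugin. The pattern is an assumption written for this illustration, not taken from the thread or from the site. The regex-normalize.xml that ships with Nutch may already contain a broader session-ID rule, so check that file before adding another.

```xml
<!-- Sketch only: a rule for conf/regex-normalize.xml (urlnormalizer-regex plugin). -->
<!-- The pattern below is an illustrative assumption, not a rule from the thread. -->
<regex-normalize>
  <regex>
    <!-- Match ";jsessionid=..." up to the next "?" or "#", or the end of the URL. -->
    <pattern>(?i);jsessionid=[^?#]*([?#]|$)</pattern>
    <!-- Keep only the delimiter that followed the session ID, if any. -->
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

With this rule, a hypothetical URL such as https://example.org/page.html;jsessionid=ABC123?nn=1 would be normalized to https://example.org/page.html?nn=1, provided urlnormalizer-regex is listed in plugin.includes.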

Regards,
Markus

On Wed, 23 Nov 2022 at 11:07, Raj Chidara <ra...@ddismart.com> wrote:

>
> I am not able to crawl these websites. They do not have a robots.txt file.
> Can anyone suggest a solution for this?
>
> https://www.cmde.org.cn/
>
> https://www.bfarm.de/EN/Home/_node.html
>
>
> Thanks and Regards
>
> Raj Chidara

Re: Not able to crawl ich

Posted by Markus Jelsma <ma...@openindex.io>.

Hello Raj,

This site loads its content via JavaScript, so you need a protocol plugin
that supports it. HtmlUnit does not seem to work with this site, but
Selenium does. Please change the protocol plugin accordingly in your
plugin.includes configuration directive.

I tested it with our own parser, as I have no Nutch here at the moment. But
Nutch has support for Selenium, so it should work, even though the version is a
bit outdated.
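
The plugin.includes change described above might look roughly like the following override in conf/nutch-site.xml. This is a sketch under assumptions: only the swap to protocol-selenium comes from the advice in this message, while the rest of the value is a typical plugin list used for illustration and should be replaced with whatever your installation already uses. Selenium also needs a working browser and driver on the crawling machine, which is not shown here.

```xml
<!-- Sketch only: override in conf/nutch-site.xml. -->
<!-- protocol-selenium replaces the usual protocol-http entry; every other -->
<!-- plugin in the value is an assumed, typical entry and may differ locally. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```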

Regards,
Markus

On Sat, 17 Dec 2022 at 10:28, Raj Chidara <ra...@ddismart.com> wrote:

>
> Hi
>   I am not able to crawl this site: https://www.ich.org/. Can anyone
> suggest a solution for this? This site does not have a robots.txt file. When
> I try to check robots.txt, the site is shown as under construction and
> returns response status 200. Could that be the reason for the issue?
>
>
>
> Thanks and Regards
>
> Raj Chidara

Not able to crawl ich

Posted by Raj Chidara <ra...@ddismart.com>.

Hi
  I am not able to crawl this site: https://www.ich.org/. Can anyone suggest a solution for this? This site does not have a robots.txt file. When I try to check robots.txt, the site is shown as under construction and returns response status 200. Could that be the reason for the issue?



Thanks and Regards

Raj Chidara