Posted to user@nutch.apache.org by Raj Chidara <ra...@ddismart.com> on 2023/01/30 10:39:30 UTC

Site is not crawling

Hello,

  Nutch is not able to crawl this site.  Are there any Nutch configuration changes required for this site?

https://www.ich.org/


Thanks and Regards

Raj Chidara



Re: Re[2]: Site is not crawling

Posted by Abhay Ratnaparkhi <ab...@gmail.com>.
What I ended up doing:
- Developed a service to fetch pages (using Node.js with Google Puppeteer,
https://pptr.dev/, for fetching).
- Used browserless (https://www.browserless.io/) so that the fetch service
uses live Chromium browser instances.
- Scaled all of this in a Kubernetes cluster so we can fetch many pages
simultaneously.
- Developed a plugin for Nutch that uses the fetch service to fetch pages.

This is a better solution than using HTMLUnit or Selenium; Puppeteer works
great by comparison.


On Sun, Aug 13, 2023 at 2:53 PM Markus Jelsma <ma...@openindex.io>
wrote:

> Hello Raj,
>
> I see. Unfortunately, turning on JavaScript-supporting protocol plugins
> such as HtmlUnit or Selenium does not always solve the problem.
>
> Maybe you can ask at the Selenium project about this. They are the experts
> on that particular problem.
>
> Regards,
> Markus

Re: Re[2]: Site is not crawling

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Raj,

I see. Unfortunately, turning on JavaScript-supporting protocol plugins such
as HtmlUnit or Selenium does not always solve the problem.

Maybe you can ask at the Selenium project about this. They are the experts
on that particular problem.

Regards,
Markus

On Tue, 1 Aug 2023 at 19:38, Raj Chidara <ra...@ddismart.com> wrote:

> Hello Markus
>   I have now removed all other protocol-* plugins and kept only
> protocol-selenium.  It now crawls a few pages.  However, no content is
> read from the pages: every page shows only the text *Home*.
>
> Thanks and Regards
> Raj Chidara

Re: Re[2]: Site is not crawling

Posted by Raj Chidara <ra...@ddismart.com>.
Hello Markus

  I have now removed all other protocol-* plugins and kept only protocol-selenium.  It now crawls a few pages.  However, no content is read from the pages: every page shows only the text "Home".



Thanks and Regards

Raj Chidara








---- On Mon, 30 Jan 2023 18:35:06 +0530 Markus Jelsma <ma...@openindex.io> wrote ---



Yes, remove the other protocol-* plugins from the configuration. With all 
three active it is not always determined which one is going to do the work. 
 

Re: Re[2]: Site is not crawling

Posted by Steven Zhu <hs...@gmail.com>.
Already unsubscribed. Why do I still get this email?
Thanks

Steven

On Mon, Jan 30, 2023 at 7:06 AM Markus Jelsma <ma...@openindex.io>
wrote:

> Yes, remove the other protocol-* plugins from the configuration. With all
> three active it is not always determined which one is going to do the work.

Re: Re[2]: Site is not crawling

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, remove the other protocol-* plugins from the configuration. With all
three active, it is not deterministic which one will do the work.
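
As an illustration of the advice above, a trimmed plugin.includes might look
like the sketch below. The value is the one Raj posted with protocol-http and
protocol-httpclient removed. The surrounding <property> wrapper and the choice
of conf/nutch-site.xml (the usual place for local overrides, rather than
editing conf/nutch-default.xml directly) are assumptions and are not stated
in this thread.

```xml
<!-- Sketch only: assumed to be placed in conf/nutch-site.xml, which overrides nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-selenium is the only protocol-* plugin left; the rest is the value posted in this thread -->
  <value>protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Swapping protocol-selenium for protocol-htmlunit would be the analogous
change for the HtmlUnit route.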

On Mon, 30 Jan 2023 at 12:50, Raj Chidara <ra...@ddismart.com> wrote:

>
> Hello Markus
>   Sorry for the duplicate question.  I added the selenium plugin in
> conf/nutch-default.xml and included the following:
>
> <name>plugin.includes</name>
>
> <value>protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> The site is still not being crawled.  Are there any additional steps
> required to install Selenium?  Please advise.
>
>
> Thanks and Regards
>
> Raj Chidara

Re[2]: Site is not crawling

Posted by Raj Chidara <ra...@ddismart.com>.
Hello Markus
  Sorry for the duplicate question.  I added the selenium plugin in conf/nutch-default.xml and included the following:

<name>plugin.includes</name>
  <value>protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

The site is still not being crawled.  Are there any additional steps required to install Selenium?  Please advise.


Thanks and Regards

Raj Chidara

----- Original Message -----
From: Markus Jelsma (markus.jelsma@openindex.io)
Date: 30-01-2023 16:26
To: user@nutch.apache.org
Subject: Re: Site is not crawling

Hello Raj,

I think the same question about the same site was asked here some time ago.
Anyway, this site loads its content via Javascript. You will need a
protocol plugin that supports it, either protocol-htmlunit, or
protocol-selenium, instead of protocol-http or any other.

Change the configuration for plugin.includes, and it should work.

Markus



Re: Site is not crawling

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Raj,

I think the same question about the same site was asked here some time ago.
Anyway, this site loads its content via JavaScript. You will need a
protocol plugin that supports it, either protocol-htmlunit or
protocol-selenium, instead of protocol-http or any other.

Change the configuration for plugin.includes, and it should work.

Markus

On Mon, 30 Jan 2023 at 10:39, Raj Chidara <ra...@ddismart.com> wrote:

>
> Hello,
>
>   Nutch is not able to crawl this site.  Are there any Nutch configuration
> changes required for this site?
>
> https://www.ich.org/
>
>
> Thanks and Regards
>
> Raj Chidara
>
>
>