Posted to user@nutch.apache.org by Raj Chidara <ra...@ddismart.com> on 2023/01/30 10:39:30 UTC
Siet is not crawling
Hello,
Nutch is not able to crawl this site. Are any Nutch configuration changes required for it?
https://www.ich.org/
Thanks and Regards
Raj Chidara
Re: Re[2]: Siet is not crawling
Posted by Abhay Ratnaparkhi <ab...@gmail.com>.
What I ended up doing is:
- Developed a service to fetch pages (used Node.js with Google Puppeteer,
https://pptr.dev/, for fetching).
- Used browserless (https://www.browserless.io/) so that the fetch service
uses live Chromium browser instances.
- Scaled all of this in a Kubernetes cluster so we can fetch many pages
simultaneously.
- Developed a plugin for Nutch which uses the fetch service to fetch pages.
This is a better solution than using HTMLUnit or Selenium; Puppeteer works
great by comparison.
On Sun, Aug 13, 2023 at 2:53 PM Markus Jelsma <ma...@openindex.io>
wrote:
> Hello Raj,
>
> I see. Unfortunately, turning on JavaScript-supporting protocol plugins such
> as HtmlUnit or Selenium does not always solve the problem.
>
> Maybe you can ask at the Selenium project about this. They are the experts
> on that particular problem.
>
> Regards,
> Markus
Re: Re[2]: Siet is not crawling
Posted by Markus Jelsma <ma...@openindex.io>.
Hello Raj,
I see. Unfortunately, turning on JavaScript-supporting protocol plugins such
as HtmlUnit or Selenium does not always solve the problem.
Maybe you can ask at the Selenium project about this. They are the experts
on that particular problem.
Regards,
Markus
On Tue, 1 Aug 2023 at 19:38, Raj Chidara <ra...@ddismart.com> wrote:
> Hello Markus
> I have now removed all the other protocol-* plugins and kept only
> protocol-selenium. It now crawls a few pages. However, no content is read
> from the pages: every page shows only the text *Home*.
>
> Thanks and Regards
> Raj Chidara
Re: Re[2]: Siet is not crawling
Posted by Raj Chidara <ra...@ddismart.com>.
Hello Markus
I have now removed all the other protocol-* plugins and kept only protocol-selenium. It now crawls a few pages. However, no content is read from the pages: every page shows only the text Home.
Thanks and Regards
Raj Chidara
---- On Mon, 30 Jan 2023 18:35:06 +0530 Markus Jelsma <ma...@openindex.io> wrote ---
Yes, remove the other protocol-* plugins from the configuration. With all
three active it is not always determined which one is going to do the work.
Re: Re[2]: Siet is not crawling
Posted by Steven Zhu <hs...@gmail.com>.
Already unsubscribed. Why do I still get this email?
Thanks
Steven
On Mon, Jan 30, 2023 at 7:06 AM Markus Jelsma <ma...@openindex.io>
wrote:
> Yes, remove the other protocol-* plugins from the configuration. With all
> three active it is not always determined which one is going to do the work.
Re: Re[2]: Siet is not crawling
Posted by Markus Jelsma <ma...@openindex.io>.
Yes, remove the other protocol-* plugins from the configuration. With all
three active it is not always determined which one is going to do the work.
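As a minimal sketch of that advice, the property could look like the fragment below. It assumes the override is placed in conf/nutch-site.xml (the usual place for local overrides; the thread edited conf/nutch-default.xml directly) and keeps the rest of the value exactly as quoted in this thread. It only illustrates the configuration change and is not a guaranteed fix for this site.

```xml
<!-- Sketch only: assumed to go inside the <configuration> element of
     conf/nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Same value as quoted in the thread, with protocol-http and
       protocol-httpclient removed, so protocol-selenium is the only
       protocol plugin left. -->
  <value>protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```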
On Mon, 30 Jan 2023 at 12:50, Raj Chidara <ra...@ddismart.com> wrote:
>
> Hello Markus
> Sorry for the duplicate question. I added the Selenium plugin in
> conf/nutch-default.xml and included the following:
>
> <name>plugin.includes</name>
>
> <value>protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> The site is still not being crawled. Are there any additional steps to
> follow for installing Selenium? Please suggest.
>
>
> Thanks and Regards
>
> Raj Chidara
Re[2]: Siet is not crawling
Posted by Raj Chidara <ra...@ddismart.com>.
Hello Markus
Sorry for the duplicate question. I added the Selenium plugin in conf/nutch-default.xml and included the following:
<name>plugin.includes</name>
<value>protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
The site is still not being crawled. Are there any additional steps to follow for installing Selenium? Please suggest.
Thanks and Regards
Raj Chidara
----- Original Message -----
From: Markus Jelsma (markus.jelsma@openindex.io)
Date: 30-01-2023 16:26
To: user@nutch.apache.org
Subject: Re: Siet is not crawling
Hello Raj,
I think the same question about the same site was asked here some time ago.
Anyway, this site loads its content via JavaScript. You will need a
protocol plugin that supports it, either protocol-htmlunit, or
protocol-selenium, instead of protocol-http or any other.
Change the configuration for plugin.includes, and it should work.
Markus
Re: Siet is not crawling
Posted by Markus Jelsma <ma...@openindex.io>.
Hello Raj,
I think the same question about the same site was asked here some time ago.
Anyway, this site loads its content via JavaScript. You will need a
protocol plugin that supports it, either protocol-htmlunit, or
protocol-selenium, instead of protocol-http or any other.
Change the configuration for plugin.includes, and it should work.
Markus
On Mon, 30 Jan 2023 at 10:39, Raj Chidara <ra...@ddismart.com> wrote:
>
> Hello,
>
> Nutch is not able to crawl this site. Are any Nutch configuration
> changes required for it?
>
> https://www.ich.org/
>
>
> Thanks and Regards
>
> Raj Chidara
>
>
>