Posted to user@nutch.apache.org by Daniele Cremonini <dc...@sedona.fr> on 2016/11/18 14:28:49 UTC

Nutch2 - What are exactly the steps to execute?

Hello,

I installed and configured Nutch2 with MongoDB and Elasticsearch.

I'm pretty convinced that the configuration is correct, but I don't see how
to invoke Nutch.

The page https://wiki.apache.org/nutch/NutchTutorial has, I think, enough
detail to run Nutch 1.x, but on https://wiki.apache.org/nutch/Nutch2Tutorial
the Invoke chapter is pretty thin.

What I did:

bin/nutch inject /apps/nutch-urls/
bin/nutch generate -topN 40
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all

but Nutch never tries to index any data. I know this because I added some
extra logging to ElasticIndexWriter.

Can anybody give me some ideas?

Thanks
Daniele

RE: Nutch2 - What are exactly the steps to execute?

Posted by Daniele Cremonini <dc...@sedona.fr>.
Thank you Tom and Marty,

Here is the snippet for configuring the plugin:

	<!-- activate the elasticsearch indexer plugin -->
	<property>
		<name>plugin.includes</name>
		<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
		<description>Regular expression naming plugin directory names to
		include.  Any plugin not matching this expression is excluded.
		In any case you need at least include the nutch-extensionpoints plugin. By
		default Nutch includes crawling just HTML and plain text via HTTP,
		and basic indexing and search plugins. In order to use HTTPS please enable
		protocol-httpclient, but be aware of possible intermittent problems with the
		underlying commons-httpclient library.
		</description>
	</property>
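
A quick way to sanity-check a plugin.includes value like this one is to test
individual plugin ids against the expression. The sketch below assumes that
Nutch accepts a plugin only when the whole id matches the expression, and it
uses grep -E -x as a stand-in for that check. plugin_included is a
hypothetical helper name, not something Nutch ships.

```shell
#!/bin/sh
# Sketch only: checks plugin ids against a plugin.includes expression.
# Assumption: a plugin is accepted when its WHOLE id matches the expression;
# grep -E -x (full-line match) stands in for that behaviour here.
# plugin_included is a hypothetical helper, not part of Nutch.

plugin_included() {
  # $1 = plugin.includes expression, $2 = plugin id (plugin directory name)
  printf '%s\n' "$2" | grep -E -x -q -- "$1"
}

# The value from the configuration snippet in this message.
INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)'

for id in indexer-elastic index-basic parse-html protocol-httpclient; do
  if plugin_included "$INCLUDES" "$id"; then
    echo "$id: included"
  else
    echo "$id: excluded"
  fi
done
```

Under that full-match assumption, indexer-elastic is reported as included and
protocol-httpclient as excluded. That would rule out the expression itself as
the reason the indexer is never called.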

And here is the Gist:
https://gist.github.com/dcremonini/563e612e9d5c7051ea31c3a7fd9f5966

One thing, among others, that I may be missing is the invertlinks step.
Cheers
Daniele Cremonini


RE: Nutch2 - What are exactly the steps to execute?

Posted by "Marty-Scott Sainty (NWIS - Software Development)" <Ma...@wales.nhs.uk>.
Hi Tom,

Make sure you have specified the Elasticsearch indexer plugin in /conf/nutch-site.xml:

  <property>
    <name>plugin.includes</name>
    <value>indexer-elastic</value>
  </property>


Re: Nutch2 - What are exactly the steps to execute?

Posted by Tom Chiverton <tc...@extravision.com>.
Please post the output of each step.

You might want to use something like a GitHub Gist for that as it could 
be fairly long over email.
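
One way to collect that output is a small wrapper that runs the steps from
the original post in order and keeps one log file per step. This is only a
sketch: NUTCH_BIN, LOG_DIR, run_step and run_all_steps are hypothetical names,
and the commands are copied from the original post rather than taken from the
Nutch documentation.

```shell
#!/bin/sh
# Sketch only: run each step from the original post and keep its output in a
# separate log file that can be pasted into a Gist.
# NUTCH_BIN, LOG_DIR, run_step and run_all_steps are hypothetical names.

NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
LOG_DIR="${LOG_DIR:-nutch-step-logs}"

run_step() {
  # $1 = step name used for the log file, remaining args = command to run
  step_name="$1"; shift
  mkdir -p "$LOG_DIR"
  step_log="$LOG_DIR/$step_name.log"
  echo "== $step_name: $*" > "$step_log"
  "$@" >> "$step_log" 2>&1
  step_status=$?
  cat "$step_log"          # also show the output on the terminal
  return "$step_status"    # pass the command's exit status to the caller
}

run_all_steps() {
  # The && chain stops at the first step that fails.
  run_step 1-inject   "$NUTCH_BIN" inject /apps/nutch-urls/ &&
  run_step 2-generate "$NUTCH_BIN" generate -topN 40 &&
  run_step 3-fetch    "$NUTCH_BIN" fetch -all &&
  run_step 4-parse    "$NUTCH_BIN" parse -all &&
  run_step 5-updatedb "$NUTCH_BIN" updatedb -all &&
  run_step 6-index    "$NUTCH_BIN" index -all
}

# Only start the sequence when the nutch script is actually present.
if [ -x "$NUTCH_BIN" ]; then
  run_all_steps
else
  echo "$NUTCH_BIN not found; run this from the Nutch runtime directory" >&2
fi
```

Each file in LOG_DIR then holds the command line and everything that step
printed, so the last log written shows which step the sequence stopped at.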

Tom

