Posted to user@nutch.apache.org by Srinivasan Ramaswamy <ur...@gmail.com> on 2016/11/30 00:16:12 UTC

unable to index to elasticsearch from nutch 1.12

I am using nutch-1.12. I downloaded the binary release and set it up as instructed in
the wiki. I have set the following properties in my nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>elastic.host</name>
  <value>localhost</value>
  <description>The hostname to send documents to using TransportClient.
  Either host and port must be defined or cluster.</description>
</property>

<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description>The port to connect to using TransportClient.</description>
</property>

<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be
  defined or cluster.</description>
</property>

After crawling, I try to index the content using the command:

$ bin/nutch index elasticsearch crawl/segments/20161129130824/

srramasw-osx:apache-nutch-1.12 srramasw$ bin/nutch index elasticsearch $s1
Segment dir is complete: crawl/segments/20161129130824.
Indexer: starting at 2016-11-29 16:07:03
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/srramasw/Tools/apache-nutch-1.12/elasticsearch/current
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
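
For what it's worth, the missing path above is literally the first command-line
argument with /current appended, which is the data directory Nutch expects
inside a crawldb. That suggests the indexer read "elasticsearch" as the crawldb
path rather than as a writer name. Assuming the default crawl layout
(crawl/crawldb and crawl/linkdb below are example paths), the 1.x invocation
looks roughly like:

  Usage: bin/nutch index <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>)

  $ bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20161129130824/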


I searched the web for this problem; many people reported that it could be due
to an Elasticsearch version mismatch. I made sure I am running Elasticsearch
1.4.1 locally.

Any idea what causes this error?


Thanks
Srini

Re: unable to index to elasticsearch from nutch 1.12

Posted by Yongyao Jiang <j....@gmail.com>.
Hi Srini,

I had the same problem before. Thanks to @Lewis, it was solved by building the
source code from the master branch: https://github.com/apache/nutch.
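
For anyone else hitting this, the source build is short, assuming Ant and a JDK
are installed; the usable runtime ends up under runtime/local:

$ git clone https://github.com/apache/nutch.git
$ cd nutch
$ ant runtime
$ runtime/local/bin/nutch index ...   # run the indexer from this build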

Now I am able to use it even with ES 2.3.3.

Thanks,
Yongyao

-- 
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University