Posted to user@nutch.apache.org by Hung Nguyen <Hu...@ambientdigitalgroup.com> on 2014/08/09 13:07:23 UTC

[Nutch 2.2.1] InjectorJob always fails

Hi,

I am trying to integrate Nutch into our Java application.
As I understand from the crawl shell script shipped with the Nutch source code, the flow should be (a rough Java sketch of driving these jobs follows the list):

- inject
- generate
- fetch
- parse
- updatedb
- invert links
- index
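
This is only a sketch of how I understand those steps map onto the Nutch 2.x job classes when driven from Java. The class names are from the 2.2.1 source tree and each job implements Hadoop's Tool, but the argument strings are my assumption based on the bin/nutch usage messages, so they may differ:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.DbUpdaterJob;
import org.apache.nutch.crawl.GeneratorJob;
import org.apache.nutch.crawl.InjectorJob;
import org.apache.nutch.fetcher.FetcherJob;
import org.apache.nutch.indexer.solr.SolrIndexerJob;
import org.apache.nutch.parse.ParserJob;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCycleSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // inject: seed the web table from a directory of URL files
        ToolRunner.run(conf, new InjectorJob(), new String[] { "urls" });
        // generate: mark a batch of URLs for fetching (all args optional)
        ToolRunner.run(conf, new GeneratorJob(), new String[] {});
        // fetch + parse: "-all" processes every generated batch
        ToolRunner.run(conf, new FetcherJob(), new String[] { "-all" });
        ToolRunner.run(conf, new ParserJob(), new String[] { "-all" });
        // updatedb: fold parse results (including inverted links) back in
        ToolRunner.run(conf, new DbUpdaterJob(), new String[] {});
        // index: push the fetched batches to Solr
        ToolRunner.run(conf, new SolrIndexerJob(),
            new String[] { conf.get("solr.server.url"), "-all" });
    }
}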

My setup is:

nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
   <property>
     <name>http.agent.name</name>
     <value>your-crawler-name</value>
   </property>
   <property>
     <name>storage.data.store.class</name>
     <value>org.apache.gora.hbase.store.HBaseStore</value>
     <description>Default class for storing data</description>
   </property>
   <property>
     <name>solr.server.url</name>
     <value>http://192.168.10.225:8983/solr</value>
     <description>Solr Server</description>
   </property>
   <property>
     <name>plugin.includes</name>
     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
     <description>Enabled plugins</description>
   </property>
   <property>
     <name>plugin.folders</name>
     <value>/Users/hungnguyen/NetBeansProjects/nutch/runtime/local/plugins/</value>
     <description>Plugins folder</description>
   </property>
   <property>
     <name>storage.crawl.id</name>
     <value>hcrawler</value>
     <description>This value helps differentiate between the datasets that
     the jobs in the crawl cycle generate and operate on. The value will
     be input to all the jobs, which will then use it as a prefix when
     accessing the schemas. The default configuration uses no id to prefix
     the schemas. The value could also be given as a command line argument
     to each job.
     </description>
   </property>
</configuration>
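
Since I am embedding Nutch rather than using bin/nutch, I also check that this nutch-site.xml is actually on the application classpath; as far as I can tell, Hadoop's Configuration quietly skips resources it cannot find. A quick sanity check (just a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ConfCheck {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Should print org.apache.gora.hbase.store.HBaseStore if the
        // overrides from nutch-site.xml were picked up.
        System.out.println(conf.get("storage.data.store.class"));
        System.out.println(conf.get("plugin.folders"));
    }
}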


HBase is installed and running on my laptop; I can see that when I run the inject job, Nutch connects to HBase without a problem.

The code I use to run Nutch is:

        // Build the Nutch configuration (nutch-default.xml plus nutch-site.xml)
        Configuration nutchConfiguration = NutchConfiguration.create();
        InjectorJob injectorJob = new InjectorJob(nutchConfiguration);
        // urlDirName is the directory containing the seed URL file(s)
        injectorJob.inject(new Path(urlDirName));
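
For completeness, the surrounding Worker class is essentially this (reconstructed from memory, so treat it as a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.InjectorJob;
import org.apache.nutch.util.NutchConfiguration;

public class Worker {
    public static void main(String[] args) throws Exception {
        // Seed directory, as it appears in the failing job's log below
        String urlDirName = "/Users/hungnguyen/NetBeansProjects/nutch/ANude/urls";
        Configuration nutchConfiguration = NutchConfiguration.create();
        InjectorJob injectorJob = new InjectorJob(nutchConfiguration);
        // inject() reads the seed list and populates the HBase web table
        injectorJob.inject(new Path(urlDirName));
    }
}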



It always fails with this exception:

14/08/09 18:12:14 WARN mapred.LocalJobRunner: job_local762776259_0001
java.lang.Exception: java.lang.AbstractMethodError: org.apache.nutch.scoring.opic.OPICScoringFilter.injectedScore(Ljava/lang/String;Lorg/apache/nutch/storage/WebPage;)V
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.AbstractMethodError: org.apache.nutch.scoring.opic.OPICScoringFilter.injectedScore(Ljava/lang/String;Lorg/apache/nutch/storage/WebPage;)V
at org.apache.nutch.scoring.ScoringFilters.injectedScore(ScoringFilters.java:108)
at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:181)
at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/08/09 18:12:14 INFO mapred.JobClient:  map 0% reduce 0%
14/08/09 18:12:14 INFO mapred.JobClient: Job complete: job_local762776259_0001
14/08/09 18:12:14 INFO mapred.JobClient: Counters: 0
Aug 09, 2014 6:12:14 PM Worker main
SEVERE: null
java.lang.RuntimeException: job failed: name=[hcrawler]inject /Users/hungnguyen/NetBeansProjects/nutch/ANude/urls, jobid=job_local762776259_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:255)
at Worker.main(Worker.java:87)

Could someone please point out which step I missed?

Thanks,

Hưng