Posted to user@nutch.apache.org by Luis Cappa Banda <lu...@gmail.com> on 2011/09/15 17:00:41 UTC

Integrating Nutch-1.3 SVN version into another project.

Hello.

I've downloaded Nutch 1.3 via Subversion and modified some classes slightly. My
intention is to build the artifacts of this "hacked" Nutch version with Maven
and use them from another Maven project that declares a dependency on that
customized version. Both projects (the customized Nutch and the other one) sit
inside a parent project that orchestrates the build by modules. The
configuration apparently looks fine and everything compiles correctly.
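
For context, the parent POM is roughly the following (the module names here are
simplified, not the real ones):

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>es.desa.empleate</groupId>
  <artifactId>crawling-parent</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>pom</packaging>
  <modules>
    <!-- The Nutch 1.3 checkout from SVN with my local changes, mavenized. -->
    <module>nutch-custom</module>
    <!-- The project that launches the crawl and depends on nutch-custom. -->
    <module>infojobs-crawler</module>
  </modules>
</project>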

When I launch a crawl with the Solr indexing option, the following error
appears:

2011-09-15 16:57:07,137 0    [main] INFO
es.desa.empleate.infojobs.CrawlingProperties  - Loading property file...
 2011-09-15 16:57:07,144 7    [main] INFO
es.desa.empleate.infojobs.CrawlingProperties  - Property file loaded!
 2011-09-15 16:57:07,145 8    [main] INFO
es.desa.empleate.infojobs.CrawlingProperties  - Retrieving property
'URLS_DIR'
 2011-09-15 16:57:07,145 8    [main] INFO
es.desa.empleate.infojobs.CrawlingProperties  - Retrieving property
'SOLR_SERVER'
 2011-09-15 16:57:07,145 8    [main] INFO
es.desa.empleate.infojobs.CrawlingProperties  - Retrieving property 'DEPTH'
 2011-09-15 16:57:07,145 8    [main] INFO
es.desa.empleate.infojobs.CrawlingProperties  - Retrieving property
'THREADS'
 2011-09-15 16:57:08,259 1122 [main] INFO
es.desa.empleate.infojobs.CrawlingProcess  - > Crawling process started...
2011-09-15 16:57:09,653 2516 [main] INFO  org.apache.nutch.crawl.Crawl  -
crawl started in: crawl-20110915165709
 2011-09-15 16:57:09,653 2516 [main] INFO  org.apache.nutch.crawl.Crawl  -
rootUrlDir =urls
 2011-09-15 16:57:09,653 2516 [main] INFO  org.apache.nutch.crawl.Crawl  -
threads = 10
 2011-09-15 16:57:09,653 2516 [main] INFO  org.apache.nutch.crawl.Crawl  -
depth = 3
 2011-09-15 16:57:09,653 2516 [main] INFO  org.apache.nutch.crawl.Crawl  -
solrUrl=http://localhost:8080/server_infojobs
 2011-09-15 16:57:10,090 2953 [main] INFO  org.apache.nutch.crawl.Injector
- Injector: starting at 2011-09-15 16:57:10
 2011-09-15 16:57:10,090 2953 [main] INFO  org.apache.nutch.crawl.Injector
- Injector: crawlDb: crawl-20110915165709/crawldb
 2011-09-15 16:57:10,090 2953 [main] INFO  org.apache.nutch.crawl.Injector
- Injector: urlDir:
/home/lcappa/Escritorio/workspaces/Tomcats/Tomcat2/apache-tomcat-6.0.29/urls
 2011-09-15 16:57:10,236 3099 [main] INFO  org.apache.nutch.crawl.Injector
- Injector: Converting injected urls to crawl db entries.
 2011-09-15 16:57:10,258 3121 [main] INFO
org.apache.hadoop.metrics.jvm.JvmMetrics  - Initializing JVM Metrics with
processName=JobTracker, sessionId=
 2011-09-15 16:57:10,328 3191 [main] WARN
org.apache.hadoop.mapred.JobClient  - No job jar file set.  User classes may
not be found. See JobConf(Class) or JobConf#setJar(String).
 2011-09-15 16:57:10,344 3207 [main] INFO
org.apache.hadoop.mapred.FileInputFormat  - Total input paths to process : 1
 2011-09-15 16:57:10,567 3430 [Thread-10] INFO
org.apache.hadoop.mapred.FileInputFormat  - Total input paths to process : 1
 2011-09-15 16:57:10,584 3447 [main] INFO
org.apache.hadoop.mapred.JobClient  - Running job: job_local_0001
 2011-09-15 16:57:10,642 3505 [Thread-10] INFO
org.apache.hadoop.mapred.MapTask  - numReduceTasks: 1
 2011-09-15 16:57:10,648 3511 [Thread-10] INFO
org.apache.hadoop.mapred.MapTask  - io.sort.mb = 100
 2011-09-15 16:57:10,772 3635 [Thread-10] INFO
org.apache.hadoop.mapred.MapTask  - data buffer = 79691776/99614720
 2011-09-15 16:57:10,772 3635 [Thread-10] INFO
org.apache.hadoop.mapred.MapTask  - record buffer = 262144/327680
 2011-09-15 16:57:10,794 3657 [Thread-10] WARN
org.apache.hadoop.mapred.LocalJobRunner  - job_local_0001
java.lang.RuntimeException: Error in configuring object
    at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 10 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 13 more
Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
    at
org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
    at
org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:71)
    at
org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
    at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
    at
org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
    ... 18 more
2011-09-15 16:57:11,587 4450 [main] INFO
org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
 2011-09-15 16:57:11,590 4453 [main] INFO
org.apache.hadoop.mapred.JobClient  - Job complete: job_local_0001
 2011-09-15 16:57:11,591 4454 [main] INFO
org.apache.hadoop.mapred.JobClient  - Counters: 0
 2011-09-15 16:57:11,591 4454 [main] ERROR
es.desa.empleate.infojobs.CrawlingProcess  - > INFOJOBS CRAWLING ERROR: Job
failed!
 2011-09-15 16:57:11,591 4454 [main] INFO
es.desa.empleate.infojobs.CrawlingProcess  -  > Crawling process finished.
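
The last "Caused by" is the interesting one: plugin.folders is not defined. As
far as I know, the Nutch distribution defines that property in
conf/nutch-default.xml, so my guess is that when the job runs from my own Maven
build those conf files simply aren't on the classpath. If that is the problem,
I suppose I would need something like this in a nutch-site.xml visible to my
project (the value is only a placeholder for wherever the Nutch plugins
directory ends up):

<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Directory (or comma-separated list of directories) where Nutch looks
         for its plugins. The path here is just a placeholder. -->
    <name>plugin.folders</name>
    <value>/path/to/nutch/plugins</value>
  </property>
</configuration>

As far as I understand, the .job file produced by the Ant build already bundles
the plugins and the conf directory, which is why I was also wondering about it
below.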


Looking at the error, I think I need to include the Nutch .job artifact too.
The question is: is that so? And if it is, how can I include it with Maven?
Any recommendation?
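
What I had in mind is something along these lines in the crawler module's POM
(the coordinates and the type are only guesses on my part, since the .job file
is produced by the Ant build and I don't know how it is supposed to be
installed or deployed to a repository):

<dependency>
  <!-- Hypothetical coordinates for the locally built, customized Nutch job artifact. -->
  <groupId>org.apache.nutch</groupId>
  <artifactId>nutch</artifactId>
  <version>1.3-custom</version>
  <!-- "job" as the type is a guess; the Ant build produces nutch-1.3.job. -->
  <type>job</type>
</dependency>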

Thank you very much.