Posted to user@nutch.apache.org by Manikandan Saravanan <ma...@thesocialpeople.net> on 2014/02/03 16:44:17 UTC

Nutch - Hadoop Help

Hi Lewis,

I’m Manikandan (hope you remember me from my issues with Nutch and Gora).

I’m running a fresh Hadoop cluster and trying to run Nutch 2.2.1 on top of it.

Here is the sequence of steps I’ve performed:

1. Downloaded the Nutch source and stored it in ~/temp
2. Untarred it and edited nutch-site.xml to add http.agent.name etc.
3. Downloaded the DMOZ URL list and parsed 776 of its URLs into a file called seed.txt
4. Ran “ant runtime” in the source directory
5. Moved the contents of runtime/deploy into /usr/local/nutch
6. Put seed.txt into an HDFS directory named dmoz
7. Downloaded [0] and saved it as a script file in /usr/local/nutch/bin
8. Put the crawl script into an HDFS directory named crawl
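In case it matters, by “put into HDFS” I mean commands along these lines (a sketch; the /user/hduser home directory and exact paths are from my setup):

```shell
# Copy the seed list into HDFS (directory names as in the steps above)
hadoop fs -mkdir /user/hduser/dmoz
hadoop fs -put ~/temp/seed.txt /user/hduser/dmoz/seed.txt

# Copy the crawl script [0] into HDFS as well
hadoop fs -mkdir /user/hduser/crawl
hadoop fs -put /usr/local/nutch/bin/crawl /user/hduser/crawl/crawl
```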

And then, I’m running this:
$HADOOP_HOME/bin/hadoop jar /usr/local/nutch/nutch.job org.apache.nutch.crawl.Crawler dmoz -dir /user/hduser/crawl -depth 3 -topN 5000

And I get this:
14/02/03 10:35:46 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
14/02/03 10:35:49 INFO input.FileInputFormat: Total input paths to process : 1
14/02/03 10:35:49 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/03 10:35:49 WARN snappy.LoadSnappy: Snappy native library not loaded
14/02/03 10:35:49 INFO mapred.JobClient: Running job: job_201402030708_0016
14/02/03 10:35:50 INFO mapred.JobClient:  map 0% reduce 0%
14/02/03 10:36:01 INFO mapred.JobClient:  map 100% reduce 0%
14/02/03 10:36:04 INFO mapred.JobClient: Job complete: job_201402030708_0016
14/02/03 10:36:04 INFO mapred.JobClient: Counters: 19
14/02/03 10:36:04 INFO mapred.JobClient:   Job Counters 
14/02/03 10:36:04 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11354
14/02/03 10:36:04 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/02/03 10:36:04 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/02/03 10:36:04 INFO mapred.JobClient:     Launched map tasks=1
14/02/03 10:36:04 INFO mapred.JobClient:     Data-local map tasks=1
14/02/03 10:36:04 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/02/03 10:36:04 INFO mapred.JobClient:   File Output Format Counters 
14/02/03 10:36:04 INFO mapred.JobClient:     Bytes Written=0
14/02/03 10:36:04 INFO mapred.JobClient:   injector
14/02/03 10:36:04 INFO mapred.JobClient:     urls_filtered=122
14/02/03 10:36:04 INFO mapred.JobClient:   FileSystemCounters
14/02/03 10:36:04 INFO mapred.JobClient:     HDFS_BYTES_READ=4684
14/02/03 10:36:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=79672
14/02/03 10:36:04 INFO mapred.JobClient:   File Input Format Counters 
14/02/03 10:36:04 INFO mapred.JobClient:     Bytes Read=4582
14/02/03 10:36:04 INFO mapred.JobClient:   Map-Reduce Framework
14/02/03 10:36:04 INFO mapred.JobClient:     Map input records=167
14/02/03 10:36:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=110018560
14/02/03 10:36:04 INFO mapred.JobClient:     Spilled Records=0
14/02/03 10:36:04 INFO mapred.JobClient:     CPU time spent (ms)=1450
14/02/03 10:36:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=58195968
14/02/03 10:36:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1001594880
14/02/03 10:36:04 INFO mapred.JobClient:     Map output records=0
14/02/03 10:36:04 INFO mapred.JobClient:     SPLIT_RAW_BYTES=102
14/02/03 10:36:04 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 122
14/02/03 10:36:04 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 0
14/02/03 10:36:04 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
14/02/03 10:36:04 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
14/02/03 10:36:04 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
14/02/03 10:36:08 INFO mapred.JobClient: Running job: job_201402030708_0017
14/02/03 10:36:09 INFO mapred.JobClient:  map 0% reduce 0%
14/02/03 10:36:17 INFO mapred.JobClient:  map 100% reduce 0%
14/02/03 10:36:26 INFO mapred.JobClient:  map 100% reduce 33%
14/02/03 10:36:28 INFO mapred.JobClient: Task Id : attempt_201402030708_0017_r_000000_0, Status : FAILED
java.lang.NullPointerException
	at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
	at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

Please help me resolve this.
[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople