Posted to user@nutch.apache.org by band_master <sw...@gmail.com> on 2013/07/23 22:20:15 UTC

Null Pointer Exception trying to run Nutch

Hello,
I am new to Nutch and have been trying desperately to get a basic web
crawler going using the following packages:

HBase 0.90.4
Nutch 2.2.1
Solr 4.3.0

I have HBase running and can execute commands via the terminal. I also have
Solr running, and I have used the schema-solr4.xml that ships with Nutch 2.2.1
as the schema.xml file under the conf folder of Solr's collection1. I even
added the "_version_" field that is missing from the schema-solr4.xml example.
I am having trouble, though, getting Nutch to work. I can successfully
inject URLs, but there seems to be an error in the Hadoop log around parsing
UTF8 characters.

Here are the contents of nutch-site.xml:

<configuration>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>http.agent.name</name>
  <value>SwirlCrawler</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>SwirlCrawler</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
<property>
  <name>plugin.folders</name>
  <value>/bin/apache-nutch-2.2.1/runtime/local/plugins</value>
</property>
</configuration>

and here are the contents of hadoop.log:

2013-07-23 13:07:19,615 INFO  crawl.InjectorJob - InjectorJob: Using class
org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
2013-07-23 13:07:19,662 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2013-07-23 13:07:19,882 WARN  snappy.LoadSnappy - Snappy native library not
loaded
2013-07-23 13:07:20,546 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2013-07-23 13:07:20,739 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2013-07-23 13:07:20,988 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2013-07-23 13:07:20,999 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2013-07-23 13:07:21,052 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2013-07-23 13:07:21,280 INFO  crawl.InjectorJob - InjectorJob: total number
of urls rejected by filters: 0
2013-07-23 13:07:21,280 INFO  crawl.InjectorJob - InjectorJob: total number
of urls injected after normalization and filtering: 3
2013-07-23 13:07:21,287 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-23 13:07:21,287 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-07-23 13:07:21,287 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2013-07-23 13:07:21,935 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2013-07-23 13:07:22,063 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-23 13:07:22,064 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-07-23 13:07:22,064 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2013-07-23 13:07:22,126 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'generate_host_count', using default
2013-07-23 13:07:22,258 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2013-07-23 13:07:22,272 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2013-07-23 13:07:22,273 WARN  mapred.LocalJobRunner -
job_local117641048_0002
java.lang.NullPointerException
	at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
	at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
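
For what it's worth, the trace bottoms out in Avro's Utf8(String) constructor, which has to encode, and therefore dereference, its argument. A plausible reading is that GeneratorReducer.setup fetched a batch id from the job configuration, got null back, and passed it straight to new Utf8(...). Here is a minimal, self-contained sketch of that failure mode; the toUtf8Bytes helper is hypothetical, standing in for the real constructor:

```java
import java.nio.charset.StandardCharsets;

public class Utf8NullDemo {
    // Stand-in for what a constructor like org.apache.avro.util.Utf8(String)
    // must do internally: encode the string, which dereferences it.
    static byte[] toUtf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8); // throws NPE when s == null
    }

    public static void main(String[] args) {
        String batchId = null; // e.g. a batch id the crawl job never set
        try {
            toUtf8Bytes(batchId);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException, as in hadoop.log");
        }
    }
}
```

If that reading is right, the problem is not in nutch-site.xml at all, but in how the generate step obtains its batch id.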

Please help?

cheers,
BD



--
View this message in context: http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-trying-to-run-Nutch-tp4079866.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Null Pointer Exception trying to run Nutch

Posted by Rui Gao <ga...@163.com>.
Hi,

I have the same software configuration and the same error under Cygwin + Windows XP.

The same error appears when using HBase 0.90.x. Here's the log:
2013-07-21 14:51:29,500 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 14:51:29,500 WARN  mapred.LocalJobRunner - job_local196483647_0002
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)







Re: Null Pointer Exception trying to run Nutch

Posted by band_master <sw...@gmail.com>.
Hi Lewis,
Thanks for your clarification. This is all new to me, so I am still trying
to understand the various steps involved in the crawl function. I was able
to execute individual commands like inject, generate, parse, and readdb.
When trying to execute 'bin/nutch crawl', however, I would get the
NullPointerException. After reading up a bit more, I see the 'crawl'
command is deprecated in Nutch 2 in favor of a script located at
'bin/crawl' that executes each command in sequence. I am able to
successfully crawl and index using the command below in the terminal:

/bin/apache-nutch-2.2.1/runtime/local$ bin/crawl urls 1
http://localhost:8983/solr/ 5
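
For the archive, here is my understanding of the job sequence that bin/crawl chains together, written out as a sketch. The exact commands and flags are my guesses from the 2.x docs, not copied from the script, so treat the list as illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlSteps {
    // Sketch of the job sequence the bin/crawl script appears to run;
    // the flags here are assumptions, not quoted from the script itself.
    static List<String> steps(int rounds, String solrUrl) {
        List<String> cmds = new ArrayList<>();
        cmds.add("bin/nutch inject urls");
        for (int i = 0; i < rounds; i++) {
            cmds.add("bin/nutch generate -topN 5");
            cmds.add("bin/nutch fetch -all");
            cmds.add("bin/nutch parse -all");
            cmds.add("bin/nutch updatedb");
        }
        cmds.add("bin/nutch solrindex " + solrUrl + " -all");
        return cmds;
    }

    public static void main(String[] args) {
        steps(1, "http://localhost:8983/solr/").forEach(System.out::println);
    }
}
```

The point being that each round is a generate/fetch/parse/updatedb cycle, with a single inject up front and indexing at the end.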

I know for some it might be obvious, but as a newbie it was pretty confusing
trying to get this going when most sample tutorials on the net cover
Nutch 1.x-specific requirements. Perhaps more could be said about the 'crawl'
deprecation on the official Nutch 2 tutorial page?

http://wiki.apache.org/nutch/Nutch2Tutorial




--
View this message in context: http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-trying-to-run-Nutch-tp4079866p4080174.html
Sent from the Nutch - User mailing list archive at Nabble.com.