Posted to user@nutch.apache.org by Rui Gao <ga...@163.com> on 2013/07/07 09:36:51 UTC

[2.2.1] What does inject job do?

Hello,

I have set up the Eclipse environment according to the wiki. Here is what I did before running the inject job:
1. I use SqlStore as the storage class.
2. I started an HSQL database which contains the table 'webpage'.
3. I added 1 URL to seed.txt.
Then I ran the inject job. It seems the job finished successfully, but no change was made to my HSQL database. Any thoughts about this? Here's the log:
InjectorJob: starting at 2013-07-07 15:28:42
InjectorJob: Injecting urlDir: urls/dev
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02

Best Regards,
Rui
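
The two summary lines in the log above report how many seed URLs were rejected by filters and how many were injected. A minimal sketch for pulling those counts out of such a log, assuming the line format shown in this thread (the class and method names are hypothetical and not part of Nutch):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: reads the InjectorJob summary counts
// from log text that uses the wording seen in this thread.
class InjectLogSummary {

    private static final Pattern REJECTED =
        Pattern.compile("total number of urls rejected by filters: (\\d+)");
    private static final Pattern INJECTED =
        Pattern.compile("total number of urls injected after normalization and filtering: (\\d+)");

    /** Returns the first count matched by the pattern, or -1 if no such line exists. */
    static int count(Pattern pattern, String log) {
        Matcher m = pattern.matcher(log);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    static int rejected(String log) {
        return count(REJECTED, log);
    }

    static int injected(String log) {
        return count(INJECTED, log);
    }

    public static void main(String[] args) {
        String log = String.join("\n",
            "InjectorJob: total number of urls rejected by filters: 0",
            "InjectorJob: total number of urls injected after normalization and filtering: 1");
        System.out.println("rejected=" + rejected(log) + ", injected=" + injected(log));
    }
}
```

These counts only reflect what the job itself reported; whether a row actually reached the backing store still has to be checked in the database.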

Re:Re: Re: [2.2.1] What does inject job do?

Posted by Rui Gao <ga...@163.com>.

I will put my own plugin in the second plugin folder. This works fine in Nutch-2.1.

At 2013-07-22 04:08:57,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>On Sat, Jul 20, 2013 at 10:58 PM, Rui Gao <ga...@163.com> wrote:
>
>> I checked the DB, the URL is already in DB.
>> The plugin property is configured like this:
>> <property>
>>   <name>plugin.folders</name>
>>   <value>./src/plugin,./plugins</value>
>>
>
>Any reason at all that you have two directories listed? Do you have two
>plugin directories with compiled classes in there?

Re: Re: [2.2.1] What does inject job do?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

On Sat, Jul 20, 2013 at 10:58 PM, Rui Gao <ga...@163.com> wrote:

> I checked the DB, the URL is already in DB.
> The plugin property is configured like this:
> <property>
>   <name>plugin.folders</name>
>   <value>./src/plugin,./plugins</value>
>

Any reason at all that you have two directories listed? Do you have two
plugin directories with compiled classes in there?

Re:Re:Re: [2.2.1] What does inject job do?

Posted by Rui Gao <ga...@163.com>.

I also hit an error when using HBase 0.90.x. Here's the log:
2013-07-21 14:51:29,500 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 14:51:29,500 WARN  mapred.LocalJobRunner - job_local196483647_0002
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
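
The trace above shows the NullPointerException being raised inside a constructor that was called from a setup method, which is consistent with a null string being passed in, for example a configuration value that was never set. Which value is null cannot be read from the trace itself. The sketch below illustrates that failure mode and a fail-fast guard; every name in it (including the key `example.batch.id`) is a hypothetical stand-in and none of it is Nutch, Avro, or Hadoop code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in classes used only to illustrate the failure mode.
class NullSettingDemo {

    /** Mimics a wrapper type whose constructor dereferences its argument. */
    static final class Text8 {
        final int length;

        Text8(String s) {
            this.length = s.length(); // throws NullPointerException when s is null
        }
    }

    /** Fails fast with a readable message when a required setting is absent. */
    static String require(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new IllegalStateException("Required setting '" + key + "' is not set");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        try {
            new Text8(conf.get("example.batch.id")); // unset key -> null -> NPE
        } catch (NullPointerException e) {
            System.out.println("NullPointerException from the constructor");
        }
        try {
            new Text8(require(conf, "example.batch.id"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a guard of this kind the error message names the missing setting instead of surfacing as a bare NullPointerException several frames down.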


Re:Re: [2.2.1] What does inject job do?

Posted by Rui Gao <ga...@163.com>.

I checked the DB; the URL is already there.
The plugin property is configured like this:
<property>
  <name>plugin.folders</name>
  <value>./src/plugin,./plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

I believe the plugin property is configured properly, because when I change it to another value, Nutch complains that the plugins cannot be found.
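
The description of plugin.folders quoted above says each element is either absolute (used as is) or relative (searched for on the classpath). A rough sketch of that rule, assuming a comma-separated value as in the property shown; the class and method names are hypothetical and this is not Nutch's own resolution code:

```java
import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: splits a plugin.folders value and
// reports how each entry would be treated under the rule in the description.
class PluginFolderCheck {

    /** Splits the comma-separated value, trimming whitespace and dropping empty entries. */
    static List<String> splitFolders(String value) {
        List<String> folders = new ArrayList<>();
        if (value == null) {
            return folders;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                folders.add(trimmed);
            }
        }
        return folders;
    }

    /** Absolute paths are checked on disk; relative ones are looked up on the classpath. */
    static String describe(String folder) {
        File f = new File(folder);
        if (f.isAbsolute()) {
            return folder + " -> absolute, "
                + (f.isDirectory() ? "directory exists" : "directory not found");
        }
        URL url = PluginFolderCheck.class.getClassLoader().getResource(folder);
        return folder + " -> relative, "
            + (url != null ? "found on classpath" : "not found on classpath");
    }

    public static void main(String[] args) {
        for (String folder : splitFolders("./src/plugin,./plugins")) {
            System.out.println(describe(folder));
        }
    }
}
```

Running a check like this from the same working directory and classpath as the Eclipse launch configuration shows which of the two entries can be resolved at all.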


Re: [2.2.1] What does inject job do?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Yes, the warnings you've shown now are fine; they come from the old mapred API.
It's OK.
It's now stating that you've got 1 URL injected. Can you check the DB?
Either check the contents directly or dump/read them with the readdb tool.
Please remember that, somewhere in the tutorial you reference, the absolute
path to the plugins folder needs to be changed. This is your problem here.
InjectorJob doesn't require plugins to work; however, when your indexing
plugins are called you will be in trouble. You need to sort this out.

-- 
*Lewis*

Re:Re: [2.2.1] What does inject job do?

Posted by Rui Gao <ga...@163.com>.

I am following this article: http://wiki.apache.org/nutch/RunNutchInEclipse. My environment is Windows XP + Cygwin + Eclipse.
I think the WARN lines at the top of the log are not the blocker. (plugin.folders contained an additional folder; after I removed it, the job still fails.) We can compare with the logs from the InjectorJob, which runs successfully:

2013-07-21 12:45:01,968 INFO  crawl.InjectorJob - InjectorJob: starting at 2013-07-21 12:45:01
2013-07-21 12:45:01,968 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: urls/dev
2013-07-21 12:45:02,921 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
2013-07-21 12:45:02,968 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-07-21 12:45:02,984 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:45:03,015 WARN  snappy.LoadSnappy - Snappy native library not loaded
2013-07-21 12:45:03,328 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-21 12:45:03,437 WARN  plugin.PluginRepository - Plugins: directory not found: ./plugins
2013-07-21 12:45:03,484 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-07-21 12:45:03,625 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 12:45:04,218 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
2013-07-21 12:45:04,218 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
2013-07-21 12:45:04,218 INFO  crawl.InjectorJob - Injector: finished at 2013-07-21 12:45:04, elapsed: 00:00:02

At 2013-07-21 12:36:29,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>Please read the exception trace. Are you running on Hadoop? You need to
>ensure that your plugin.folders property points to the right path. There is also
>a mention of a missing job file. Please ensure that your Nutch job file is
>on the Hadoop JobTracker classpath.
>hth
>
>On Saturday, July 20, 2013, Rui Gao <ga...@163.com> wrote:
>> Hi Lewis,
>>
>> I downgraded gora-core to 0.2.1, after which I could run InjectorJob
>> with both HSQL and MySQL. But the Crawler job still fails. Here's the log:
>> 2013-07-21 12:23:41,156 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>> 2013-07-21 12:23:41,203 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2013-07-21 12:23:41,234 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 2013-07-21 12:23:41,265 WARN  snappy.LoadSnappy - Snappy native library not loaded
>> 2013-07-21 12:23:41,578 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:41,718 WARN  plugin.PluginRepository - Plugins: directory not found: ./plugins
>> 2013-07-21 12:23:41,765 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>> 2013-07-21 12:23:41,937 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
>> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
>> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
>> 2013-07-21 12:23:42,468 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2013-07-21 12:23:42,593 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 2013-07-21 12:23:42,796 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
>> 2013-07-21 12:23:43,062 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2013-07-21 12:23:43,093 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>> 2013-07-21 12:23:43,234 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:43,250 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
>> 2013-07-21 12:23:43,250 WARN  mapred.LocalJobRunner - job_local1378002997_0002
>> java.lang.NullPointerException
>>     at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>>     at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>
>> I don't know whether this is the right direction to continue in, but
>> hopefully my experience can help others.
>>
>>
>> Regards,
>> Rui
>>
>>
>>
>>
>>
>>
>> At 2013-07-20 23:07:41,"Rui Gao" <ga...@163.com> wrote:
>>>Hi Lewis,
>>>
>>>Thanks for your answer.
>>>So, what direction will Nutch take? Will it work with relational
>>>databases, or only with non-relational stores like HBase?
>>>I remember that when 2.2.1 was released, the release notes said some bugs
>>>related to MySQL had been fixed. That's why I am trying to integrate it
>>>with MySQL or HSQL. Also, the wiki has a link describing how to integrate
>>>Nutch with MySQL: http://nlp.solutions.asia/?p=362
>>>
>>>Do you have any suggestion?
>>>
>>>Thanks.
>>>
>>>Best Regards,
>>>Rui
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com>
>wrote:
>>>>Hi Rui,
>>>>This should not work.
>>>>The SqlStore module and support for it is now deprecated within Apache
>Gora.
>>>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>>>Gora artifacts but this is not recommended.
>>>>Thanks
>>>>Lewis
>>>>
>>>>
>>>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have set up the Eclipse environment according to the wiki. Here is what
>>>>> I did before running the inject job:
>>>>> 1. I use SqlStore as storage class
>>>>> 2. I started HSql database which contains the table 'webpage'.
>>>>> 3. I added 1 URL in seed.txt.
>>>>> Then I ran the inject job. It seems the job finished successfully, but
>>>>> no change was made to my HSQL database. Any thoughts about this?
>>>>> Here's the log:
>>>>> InjectorJob: starting at 2013-07-07 15:28:42
>>>>> InjectorJob: Injecting urlDir: urls/dev
>>>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>>>> storage class.
>>>>> InjectorJob: total number of urls rejected by filters: 0
>>>>> InjectorJob: total number of urls injected after normalization and
>>>>> filtering: 1
>>>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>>>
>>>>> Best Regards,
>>>>> Rui
>>>>>
>>>>
>>>>
>>>>
>>>>--
>>>>*Lewis*
>>
>
>-- 
>*Lewis*

Re: [2.2.1] What does inject job do?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Please read the exception trace. Are you running on Hadoop? You need to
ensure that your plugin.folders property points to the right path. There is also
a mention of a missing job file. Please ensure that your Nutch job file is
on the Hadoop JobTracker classpath.
hth

On Saturday, July 20, 2013, Rui Gao <ga...@163.com> wrote:
> Hi Lewis,
>
> I downgraded gora-core to 0.2.1, after which I could run InjectorJob
> with both HSQL and MySQL. But the Crawler job still fails. Here's the log:
> 2013-07-21 12:23:41,156 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
> 2013-07-21 12:23:41,203 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-07-21 12:23:41,234 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-07-21 12:23:41,265 WARN  snappy.LoadSnappy - Snappy native library not loaded
> 2013-07-21 12:23:41,578 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> 2013-07-21 12:23:41,718 WARN  plugin.PluginRepository - Plugins: directory not found: ./plugins
> 2013-07-21 12:23:41,765 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2013-07-21 12:23:41,937 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
> 2013-07-21 12:23:42,468 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2013-07-21 12:23:42,593 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-07-21 12:23:42,796 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> 2013-07-21 12:23:43,062 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2013-07-21 12:23:43,093 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2013-07-21 12:23:43,234 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> 2013-07-21 12:23:43,250 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-07-21 12:23:43,250 WARN  mapred.LocalJobRunner - job_local1378002997_0002
> java.lang.NullPointerException
>     at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>     at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I don't know whether this is the right direction to continue in, but hopefully my experience can help others.
>
>
> Regards,
> Rui
>
>
>
>
>
>
> At 2013-07-20 23:07:41,"Rui Gao" <ga...@163.com> wrote:
>>Hi Lewis,
>>
>>Thanks for your answer.
>>So, what direction will Nutch go in? Will it work with relational databases, or only with non-relational databases like HBase?
>>I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I tried to integrate it with MySQL or HSQL. Also, the wiki links to a page about how to integrate Nutch with MySQL: http://nlp.solutions.asia/?p=362
>>
>>Do you have any suggestions?
>>
>>Thanks.
>>
>>Best Regards,
>>Rui
>>
>>
>>
>>
>>
>>
>>
>>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>>>Hi Rui,
>>>This should not work.
>>>The SqlStore module and support for it is now deprecated within Apache Gora.
>>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>>Gora artifacts but this is not recommended.
>>>Thanks
>>>Lewis
>>>
>>>
>>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have set up the Eclipse environment according to the wiki. Here is
>>>> what I did before running the inject job:
>>>> 1. I use SqlStore as the storage class.
>>>> 2. I started an HSQL database which contains the table 'webpage'.
>>>> 3. I added 1 URL to seed.txt.
>>>> Then I ran the inject job. It seems the job finished successfully, but
>>>> no change was made to my HSQL database. Any thoughts about this?
>>>> Here's the log:
>>>> InjectorJob: starting at 2013-07-07 15:28:42
>>>> InjectorJob: Injecting urlDir: urls/dev
>>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>>> storage class.
>>>> InjectorJob: total number of urls rejected by filters: 0
>>>> InjectorJob: total number of urls injected after normalization and
>>>> filtering: 1
>>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>>
>>>> Best Regards,
>>>> Rui
>>>>
>>>
>>>
>>>
>>>--
>>>*Lewis*
>

-- 
*Lewis*

Re:Re:Re: [2.2.1] What does inject job do?

Posted by Rui Gao <ga...@163.com>.

Hi Lewis,

I tried downgrading gora-core to 0.2.1. After that, I could run InjectorJob with both HSQL and MySQL, but the Crawler job still fails. Here's the log:
2013-07-21 12:23:41,156 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
2013-07-21 12:23:41,203 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-07-21 12:23:41,234 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:23:41,265 WARN  snappy.LoadSnappy - Snappy native library not loaded
2013-07-21 12:23:41,578 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-21 12:23:41,718 WARN  plugin.PluginRepository - Plugins: directory not found: ./plugins
2013-07-21 12:23:41,765 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-07-21 12:23:41,937 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
2013-07-21 12:23:42,468 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-21 12:23:42,593 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:23:42,796 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-07-21 12:23:43,062 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-21 12:23:43,093 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-07-21 12:23:43,234 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-21 12:23:43,250 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 12:23:43,250 WARN  mapred.LocalJobRunner - job_local1378002997_0002
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
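
The trace above shows a NullPointerException raised inside the Utf8 constructor, called from GeneratorReducer.setup. The thread does not say which value was null. One plausible reading, offered only as an assumption, is that a setting the reducer expected was never set. The sketch below is not Nutch or Avro code: the class, the key name and the `encode` stand-in are all hypothetical. It only illustrates how an unset value turns into a bare NullPointerException, and how a guard can name the missing setting instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustration only; the names below are hypothetical, not from Nutch or Avro.
class NullConfigSketch {

    // Hypothetical key, used purely for the example.
    static final String EXAMPLE_KEY = "example.batch.id";

    /** Stand-in for a constructor that dereferences its argument. */
    static byte[] encode(String value) {
        // Throws NullPointerException when value is null.
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** Reads a required setting and fails with a message naming the key. */
    static String require(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new IllegalStateException("Missing required setting: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        // An empty map plays the role of a job configuration with the key unset.
        Map<String, String> conf = new HashMap<>();

        try {
            encode(conf.get(EXAMPLE_KEY));
        } catch (NullPointerException e) {
            System.out.println("Unguarded read: NullPointerException");
        }

        try {
            encode(require(conf, EXAMPLE_KEY));
        } catch (IllegalStateException e) {
            System.out.println("Guarded read: " + e.getMessage());
        }
    }
}
```

The guarded variant does not fix anything by itself. It only makes the failure say which setting was absent, which is the information the trace above lacks.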

I don't know whether this is the right direction to continue in, but hopefully my experience can help others.


Regards,
Rui






At 2013-07-20 23:07:41,"Rui Gao" <ga...@163.com> wrote:
>Hi Lewis,
>
>Thanks for your answer.
>So, what direction will Nutch go in? Will it work with relational databases, or only with non-relational databases like HBase?
>I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I tried to integrate it with MySQL or HSQL. Also, the wiki links to a page about how to integrate Nutch with MySQL: http://nlp.solutions.asia/?p=362
>
>Do you have any suggestions?
>
>Thanks.
>
>Best Regards,
>Rui
>
>
>
>
>
>
>
>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>>Hi Rui,
>>This should not work.
>>The SqlStore module and support for it is now deprecated within Apache Gora.
>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>Gora artifacts but this is not recommended.
>>Thanks
>>Lewis
>>
>>
>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>>
>>> Hello,
>>>
>>> I have set up the Eclipse environment according to the wiki. Here is
>>> what I did before running the inject job:
>>> 1. I use SqlStore as the storage class.
>>> 2. I started an HSQL database which contains the table 'webpage'.
>>> 3. I added 1 URL to seed.txt.
>>> Then I ran the inject job. It seems the job finished successfully, but
>>> no change was made to my HSQL database. Any thoughts about this?
>>> Here's the log:
>>> InjectorJob: starting at 2013-07-07 15:28:42
>>> InjectorJob: Injecting urlDir: urls/dev
>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>> storage class.
>>> InjectorJob: total number of urls rejected by filters: 0
>>> InjectorJob: total number of urls injected after normalization and
>>> filtering: 1
>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>
>>> Best Regards,
>>> Rui
>>>
>>
>>
>>
>>-- 
>>*Lewis*

Re: [2.2.1] What does inject job do?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Rui,

On Saturday, July 20, 2013, Rui Gao <ga...@163.com> wrote:
> So, what direction will Nutch go in? Will it work with relational databases, or only with non-relational databases like HBase?

This has nothing to do with Nutch. It has everything to do with Apache Gora
and we are moving the Gora framework away from relational models towards
NoSQL.

> I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I tried to integrate it with MySQL or HSQL. Also, the wiki links to a page about how to integrate Nutch with MySQL: http://nlp.solutions.asia/?p=362
>

This may be the case. However, there was also an upgrade of the Gora
dependency, which deprecated the SQL DataStore. To see the order in which
issues were addressed, refer to CHANGES.txt in your current version of
Nutch.
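
One way to follow this advice is to scan the changelog for entries that mention a keyword. This is only a rough sketch under stated assumptions: the class name is hypothetical, the file is assumed to be a plain-text CHANGES.txt in the working directory, and "gora" is just an example keyword.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Nutch.
class ChangesScan {

    /** Returns "lineNumber: text" for every line containing the keyword, ignoring case. */
    static List<String> matching(List<String> lines, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add((i + 1) + ": " + line); // 1-based line numbers
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults; pass a different path or keyword as arguments.
        String file = args.length > 0 ? args[0] : "CHANGES.txt";
        String keyword = args.length > 1 ? args[1] : "gora";
        List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8);
        for (String hit : matching(lines, keyword)) {
            System.out.println(hit);
        }
    }
}
```

The line numbers in the output give the relative position of each entry, so a Gora upgrade and a MySQL-related fix can be placed in order within the file.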

thanks
Lewis

-- 
*Lewis*

Re:Re: [2.2.1] What does inject job do?

Posted by Rui Gao <ga...@163.com>.

Hi Lewis,

Thanks for your answer.
So, what direction will Nutch go in? Will it work with relational databases, or only with non-relational databases like HBase?
I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I tried to integrate it with MySQL or HSQL. Also, the wiki links to a page about how to integrate Nutch with MySQL: http://nlp.solutions.asia/?p=362

Do you have any suggestions?

Thanks.

Best Regards,
Rui







At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>Hi Rui,
>This should not work.
>The SqlStore module and support for it is now deprecated within Apache Gora.
>If you would like to downgrade to use Nutch 2.1, then you can use older
>Gora artifacts but this is not recommended.
>Thanks
>Lewis
>
>
>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>
>> Hello,
>>
>> I have set up the Eclipse environment according to the wiki. Here is
>> what I did before running the inject job:
>> 1. I use SqlStore as the storage class.
>> 2. I started an HSQL database which contains the table 'webpage'.
>> 3. I added 1 URL to seed.txt.
>> Then I ran the inject job. It seems the job finished successfully, but
>> no change was made to my HSQL database. Any thoughts about this?
>> Here's the log:
>> InjectorJob: starting at 2013-07-07 15:28:42
>> InjectorJob: Injecting urlDir: urls/dev
>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>> storage class.
>> InjectorJob: total number of urls rejected by filters: 0
>> InjectorJob: total number of urls injected after normalization and
>> filtering: 1
>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>
>> Best Regards,
>> Rui
>>
>
>
>
>-- 
>*Lewis*

Re: [2.2.1] What does inject job do?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Rui,
This should not work.
The SqlStore module and support for it is now deprecated within Apache Gora.
If you would like to downgrade to use Nutch 2.1, then you can use older
Gora artifacts but this is not recommended.
Thanks
Lewis


On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:

> Hello,
>
> I have set up the Eclipse environment according to the wiki. Here is
> what I did before running the inject job:
> 1. I use SqlStore as the storage class.
> 2. I started an HSQL database which contains the table 'webpage'.
> 3. I added 1 URL to seed.txt.
> Then I ran the inject job. It seems the job finished successfully, but
> no change was made to my HSQL database. Any thoughts about this?
> Here's the log:
> InjectorJob: starting at 2013-07-07 15:28:42
> InjectorJob: Injecting urlDir: urls/dev
> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
> storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 1
> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>
> Best Regards,
> Rui
>



-- 
*Lewis*