Posted to user@nutch.apache.org by Rui Gao <ga...@163.com> on 2013/07/07 09:36:51 UTC
[2.2.1] What does inject job do?
Hello,
I have set up the Eclipse environment according to the wiki. Here is what I did before running the inject job:
1. I use SqlStore as the storage class.
2. I started an HSQL database which contains the table 'webpage'.
3. I added 1 URL to seed.txt.
Then I ran the inject job. It seems to finish successfully, but no change is made to my HSQL database. Any thoughts on this? Here's the log:
InjectorJob: starting at 2013-07-07 15:28:42
InjectorJob: Injecting urlDir: urls/dev
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
Best Regards,
Rui
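The counters in the log above are what the job reports about itself; on their own they do not show what reached the store. A minimal sketch, assuming the job output has been captured as text, that pulls the two counters out so they can be compared with a row count taken separately from the 'webpage' table (the function name injector_counts is made up for illustration and is not part of Nutch):

```python
import re

# Patterns copied from the InjectorJob log lines quoted above.
REJECTED = re.compile(r"total number of urls rejected by filters: (\d+)")
INJECTED = re.compile(
    r"total number of urls injected after normalization and filtering: (\d+)"
)


def injector_counts(log_text):
    """Return (rejected, injected); a counter that is absent comes back as None."""
    rejected = REJECTED.search(log_text)
    injected = INJECTED.search(log_text)
    return (
        int(rejected.group(1)) if rejected else None,
        int(injected.group(1)) if injected else None,
    )


sample = """InjectorJob: starting at 2013-07-07 15:28:42
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02"""

print(injector_counts(sample))
```

If the injected counter is non-zero but the table is still empty, the mismatch points at the storage layer rather than at the seed list or the URL filters.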
Re:Re: Re: [2.2.1] What does inject job do?
Posted by Rui Gao <ga...@163.com>.
I will put my own plugin in the second plugin folder. This works fine in Nutch-2.1.
At 2013-07-22 04:08:57,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>On Sat, Jul 20, 2013 at 10:58 PM, Rui Gao <ga...@163.com> wrote:
>
>> I checked the DB, the URL is already in DB.
>> The plugin property is configured like this:
>> <property>
>> <name>plugin.folders</name>
>> <value>./src/plugin,./plugins</value>
>>
>
>Any reason at all that you have two directories listed? Do you have two
>plugin directories with compiled classes in there?
Re: Re: [2.2.1] What does inject job do?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
On Sat, Jul 20, 2013 at 10:58 PM, Rui Gao <ga...@163.com> wrote:
> I checked the DB, the URL is already in DB.
> The plugin property is configured like this:
> <property>
> <name>plugin.folders</name>
> <value>./src/plugin,./plugins</value>
>
Any reason at all that you have two directories listed? Do you have two
plugin directories with compiled classes in there?
Re:Re:Re: [2.2.1] What does inject job do?
Posted by Rui Gao <ga...@163.com>.
I found an error when using HBase 0.90.x. Here's the log:
2013-07-21 14:51:29,500 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 14:51:29,500 WARN mapred.LocalJobRunner - job_local196483647_0002
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
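When reading a trace like this, the first frame that belongs to Nutch itself is usually the place to start. A small sketch, assuming the log has been saved as text, that extracts the exception type and the frames and then picks out the first org.apache.nutch frame (the helper names parse_trace and first_frame_with_prefix are illustrative only):

```python
import re

# An ordinary log record: "2013-07-21 14:51:29,500 WARN mapred.LocalJobRunner - ..."
LOG_RECORD = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \w+ \S+ - ")
# An exception header such as "java.lang.NullPointerException"
EXCEPTION = re.compile(r"^([\w.$]+(?:Exception|Error))\b")
# A stack frame such as "at org.apache.avro.util.Utf8.<init>(Utf8.java:37)"
FRAME = re.compile(r"^at ([\w.$<>]+)\(")


def parse_trace(log_text):
    """Return (exception_type, frames) with frames listed top frame first."""
    exception, frames = None, []
    for raw in log_text.splitlines():
        line = raw.strip()
        if LOG_RECORD.match(line):
            continue  # regular log record, not part of a trace
        frame = FRAME.match(line)
        if frame:
            frames.append(frame.group(1))
            continue
        found = EXCEPTION.match(line)
        if found and exception is None:
            exception = found.group(1)
    return exception, frames


def first_frame_with_prefix(frames, prefix):
    """Return the first frame whose qualified name starts with prefix, else None."""
    return next((f for f in frames if f.startswith(prefix)), None)


trace = """2013-07-21 14:51:29,500 WARN mapred.LocalJobRunner - job_local196483647_0002
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)"""

exception, frames = parse_trace(trace)
print(exception, first_frame_with_prefix(frames, "org.apache.nutch."))
```

For the trace above this singles out GeneratorReducer.setup, which suggests a null string reached the Utf8 constructor during reducer setup; what that string was meant to be is not shown in the log.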
At 2013-07-21 13:58:33,"Rui Gao" <ga...@163.com> wrote:
>I checked the DB, the URL is already in DB.
>The plugin property is configured like this:
><property>
> <name>plugin.folders</name>
> <value>./src/plugin,./plugins</value>
> <description>Directories where nutch plugins are located. Each
> element may be a relative or absolute path. If absolute, it is used
> as is. If relative, it is searched for on the classpath.</description>
></property>
>
>I guess the plugin property is configured properly. Because when I change it to other value, it complains plugins could not be found.
Re:Re: [2.2.1] What does inject job do?
Posted by Rui Gao <ga...@163.com>.
I checked the DB; the URL is already in the DB.
The plugin property is configured like this:
<property>
<name>plugin.folders</name>
<value>./src/plugin,./plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
I guess the plugin property is configured properly, because when I change it to another value, it complains that the plugins could not be found.
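A quick way to see which plugin.folders entries actually exist is to split the value and test each entry. This is a rough sketch under a stated simplification: the property description says relative entries are searched for on the classpath, whereas the code below resolves them against a base directory you pass in, so it only approximates what Nutch does. The function names and the temporary project layout are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def plugin_folders(property_xml):
    """Split the comma-separated <value> of a plugin.folders <property> element."""
    value = ET.fromstring(property_xml).findtext("value", default="")
    return [part.strip() for part in value.split(",") if part.strip()]


def missing_folders(folders, base_dir):
    """Return entries that are not directories; relative ones resolve against base_dir."""
    missing = []
    for folder in folders:
        path = folder if os.path.isabs(folder) else os.path.join(base_dir, folder)
        if not os.path.isdir(path):
            missing.append(folder)
    return missing


prop = """<property>
  <name>plugin.folders</name>
  <value>./src/plugin,./plugins</value>
</property>"""

# Hypothetical layout for the demo: only ./src/plugin exists under the base directory.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "src", "plugin"))
    print(missing_folders(plugin_folders(prop), base))
```

An entry reported as missing here lines up with the "Plugins: directory not found: ./plugins" warning in the logs.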
At 2013-07-21 13:48:33,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>Yes, the warnings you've shown now are fine; they come from the old mapred API.
>It's OK.
>It now states that you've got 1 URL injected. Can you check the DB?
>Either check the contents directly or dump/read them with the readdb tool.
>Please remember that, somewhere in the tutorial you reference, the absolute
>path to the plugins folder needs to be changed. This is your problem here.
>InjectorJob doesn't require plugins to work; however, when your indexing
>plugins are called you will be in trouble. You need to sort this out.
Re: [2.2.1] What does inject job do?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Yes, the warnings you've shown now are fine; they come from the old mapred API.
It's OK.
It now states that you've got 1 URL injected. Can you check the DB?
Either check the contents directly or dump/read them with the readdb tool.
Please remember that, somewhere in the tutorial you reference, the absolute
path to the plugins folder needs to be changed. This is your problem here.
InjectorJob doesn't require plugins to work; however, when your indexing
plugins are called you will be in trouble. You need to sort this out.
--
*Lewis*
Re:Re: [2.2.1] What does inject job do?
Posted by Rui Gao <ga...@163.com>.
I am following this article: http://wiki.apache.org/nutch/RunNutchInEclipse. My environment is Windows XP + Cygwin + Eclipse.
I think the first several WARN entries in the log are not the blocker. (plugin.folders contained an additional folder; after I removed it, the job still fails.) Compare with the log from an InjectorJob run that succeeds:
2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: starting at 2013-07-21 12:45:01
2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls/dev
2013-07-21 12:45:02,921 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
2013-07-21 12:45:02,968 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-07-21 12:45:02,984 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:45:03,015 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-07-21 12:45:03,328 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-21 12:45:03,437 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
2013-07-21 12:45:03,484 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-07-21 12:45:03,625 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
2013-07-21 12:45:04,218 INFO crawl.InjectorJob - Injector: finished at 2013-07-21 12:45:04, elapsed: 00:00:02
At 2013-07-21 12:36:29,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>Please read the exception trace. Are you running on Hadoop? You need to
>ensure that your plugin.folders property points to the right path. There is also
>a mention of a missing job file. Please ensure that your Nutch job file is
>on the Hadoop JobTracker classpath.
>hth
>
>On Saturday, July 20, 2013, Rui Gao <ga...@163.com> wrote:
>> Hi Lewis,
>>
>> I tried downgrading gora-core to 0.2.1. After that, I could run InjectorJob
>> with both HSQL and MySQL, but the Crawler job still fails. Here's the log:
>> 2013-07-21 12:23:41,156 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>> 2013-07-21 12:23:41,203 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2013-07-21 12:23:41,234 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 2013-07-21 12:23:41,265 WARN snappy.LoadSnappy - Snappy native library not loaded
>> 2013-07-21 12:23:41,578 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:41,718 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
>> 2013-07-21 12:23:41,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>> 2013-07-21 12:23:41,937 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
>> 2013-07-21 12:23:42,468 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2013-07-21 12:23:42,593 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 2013-07-21 12:23:42,796 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
>> 2013-07-21 12:23:43,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2013-07-21 12:23:43,093 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>> 2013-07-21 12:23:43,234 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:43,250 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>> 2013-07-21 12:23:43,250 WARN mapred.LocalJobRunner - job_local1378002997_0002
>> java.lang.NullPointerException
>> at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>> at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>
>> I don't know if this is the right direction to continue in. But
>> anyway, hopefully my experience can help others.
>>
>>
>> Regards,
>> Rui
>>
>> At 2013-07-20 23:07:41,"Rui Gao" <ga...@163.com> wrote:
>>>Hi Lewis,
>>>
>>>Thanks for your answer.
>>>So, what direction will Nutch take? Will it work with relational
>>>databases, or only with non-relational databases like HBase?
>>>I remember that when 2.2.1 was released, I checked the release notes, which
>>>said some bugs related to MySQL had been fixed. That's why I am trying to
>>>integrate it with MySQL or HSQL. Also, the wiki has a link
>>>describing how to integrate Nutch with MySQL:
>>>http://nlp.solutions.asia/?p=362
>>>
>>>Do you have any suggestions?
>>>
>>>Thanks.
>>>
>>>Best Regards,
>>>Rui
>>>
>>>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>>>>Hi Rui,
>>>>This should not work.
>>>>The SqlStore module, and support for it, is now deprecated within Apache Gora.
>>>>If you would like to downgrade to Nutch 2.1, you can use older
>>>>Gora artifacts, but this is not recommended.
>>>>Thanks
>>>>Lewis
>>>>
>>>>
>>>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have set up the Eclipse environment according to the wiki. Here is
>>>>> what I did before running the inject job:
>>>>> 1. I use SqlStore as the storage class.
>>>>> 2. I started an HSQL database which contains the table 'webpage'.
>>>>> 3. I added 1 URL to seed.txt.
>>>>> Then I ran the inject job. It seems to finish successfully, but
>>>>> no change is made to my HSQL database. Any thoughts on this?
>>>>> Here's the log:
>>>>> InjectorJob: starting at 2013-07-07 15:28:42
>>>>> InjectorJob: Injecting urlDir: urls/dev
>>>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>>>> storage class.
>>>>> InjectorJob: total number of urls rejected by filters: 0
>>>>> InjectorJob: total number of urls injected after normalization and
>>>>> filtering: 1
>>>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>>>
>>>>> Best Regards,
>>>>> Rui
>>>>>
>>>>
>>>>
>>>>
>>>>--
>>>>*Lewis*
>>
>
>--
>*Lewis*
Re: [2.2.1] What does inject job do?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please read the exception trace. Are you running on Hadoop? You need to
ensure that your plugin.folders property points to the right path. There is also
a mention of a missing job file. Please ensure that your Nutch job file is
on the Hadoop JobTracker classpath.
hth
On Saturday, July 20, 2013, Rui Gao <ga...@163.com> wrote:
> Hi Lewis,
>
> I tried downgrading gora-core to 0.2.1. After that, I could run InjectorJob
> with both HSQL and MySQL, but the Crawler job still fails. Here's the log:
> 2013-07-21 12:23:41,156 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
> 2013-07-21 12:23:41,203 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-07-21 12:23:41,234 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-07-21 12:23:41,265 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2013-07-21 12:23:41,578 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> 2013-07-21 12:23:41,718 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
> 2013-07-21 12:23:41,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2013-07-21 12:23:41,937 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
> 2013-07-21 12:23:42,468 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2013-07-21 12:23:42,593 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-07-21 12:23:42,796 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> 2013-07-21 12:23:43,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2013-07-21 12:23:43,093 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2013-07-21 12:23:43,234 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> 2013-07-21 12:23:43,250 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-07-21 12:23:43,250 WARN mapred.LocalJobRunner - job_local1378002997_0002
> java.lang.NullPointerException
> at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
> at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I don't know if this is the right direction to continue in, but hopefully my experience can help others.
>
>
> Regards,
> Rui
>
>
>
>
>
>
> At 2013-07-20 23:07:41,"Rui Gao" <ga...@163.com> wrote:
>>Hi Lewis,
>>
>>Thanks for your answer.
>>So, what direction will Nutch go? Will it work with relational databases, or only with non-relational databases like HBase?
>>I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I am trying to integrate it with MySQL or HSQL. Also, the wiki links to a page about integrating Nutch with MySQL: http://nlp.solutions.asia/?p=362
>>
>>Do you have any suggestions?
>>
>>Thanks.
>>
>>Best Regards,
>>Rui
>>
>>
>>
>>
>>
>>
>>
>>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>>>Hi Rui,
>>>This should not work.
>>>The SqlStore module and support for it is now deprecated within Apache Gora.
>>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>>Gora artifacts but this is not recommended.
>>>Thanks
>>>Lewis
>>>
>>>
>>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have set up the Eclipse environment according to the wiki. Here are some
>>>> things I did before I ran the inject job:
>>>> 1. I use SqlStore as the storage class.
>>>> 2. I started an HSql database which contains the table 'webpage'.
>>>> 3. I added 1 URL in seed.txt.
>>>> Then I ran the inject job. It seems the job finished successfully, but
>>>> no change was made to my HSql database. Any thoughts about this?
>>>> Here's the log:
>>>> InjectorJob: starting at 2013-07-07 15:28:42
>>>> InjectorJob: Injecting urlDir: urls/dev
>>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>>> storage class.
>>>> InjectorJob: total number of urls rejected by filters: 0
>>>> InjectorJob: total number of urls injected after normalization and
>>>> filtering: 1
>>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>>
>>>> Best Regards,
>>>> Rui
>>>>
>>>
>>>
>>>
>>>--
>>>*Lewis*
>
--
*Lewis*
Re:Re:Re: [2.2.1] What does inject job do?
Posted by Rui Gao <ga...@163.com>.
Hi Lewis,
I tried downgrading gora-core to 0.2.1. After that, I could run InjectorJob with both HSQL and MySQL, but the Crawler job still fails. Here's the log:
2013-07-21 12:23:41,156 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
2013-07-21 12:23:41,203 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-07-21 12:23:41,234 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:23:41,265 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-07-21 12:23:41,578 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-21 12:23:41,718 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
2013-07-21 12:23:41,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-07-21 12:23:41,937 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
2013-07-21 12:23:42,468 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-21 12:23:42,593 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:23:42,796 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-07-21 12:23:43,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-21 12:23:43,093 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-07-21 12:23:43,234 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-21 12:23:43,250 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 12:23:43,250 WARN mapred.LocalJobRunner - job_local1378002997_0002
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
I don't know if this is the right direction to continue in, but hopefully my experience can help others.
Regards,
Rui
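The trace above ends in the Utf8 constructor, called from GeneratorReducer.setup. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that a string read from the job configuration was null when it was wrapped. The sketch below only demonstrates that general pattern: java.util.Properties stands in for the Hadoop Configuration, the Wrapped class is a hypothetical stand-in for a wrapper that encodes its argument eagerly, and the key example.batch.id is invented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class NullConfigValueDemo {

    /** Hypothetical stand-in for a wrapper that encodes its argument eagerly. */
    static final class Wrapped {
        final byte[] bytes;

        Wrapped(String s) {
            // Throws NullPointerException when s is null.
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Wraps whatever the lookup returns, including null for an unset key. */
    static Wrapped unsafeLookup(Properties conf, String key) {
        return new Wrapped(conf.getProperty(key));
    }

    /** Fails early with a message that names the missing key. */
    static Wrapped checkedLookup(Properties conf, String key) {
        String value = conf.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("missing configuration value: " + key);
        }
        return new Wrapped(value);
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // the key is deliberately left unset
        try {
            unsafeLookup(conf, "example.batch.id");
        } catch (NullPointerException e) {
            System.out.println("unsafe lookup failed with NullPointerException");
        }
        try {
            checkedLookup(conf, "example.batch.id");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the real cause is of this kind, the checked variant shows why an explicit null check gives a more useful message than a bare NullPointerException inside a constructor.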
At 2013-07-20 23:07:41,"Rui Gao" <ga...@163.com> wrote:
>Hi Lewis,
>
>Thanks for your answer.
>So, what direction will Nutch go? Will it work with relational databases, or only with non-relational databases like HBase?
>I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I am trying to integrate it with MySQL or HSQL. Also, the wiki links to a page about integrating Nutch with MySQL: http://nlp.solutions.asia/?p=362
>
>Do you have any suggestions?
>
>Thanks.
>
>Best Regards,
>Rui
>
>
>
>
>
>
>
>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>>Hi Rui,
>>This should not work.
>>The SqlStore module and support for it is now deprecated within Apache Gora.
>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>Gora artifacts but this is not recommended.
>>Thanks
>>Lewis
>>
>>
>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>>
>>> Hello,
>>>
>>> I have set up the Eclipse environment according to the wiki. Here are some
>>> things I did before I ran the inject job:
>>> 1. I use SqlStore as the storage class.
>>> 2. I started an HSql database which contains the table 'webpage'.
>>> 3. I added 1 URL in seed.txt.
>>> Then I ran the inject job. It seems the job finished successfully, but
>>> no change was made to my HSql database. Any thoughts about this?
>>> Here's the log:
>>> InjectorJob: starting at 2013-07-07 15:28:42
>>> InjectorJob: Injecting urlDir: urls/dev
>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>> storage class.
>>> InjectorJob: total number of urls rejected by filters: 0
>>> InjectorJob: total number of urls injected after normalization and
>>> filtering: 1
>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>
>>> Best Regards,
>>> Rui
>>>
>>
>>
>>
>>--
>>*Lewis*
Re: [2.2.1] What does inject job do?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Rui,
On Saturday, July 20, 2013, Rui Gao <ga...@163.com> wrote:
> So, what direction will Nutch go? Will it work with relational databases, or only with non-relational databases like HBase?
This has nothing to do with Nutch. It has everything to do with Apache Gora
and we are moving the Gora framework away from relational models towards
NoSQL.
> I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I am trying to integrate it with MySQL or HSQL. Also, the wiki links to a page about integrating Nutch with MySQL: http://nlp.solutions.asia/?p=362
>
That may be the case; however, there was also an upgrade of the Gora
dependency, which deprecated the SqlStore. To see the order in which issues
were addressed, refer to CHANGES.TXT in your current version of Nutch.
thanks
Lewis
--
*Lewis*
Re:Re: [2.2.1] What does inject job do?
Posted by Rui Gao <ga...@163.com>.
Hi Lewis,
Thanks for your answer.
So, what direction will Nutch go? Will it work with relational databases, or only with non-relational databases like HBase?
I remember that when 2.2.1 was released, the release notes said some bugs related to MySQL had been fixed. That's why I am trying to integrate it with MySQL or HSQL. Also, the wiki links to a page about integrating Nutch with MySQL: http://nlp.solutions.asia/?p=362
Do you have any suggestions?
Thanks.
Best Regards,
Rui
At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>Hi Rui,
>This should not work.
>The SqlStore module and support for it is now deprecated within Apache Gora.
>If you would like to downgrade to use Nutch 2.1, then you can use older
>Gora artifacts but this is not recommended.
>Thanks
>Lewis
>
>
>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
>
>> Hello,
>>
>> I have set up the Eclipse environment according to the wiki. Here are some
>> things I did before I ran the inject job:
>> 1. I use SqlStore as the storage class.
>> 2. I started an HSql database which contains the table 'webpage'.
>> 3. I added 1 URL in seed.txt.
>> Then I ran the inject job. It seems the job finished successfully, but
>> no change was made to my HSql database. Any thoughts about this?
>> Here's the log:
>> InjectorJob: starting at 2013-07-07 15:28:42
>> InjectorJob: Injecting urlDir: urls/dev
>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>> storage class.
>> InjectorJob: total number of urls rejected by filters: 0
>> InjectorJob: total number of urls injected after normalization and
>> filtering: 1
>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>
>> Best Regards,
>> Rui
>>
>
>
>
>--
>*Lewis*
Re: [2.2.1] What does inject job do?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Rui,
This should not work.
The SqlStore module and support for it is now deprecated within Apache Gora.
If you would like to downgrade to use Nutch 2.1, then you can use older
Gora artifacts but this is not recommended.
Thanks
Lewis
On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <ga...@163.com> wrote:
> Hello,
>
> I have set up the Eclipse environment according to the wiki. Here are some
> things I did before I ran the inject job:
> 1. I use SqlStore as the storage class.
> 2. I started an HSql database which contains the table 'webpage'.
> 3. I added 1 URL in seed.txt.
> Then I ran the inject job. It seems the job finished successfully, but
> no change was made to my HSql database. Any thoughts about this?
> Here's the log:
> InjectorJob: starting at 2013-07-07 15:28:42
> InjectorJob: Injecting urlDir: urls/dev
> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
> storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 1
> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>
> Best Regards,
> Rui
>
--
*Lewis*