Posted to user@nutch.apache.org by Weilei Zhang <zh...@gmail.com> on 2013/01/30 20:52:44 UTC

GeneratorJob and InjectorJob questions in Nutch 2.x

Hi,
I am trying to use Nutch 2.x and have a question about the Generator
and the Injector.
I have only one link as the root of the crawl. By instrumenting the
code, I can see that this one link is written to the Context in the
last step of InjectorJob, and that it is also the only link written to
the Context by GeneratorJob. However, I saw many links passed to the map
function in the first steps of GeneratorJob (I instrumented the setup
function). Those links seem to include all URLs referenced from the
original link. My question is: where does fetch/parse happen? From the
Crawler code, it looks to me as if the Injector is immediately
followed by Generator; I tried to scrub the code down to do the job
but failed.

I ran crawl in the following way:
>/nutch  crawl urlsDir

There is only one link in a file under urlsDir.
>cat urlsDir/*
http://www.bmw.com

The following is an excerpt from the instrumentation output of the
Generator map function. These are reversed URLs.
al.com.bmw.www:http/
al.com.bmw.www:http/al/en
am.bmw.www:http/
am.bmw.www:http/am/en
ao.co.bmw:http/
ao.co.bmw:http/ao/pt
ar.com.bmw.www:http/
ar.com.bmw.www:http/ar/es/
at.bmw.www:http/
at.bmw.www:http/at/de/general/configurations_center/configure.html
at.bmw.www:http/de/index.html
at.bmw.www:http/de/topics/services-angebote/connecteddrivedienste/connecteddrive-antrag/ueberblick.html
au.com.bmw.www:http/
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/compare.html
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/configurator.html
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/driveawayprice.html
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/financecalculator.html
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/highlights/
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/introduction.html
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requestebrochure.html
au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requesttestdrive.html
au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/compare.html
au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/configurator.html
au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/driveawayprice.html
au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/financecalculator.html
au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/highlights/
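
For readers unfamiliar with the key format in the listing above, here is a
minimal sketch of how a URL maps to such a reversed key: host labels in
reverse order, then ":" and the protocol, then the path. The class and
method names (ReversedUrlSketch, reverse) are made up for illustration.
This is an approximation inferred from the output above, not Nutch's own
code, and it assumes the URL is already normalized.

```java
import java.net.URI;

final class ReversedUrlSketch {
    // Builds a key such as "au.com.bmw.www:http/" from "http://www.bmw.com.au/".
    static String reverse(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no parsable host: " + url);
        }
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        // Append the host labels from last to first.
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        // Protocol, then the port only when one is given explicitly.
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        // Path and query are kept in their original order.
        String path = uri.getRawPath();
        if (path != null) {
            key.append(path);
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("http://www.bmw.com.au/"));          // au.com.bmw.www:http/
        System.out.println(reverse("http://www.bmw.at/de/index.html")); // at.bmw.www:http/de/index.html
    }
}
```

Keeping the host reversed means that all rows of one domain sort next to
each other, which is why the listing is grouped by country domain.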


Thanks for any hints!
-- 
Best Regards
-Weilei

Re: GeneratorJob and InjectorJob questions in Nutch 2.x

Posted by kiran chitturi <ch...@gmail.com>.
Yes.

I have noticed that when I start a new crawl while records from an
earlier crawl are still present in the database, the crawl does not go
as expected.

I generally drop the table in HBase and run the crawl again.
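
A toy illustration of why leftover records matter, assuming the store
persists between runs and the generator's map function is handed every row
in it, which is what the instrumentation output in this thread suggests.
The in-memory map and the names StaleStoreSketch, inject and
generatorMapInput are hypothetical stand-ins, not Nutch or HBase APIs, and
the status strings are invented. Calling clear() stands in for dropping the
table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StaleStoreSketch {
    // Inject adds the seed only if no row exists for it yet.
    static void inject(Map<String, String> store, String seedKey) {
        store.putIfAbsent(seedKey, "injected");
    }

    // The map phase sees every row in the store, not just the new seed.
    static List<String> generatorMapInput(Map<String, String> store) {
        return new ArrayList<>(store.keySet());
    }

    public static void main(String[] args) {
        Map<String, String> store = new TreeMap<>();

        // Rows left behind by an earlier crawl (keys taken from the thread).
        store.put("at.bmw.www:http/de/index.html", "fetched");
        store.put("au.com.bmw.www:http/", "unfetched");

        inject(store, "com.bmw.www:http/");
        System.out.println("with stale rows: " + generatorMapInput(store));

        // Reset the store, then inject again for a fresh crawl.
        store.clear();
        inject(store, "com.bmw.www:http/");
        System.out.println("after reset:     " + generatorMapInput(store));
    }
}
```

With the stale rows present the map input has three keys; after the reset
it has only the injected seed.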

Also, please use the crawl script instead of the nutch script to start crawls [0].


[0] https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554862#comment-13554862



On Wed, Jan 30, 2013 at 3:03 PM, Weilei Zhang <zh...@gmail.com> wrote:

> It seems that I understand this problem now: this comes from the prior
> fetch(es).
> I need to find some way to reset the database if I want to execute a
> fresh crawl, right?
> Sorry if this is too basic a question. This is only my 4th day into
> Nutch/Hadoop/Hbase though I have been a Java programmer for a while.
> Thanks
> -Weilei



-- 
Kiran Chitturi

Re: GeneratorJob and InjectorJob questions in Nutch 2.x

Posted by Weilei Zhang <zh...@gmail.com>.
I think I understand the problem now: the extra links come from the prior
fetch(es).
I need to find some way to reset the database if I want to run a
fresh crawl, right?
Sorry if this is too basic a question. This is only my 4th day with
Nutch/Hadoop/HBase, though I have been a Java programmer for a while.
Thanks
-Weilei





-- 
Best Regards
-Weilei

Re: GeneratorJob and InjectorJob questions in Nutch 2.x

Posted by Weilei Zhang <zh...@gmail.com>.
Indeed, Kiran. Sorry that my prior email and yours crossed with each other.
Thanks for the help!



On Wed, Jan 30, 2013 at 12:02 PM, kiran chitturi
<ch...@gmail.com> wrote:
> The steps occur in this order
>
> 1) Inject
> 2) Generate
> 3) Fetcher
> 4) Parse
> 5) dbUpdate
>
> I would suggest you to clean the database and start again. Please let me
> know your results.



-- 
Best Regards
-Weilei

Re: GeneratorJob and InjectorJob questions in Nutch 2.x

Posted by kiran chitturi <ch...@gmail.com>.
The steps occur in this order:

1) Inject
2) Generate
3) Fetch
4) Parse
5) dbUpdate

I would suggest that you clean the database and start again. Please let me
know your results.
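
The five steps can be sketched as one round over a shared store. This is a
toy model, not Nutch's implementation: the Row fields, the fake "web" map
and all method names are assumptions made for illustration, and real
generation also involves scoring, fetch intervals and limits that are left
out here. It shows where the extra URLs come from: parse extracts outlinks
and dbUpdate writes them back as new rows, so the next generate step sees
them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CrawlCycleSketch {
    /** Hypothetical row of the web table. */
    static final class Row {
        boolean fetched;
        List<String> content = new ArrayList<>();   // links in the raw page
        List<String> outlinks = new ArrayList<>();  // links extracted by parse
    }

    // 1) Inject: add the seed unless a row for it already exists.
    static void inject(Map<String, Row> store, String seed) {
        store.putIfAbsent(seed, new Row());
    }

    // 2) Generate: scan every row and select those not fetched yet.
    static List<String> generate(Map<String, Row> store) {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, Row> e : store.entrySet()) {
            if (!e.getValue().fetched) {
                batch.add(e.getKey());
            }
        }
        return batch;
    }

    // 3) Fetch: "download" each selected page from the fake web.
    static void fetch(Map<String, Row> store, List<String> batch,
                      Map<String, List<String>> web) {
        for (String url : batch) {
            Row row = store.get(url);
            row.fetched = true;
            row.content = new ArrayList<>(web.getOrDefault(url, Collections.emptyList()));
        }
    }

    // 4) Parse: extract outlinks from the fetched content.
    static void parse(Map<String, Row> store, List<String> batch) {
        for (String url : batch) {
            Row row = store.get(url);
            row.outlinks = new ArrayList<>(row.content);
        }
    }

    // 5) dbUpdate: write newly discovered outlinks back as new rows.
    static void dbUpdate(Map<String, Row> store, List<String> batch) {
        for (String url : batch) {
            for (String out : store.get(url).outlinks) {
                store.putIfAbsent(out, new Row());
            }
        }
    }

    // Steps 2 to 5 in order; returns the batch that was generated.
    static List<String> runRound(Map<String, Row> store, Map<String, List<String>> web) {
        List<String> batch = generate(store);
        fetch(store, batch, web);
        parse(store, batch);
        dbUpdate(store, batch);
        return batch;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new TreeMap<>();
        web.put("http://www.bmw.com/",
                Arrays.asList("http://www.bmw.at/", "http://www.bmw.com.au/"));

        Map<String, Row> store = new TreeMap<>();
        inject(store, "http://www.bmw.com/");
        System.out.println("round 1 batch: " + runRound(store, web));
        System.out.println("round 2 batch: " + generate(store));
    }
}
```

The first round selects only the seed; the second generate step selects the
two outlinks that dbUpdate added, even though only one URL was ever
injected.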






-- 
Kiran Chitturi