
Posted to dev@nutch.apache.org by Binoy d <bi...@gmail.com> on 2013/04/01 05:25:33 UTC

Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed.

Hi,

I have Nutch 2.x set up with MySQL and am seeing a peculiar null pointer
exception during a crawl with sample seeds from DMOZ. I decided to do a fresh
crawl with only one URL as the seed and an empty webpage table.
I am running *org.apache.nutch.crawl.Crawler* from Eclipse with args *urls
-dir /home/binoy/lab/dmoz/apache-url -solr http://localhost:8983/solr/
-depth 1 -topN 1*

The apache-url seed file has only one entry ("http://nutch.apache.org/").

I see the following null pointer exception (logs):
http://pastebin.com/CaqJpPkn

With a little debugging in Eclipse, I see that the line

        conf.set(GeneratorJob.BATCH_ID, batchId);

in the createIndexJob method of IndexerJob.java is the root cause.

Wrapping it in *if (batchId != null)* seems to solve the issue.
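
The guard described above can be sketched roughly as follows. This is only
an illustration, not the attached patch. It assumes the exception comes from
the configuration object storing its values in a java.util.Properties-style
table, which rejects null values. To stay runnable without Nutch or Hadoop on
the classpath, the sketch uses java.util.Properties directly as a stand-in for
the Hadoop Configuration. The class name BatchIdGuard, the method names, and
the key string are hypothetical.

```java
import java.util.Properties;

public class BatchIdGuard {
    // Placeholder key, assumed to play the role of GeneratorJob.BATCH_ID.
    public static final String BATCH_ID = "generate.batch.id";

    /** Unguarded form, analogous to conf.set(GeneratorJob.BATCH_ID, batchId). */
    public static void setUnguarded(Properties conf, String batchId) {
        // Properties rejects null values, so a null batchId throws
        // NullPointerException here.
        conf.setProperty(BATCH_ID, batchId);
    }

    /** Guarded form: the batch id is recorded only when one was supplied. */
    public static void setGuarded(Properties conf, String batchId) {
        if (batchId != null) {
            conf.setProperty(BATCH_ID, batchId);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        try {
            setUnguarded(conf, null);
        } catch (NullPointerException e) {
            System.out.println("unguarded set threw NullPointerException");
        }

        // With the guard, a null batch id simply leaves the key unset.
        setGuarded(conf, null);
        System.out.println("key set after null batchId: "
                + conf.containsKey(BATCH_ID));

        // A non-null batch id is stored as before.
        setGuarded(conf, "batch-1");
        System.out.println("stored batchId: " + conf.getProperty(BATCH_ID));
    }
}
```

With the guard in place, code that later reads the key has to cope with it
being absent, so whether this is the right fix depends on how the indexing
job uses the batch id.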

I wanted to know whether this is a valid patch. From grepping the source, it
seems no one else reads GeneratorJob.BATCH_ID except IndexerJob.

I always see batchId passed as null to createIndexJob for clean crawls
(empty table). Which scenario causes it to be non-null, and what is the
significance of the generator job's batchId for the indexing job?

It seems a trivial issue, so I did not create a JIRA. I have attached
the small patch and would be glad if someone could take a look.

Regards,
Binoy

Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed.

Posted by Binoy d <bi...@gmail.com>.

Hi,
I am able to reproduce the issue from within Eclipse (not using the
scripts) with revision 1455209. Any later revision seems to break my
workspace, and I am not able to successfully run any crawl using either the
scripts or the Eclipse "Run As" options.

It seems the head revision of the 2.x branch (1462079) is not stable. Has
anyone been able to figure out the issue?

Regards,
Binoy



Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed.

Posted by Binoy d <bi...@gmail.com>.

Hi Kiran,

I was running the org.apache.nutch.crawl.Crawler class from within Eclipse
(via a "Run As" configuration) with the usual arguments: urls -dir
/home/binoy/lab/dmoz/apache-url -solr http://localhost:8983/solr/ -depth
1 -topN 1
Thanks for the tip on remote debugging. It seems the latest 2.x revision is
broken: I just updated to head and am seeing a completely different
exception. Let me revert the workspace and look at it again. I was able to
reproduce the issue consistently before I did the svn update.

Regards,
Binoy




Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed.

Posted by kiran chitturi <ch...@gmail.com>.

Hi Binoy,

Thanks for reporting the issue and for debugging it.

Did you try using the individual commands or the crawl script instead of the
crawl command?

You can try running Nutch remotely [1]. This lets you run commands from the
shell and debug them in Eclipse.

[1]
http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse




-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>