Posted to user@nutch.apache.org by 高睿 <ga...@163.com> on 2012/12/14 12:49:12 UTC

Nutch 2.1 crash with solr

Hi,

When I specify -solr on the command line, an exception is thrown.
Command line: urls -solr http://localhost:8080/solr/ -depth 1 -topN 3
I tried adding a '-batch 3' parameter to the command line, but it doesn't help. I looked into the code and found that the parameter is ignored somewhere.
So, how do I fix this? Thanks.

Skipping http://www.iguuu.com/thread-944-1-1.html; different batch id (null)
Skipping http://www.iguuu.com/thread-987-1-1.html; different batch id (null)
Exception in thread "main" java.lang.NullPointerException
    at java.util.Hashtable.put(Unknown Source)
    at java.util.Properties.setProperty(Unknown Source)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:438)
    at org.apache.nutch.indexer.IndexerJob.createIndexJob(IndexerJob.java:128)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:44)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:192)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

Regards,
Rui

Re:Nutch 2.1 crash with solr

Posted by 高睿 <ga...@163.com>.
Hi,

Here's more detail:
In Crawler.java
    for (int i = 0; i < args.length; i++) {
      if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i+1]);
        i++;
      } else if ("-depth".equals(args[i])) {
        depth = Integer.parseInt(args[i+1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i+1]);
        i++;
      } else if ("-solr".equals(args[i])) {
        solrUrl = StringUtils.lowerCase(args[i + 1]);
        i++;
      } else if ("-numTasks".equals(args[i])) {
        numTasks = Integer.parseInt(args[i+1]);
        i++;
      } else if ("-continue".equals(args[i])) {
        // skip
      } else if (args[i] != null) {
        seedDir = args[i];
      }
    }
    Map<String,Object> argMap = ToolUtil.toArgMap(
        Nutch.ARG_THREADS, threads,
        Nutch.ARG_DEPTH, depth,
        Nutch.ARG_TOPN, topN,
        Nutch.ARG_SOLR, solrUrl,
        Nutch.ARG_SEEDDIR, seedDir,
        Nutch.ARG_NUMTASKS, numTasks);
    run(argMap);

So argMap doesn't contain a 'batch' argument. But SolrIndexerJob.java tries to read that argument, so the value it gets is null:

  @Override
  public Map<String,Object> run(Map<String,Object> args) throws Exception {
    String solrUrl = (String)args.get(Nutch.ARG_SOLR);
    String batchId = (String)args.get(Nutch.ARG_BATCH);
    NutchIndexWriterFactory.addClassToConf(getConf(), SolrWriter.class);
    getConf().set(SolrConstants.SERVER_URL, solrUrl);

    currentJob = createIndexJob(getConf(), "solr-index", batchId);

Then, in IndexerJob.java, a NullPointerException is thrown because conf.set() is called with the null batchId:

  protected Job createIndexJob(Configuration conf, String jobName, String batchId)
  throws IOException, ClassNotFoundException {
    conf.set(GeneratorJob.BATCH_ID, batchId);
    Job job = new NutchJob(conf, jobName);



Re: Re: Nutch 2.1 crash with solr

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Rui,

The equivalent of 'batchId' in 1.x would be a segment. The batchId is an
identifier for a data structure that (initially) contains the generated URLs
which are ready for fetching.
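
As a rough illustration of that idea (not the actual Nutch data model), the sketch below treats the batch id as a mark stored next to each generated URL; a later step then only picks up the URLs whose mark matches the batch it was asked to process. The class name, the selectBatch helper, and the example URLs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchIdSketch {
    /** Returns the URLs whose stored mark equals the requested batch id. */
    static List<String> selectBatch(Map<String, String> marks, String batchId) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : marks.entrySet()) {
            // A URL with no mark (null) or a different mark is left for another batch.
            if (batchId != null && batchId.equals(entry.getValue())) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // URL -> batch id assigned when the URL was generated (null = not generated yet).
        Map<String, String> marks = new LinkedHashMap<>();
        marks.put("http://example.com/a", "batch-1");
        marks.put("http://example.com/b", "batch-2");
        marks.put("http://example.com/c", null);

        System.out.println(selectBatch(marks, "batch-1"));
    }
}
```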

hth

Lewis

-- 
*Lewis*

Re:Re: Nutch 2.1 crash with solr

Posted by 高睿 <ga...@163.com>.
Hi,

I would like to do that. But I still don't understand the concept of 'batch id'. Besides, is capturing the 'batch' argument on the command line the right direction?

Thanks.

Re: Nutch 2.1 crash with solr

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

Currently the batchID is initially set by the GeneratorJob#run() method
at line 169 [0]. It can also be overridden by the generate.batch.id
property in nutch-site.xml.

Currently if you look at line 117 in the crawl script [1] you will see that
there is a TODO to capture the batchID programmatically.

1) I would advise you to use this crawl script
2) If you are able to create an issue on Jira and then submit a patch for
it, that would be excellent.
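
To make "capture the batchID programmatically" a bit more concrete, here is a minimal sketch of the hand-off, using plain Java collections rather than the Nutch classes. It assumes a value configured under generate.batch.id takes precedence over a generated one. The "batch" map key, the method names, and the UUID-based fallback id are invented for illustration; the real GeneratorJob may build its id differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;

class BatchIdHandoff {
    // Property name mentioned in this thread.
    static final String BATCH_ID_PROPERTY = "generate.batch.id";
    // Hypothetical argument-map key, standing in for whatever the real jobs use.
    static final String ARG_BATCH = "batch";

    /** Uses the configured batch id if present, otherwise makes up a fresh one. */
    static String resolveBatchId(Properties conf) {
        String configured = conf.getProperty(BATCH_ID_PROPERTY);
        if (configured != null && !configured.trim().isEmpty()) {
            return configured.trim();
        }
        return UUID.randomUUID().toString(); // invented format, for illustration only
    }

    /** Stand-in for the generate step: returns the batch id it used. */
    static Map<String, Object> generate(Properties conf) {
        Map<String, Object> result = new HashMap<>();
        result.put(ARG_BATCH, resolveBatchId(conf));
        return result;
    }

    /** Stand-in for the index step: refuses to run without a batch id. */
    static String index(Map<String, Object> args) {
        Object batchId = args.get(ARG_BATCH);
        if (!(batchId instanceof String)) {
            throw new IllegalArgumentException("missing '" + ARG_BATCH + "' argument");
        }
        return "indexing batch " + batchId;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        Map<String, Object> generated = generate(conf); // capture the id here
        System.out.println(index(generated));           // and pass it along
    }
}
```

The point of the sketch is only the flow: the step that creates the batch id returns it, and the caller passes it on to the later steps instead of leaving the argument unset.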

Lewis

[0]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
[1]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/bin/crawl?view=markup

-- 
*Lewis*