Posted to dev@nutch.apache.org by Sami Siren <ss...@gmail.com> on 2009/02/28 09:26:10 UTC

planning for nutch-1.0-rc1

I am planning to build the first RC for Nutch 1.0 on the morning of 
Tue 3.3.2009 (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me; 
in fact, I only count the issues assigned to developers as real 
candidates to be included in 1.0:

NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)

I am also volunteering to push all open issues to 1.1 before starting 
the RC build on Tuesday. Any objections to the proposed procedure or timing?

--
 Sami Siren

Re: planning for nutch-1.0-rc1

Posted by Bartosz Gadzimski <ba...@o2.pl>.

Hello,

It's on 2 Linux boxes, one with CentOS and one with Ubuntu. Both run 
the "old" bin/nutch crawl properly.
The problem is that it doesn't throw an exception on the command line or 
in Eclipse; it just writes to the logs, so it's hard to debug.

One is running Nutch trunk from 07 March, and the other today's rc1.

Any hints? Maybe some logging properties or something?

In hadoop.log it looks exactly the same:

2009-03-09 12:12:09,452 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-03-09 12:12:09,452 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-03-09 12:12:09,560 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@6210fb
2009-03-09 12:12:09,560 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-agniesia441/mapred/local/index/_-174719952 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@48edb5 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@1ee2c2c ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-09 12:12:09,585 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:1)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:1)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-09 12:12:10,021 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
        at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)
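
Since the failure only shows up in the log, one way to get at it might be to filter the log for problem entries. The sketch below prints every WARN, ERROR or FATAL entry together with the lines that follow it (such as a stack trace) up to the next timestamped entry. The function name log_problems is made up for this example, and the location logs/hadoop.log is an assumption about the local setup.

```shell
#!/bin/sh
# Hypothetical helper: show only WARN/ERROR/FATAL entries of a log4j-style
# log, including continuation lines (stack traces, wrapped text).
log_problems() {
  awk '
    # A new entry starts with a date such as 2009-03-09; field 3 is the level.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      keep = ($3 == "WARN" || $3 == "ERROR" || $3 == "FATAL")
    }
    keep   # print the line while inside a kept entry
  ' "$1"
}

# Example call; the path is an assumption about where the log lives.
if [ -f logs/hadoop.log ]; then
  log_problems logs/hadoop.log
fi
```

Run against a log like the one above, this should leave only the LocalJobRunner warning with its NullPointerException trace and the final "Job failed!" entry.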


Thanks,
Bartosz


Dennis Kubes wrote:
> Sorry about the docs being sparse on this.  I will write more about 
> the process as time permits.  I don't know about the problem below.  
> What platform are you running on, Windows or Linux?
>

Re: planning for nutch-1.0-rc1

Posted by Dennis Kubes <ku...@apache.org>.

The output of FieldIndexer, like that of the current indexer, is an 
indexes folder.  This output can be named whatever you want, but to be 
searched by NutchBean it must sit in a base folder containing folders 
named either index and segments or indexes and segments.

The index folder would be a single index, as opposed to segment indexes.  
It is a little confusing here, as "segment indexes" does NOT refer to the 
Nutch segments content but to the output under the indexes folder, such 
as part-xxxxx folders each with a Lucene index, one part (index) per 
output of the reduce task.

You can use the bin/nutch merge command with an output similar to 
/yourfolder/crawl/index to merge (possibly multiple) segment indexes 
into a single index.  Nutch should be able to handle both segment 
and single indexes, though; it just handles them based on how they are named.
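
As a rough illustration of the naming convention described above, the following shell sketch classifies a crawl base folder by which of the folders are present. The function name index_layout and the labels it prints are made up for this example and are not part of Nutch; checking index before indexes is also an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: report which index layout a base
# folder has, going only by the folder names mentioned in this thread.
index_layout() {
  base=$1
  if [ ! -d "$base/segments" ]; then
    echo "missing-segments"      # no segments folder next to the index
  elif [ -d "$base/index" ]; then
    echo "single-index"          # one merged index (assumed to win if both exist)
  elif [ -d "$base/indexes" ]; then
    echo "segment-indexes"       # part-xxxxx folders, one per reduce task
  else
    echo "no-index"
  fi
}

# Example: classify the folder given as the first argument (default: crawl).
index_layout "${1:-crawl}"
```

For a folder reported as segment-indexes, the merge step mentioned above would then write to something like crawl/index; the exact argument order of bin/nutch merge is best checked against the usage message the command prints.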

Beyond that, Lucene can create a compound index file from a single 
index, where everything is in a single file.  I don't think Nutch 
currently supports creating that, though it should support querying 
it.

Dennis

Bartosz Gadzimski wrote:
> Hello Dennis,
> 
> We've been trying your new framework and indexer and everything looks 
> better now. But we can't understand what the output of the last 
> command (FieldIndexer) should be.
> 
> We have:
> user@kubuntu:~/nutch-1.0$ ls crawl/indexes/part-00000/
> index.done         segments_1         segments.gen
> .index.done.crc    .segments_1.crc    .segments.gen.crc
> 
> How do we generate a "normal" index from those indexes?
> 
> Thanks in advance,
> Bartosz

Re: planning for nutch-1.0-rc1

Posted by Bartosz Gadzimski <ba...@o2.pl>.

Hello Dennis,

We've been trying your new framework and indexer and everything looks 
better now. But we can't understand what the output of the last 
command (FieldIndexer) should be.

We have:
user@kubuntu:~/nutch-1.0$ ls crawl/indexes/part-00000/
index.done         segments_1         segments.gen
.index.done.crc    .segments_1.crc    .segments.gen.crc

How do we generate a "normal" index from those indexes?

Thanks in advance,
Bartosz


Dennis Kubes wrote:
> Sorry about the docs being sparse on this.  I will write more about 
> the process as time permits.  I don't know about the problem below.  
> What platform are you running on, Windows or Linux?
>
> Dennis

Re: planning for nutch-1.0-rc1

Posted by Dennis Kubes <ku...@apache.org>.

Sorry about the docs being sparse on this.  I will write more about the 
process as time permits.  I don't know about the problem below.  What 
platform are you running on, Windows or Linux?

Dennis

Bartosz Gadzimski wrote:
> Hello,
> 
> Thanks, Dennis, for updating the wiki; it helped a lot.
> 
> You gave an example with indexing but didn't say much about it. Can 
> you write some more? :)
> 
> Anyway, I have problems at the last step (Nutch from 07 March):
> 
> bin/nutch org.apache.nutch.indexer.field.FieldIndexer
> 
> It simply stops somewhere:
> 
> 2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
> 2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
> 2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
> 2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
> .... plugins....
> 
> 2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@1b4a74b
> 2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@15356d5 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
> 2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
> java.lang.NullPointerException
>         at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
>         at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
>         at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
>         at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
>         at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
> 2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>         at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
>         at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)
> 
> 
> 
> 
> In crawl/indexes there is only a _temporary folder.
> 
> I will try to debug this, but I have problems running Nutch in Eclipse.
> 
> Thanks,
> Bartosz
> 
> 
> 
> Dennis Kubes wrote:
>> I don't know if I would make this primary yet.  I need to check what 
>> is causing this, as it worked fine for me; in fact, we currently have 
>> it in production.  Also, we would need to update the shell scripts to 
>> integrate this more tightly.
>>
>> Dennis
>>
>> Bartosz Gadzimski wrote:
>>> Sami Siren wrote:
>>>> Andrzej Bialecki wrote:
>>>>> Sami Siren wrote:
>>>>>> I am planning to build the first RC for Nutch 1.0 on the morning 
>>>>>> of Tue 3.3.2009 (EET). There are still some issues marked as fix 
>>>>>> for 1.0 in Jira. Neither of the two remaining _bugs_ seems too 
>>>>>> important to me; in fact, I only count the issues assigned to 
>>>>>> developers as real candidates to be included in 1.0:
>>>>>>
>>>>>> NUTCH-578 (kubes)
>>>>>> NUTCH-477 (ab)
>>>>>> NUTCH-669 (siren)
>>>>>
>>>>> There's one Critical issue reported, related to NekoHTML 
>>>>> (NUTCH-700). I'm not sure what the feature differences 
>>>>> (pertinent to Nutch) between 0.9.4 and 1.9.11 are - perhaps 
>>>>> downgrading is the safest course of action.
>>>> I will take care of that.
>>>>>
>>>>>
>>>>>> I am also volunteering to push all open issues to 1.1 before 
>>>>>> starting the RC build on Tuesday. Any objections to the proposed 
>>>>>> procedure or timing?
>>>>>
>>>>> Sounds good.
>>>> great!
>>>>
>>>> -- 
>>>> Sami Siren
>>>>
>>>>
>>>>
>>> What about the new scoring and new indexing? Will it be integrated as 
>>> the primary scoring algorithm? I have a problem with it in LinkRank:
>>>
>>> 2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
>>> 2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
>>> 2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
>>> 2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
>>> 2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
>>>
>>> Another question: what about the indexing framework mentioned here:
>>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11764.html
>>>
>>>
>>> Having all the new scoring and indexing would be a real step forward.
>>>
>>> Thanks,
>>> Bartosz
>>>
>>
>

Re: planning for nutch-1.0-rc1

Posted by Bartosz Gadzimski <ba...@o2.pl>.

Hello,

Thanks, Dennis, for updating the wiki; it helped a lot.

You gave an example with indexing but didn't say much about it. Can 
you write some more? :)

Anyway, I have problems at the last step (Nutch from 07 March):

bin/nutch org.apache.nutch.indexer.field.FieldIndexer

It simply stops somewhere:

2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
.... plugins....

2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@1b4a74b
2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@15356d5 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
        at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)




crawl/indexes contains only a _temporary folder.

I will try to debug this, but I am having problems running Nutch in Eclipse.
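
To see quickly where a trace like the one above starts, the first stack frame after the exception name can be pulled out of the log lines. This is only a rough sketch in plain Java: the class StackTraceFrames and the method firstFrameAfter are made-up names for illustration, not part of Nutch or Hadoop, and it assumes each frame sits on a single line, as in the trace above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Hadoop.
final class StackTraceFrames {
    // Matches one frame line such as "    at pkg.Class$1.method(File.java:139)".
    private static final Pattern FRAME =
        Pattern.compile("^\\s*at\\s+([\\w.$<>]+)\\((\\w+\\.java):(\\d+)\\)\\s*$");

    // Returns "File.java:line" for the first frame that follows the first
    // line containing exceptionName, or null if no such frame is found.
    static String firstFrameAfter(List<String> logLines, String exceptionName) {
        boolean seen = false;
        for (String line : logLines) {
            if (!seen) {
                seen = line.contains(exceptionName);
                continue;
            }
            Matcher m = FRAME.matcher(line);
            if (m.matches()) {
                return m.group(2) + ":" + m.group(3);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Lines copied from the hadoop.log excerpt in this message.
        List<String> log = Arrays.asList(
            "2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001",
            "java.lang.NullPointerException",
            "        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)",
            "        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)");
        System.out.println(firstFrameAfter(log, "NullPointerException"));
    }
}
```

For the excerpt above this points at FieldIndexer.java:139, the write call in the output format, which is where a debugger breakpoint would go. Traces whose frames were wrapped onto two lines by a mail client would need to be rejoined first.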

Thanks,
Bartosz



Dennis Kubes wrote:
> I don't know if I would make this primary yet.  I need to check what 
> is causing this as it worked fine for me, in fact we currently have it 
> in production.  Also we would need to update the shell scripts to 
> integrate this more tightly.
>
> Dennis
>
> Bartosz Gadzimski wrote:
>> Sami Siren wrote:
>>> Andrzej Bialecki wrote:
>>>> Sami Siren wrote:
>>>>> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
>>>>> morning (EET). There are still some issues marked as fix for 1.0 
>>>>> in Jira. Neither of the two remaining _bugs_ seems too important 
>>>>> to me, actually I only count the issues assigned to developers as 
>>>>> real candidates to be included in 1.0:
>>>>>
>>>>> NUTCH-578 (kubes)
>>>>> NUTCH-477 (ab)
>>>>> NUTCH-669 (siren)
>>>>
>>>> There's one Critical issue reported, related to NekoHTML 
>>>> (NUTCH-700). I'm not sure what are the feature differences 
>>>> (pertinent to Nutch) between 0.9.4 and 1.9.11 - perhaps downgrading 
>>>> is the safest course of action.
>>> I will take care of that.
>>>>
>>>>
>>>>> I am also volunteering to push all open issues to 1.1 before 
>>>>> starting the RC build on Tuesday. Any objections on the proposed 
>>>>> procedure or timing?
>>>>
>>>> Sounds good.
>>> great!
>>>
>>> -- 
>>> Sami Siren
>>>
>>>
>>>
>> What about the new scoring and new indexing? Will they be integrated as 
>> the primary scoring algorithm? I have a problem with LinkRank:
>>
>> 2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
>> 2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
>> 2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
>> 2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
>> 2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
>>        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
>>        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
>>        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
>>
>> Another question: what about the indexing framework mentioned here:
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11764.html
>>
>>
>> Having all this new scoring and indexing would be a real step forward.
>>
>> Thanks,
>> Bartosz
>>
>

Re: planning for nutch-1.0-rc1

Posted by Andrzej Bialecki <ab...@getopt.org>.

Dennis Kubes wrote:
> I don't know if I would make this primary yet.

Not before 1.0 ... ;) After that, we need to discuss what to do with the 
new and the current framework.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: planning for nutch-1.0-rc1

Posted by Dennis Kubes <ku...@apache.org>.

I don't know if I would make this primary yet.  I need to check what is 
causing this as it worked fine for me, in fact we currently have it in 
production.  Also we would need to update the shell scripts to integrate 
this more tightly.

Dennis

Bartosz Gadzimski wrote:
> Sami Siren wrote:
>> Andrzej Bialecki wrote:
>>> Sami Siren wrote:
>>>> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
>>>> morning (EET). There are still some issues marked as fix for 1.0 in 
>>>> Jira. Neither of the two remaining _bugs_ seems too important to me, 
>>>> actually I only count the issues assigned to developers as real 
>>>> candidates to be included in 1.0:
>>>>
>>>> NUTCH-578 (kubes)
>>>> NUTCH-477 (ab)
>>>> NUTCH-669 (siren)
>>>
>>> There's one Critical issue reported, related to NekoHTML (NUTCH-700). 
>>> I'm not sure what are the feature differences (pertinent to Nutch) 
>>> between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course 
>>> of action.
>> I will take care of that.
>>>
>>>
>>>> I am also volunteering to push all open issues to 1.1 before 
>>>> starting the RC build on Tuesday. Any objections on the proposed 
>>>> procedure or timing?
>>>
>>> Sounds good.
>> great!
>>
>> -- 
>> Sami Siren
>>
>>
>>
> What about the new scoring and new indexing? Will they be integrated as 
> the primary scoring algorithm? I have a problem with LinkRank:
> 
> 2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
> 2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
> 2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
> 2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
> 2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
>        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
>        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
>        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
> 
> Another question: what about the indexing framework mentioned here:
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11764.html
> 
> 
> Having all this new scoring and indexing would be a real step forward.
> 
> Thanks,
> Bartosz
>

Re: planning for nutch-1.0-rc1

Posted by Bartosz Gadzimski <ba...@o2.pl>.

Sami Siren wrote:
> Andrzej Bialecki wrote:
>> Sami Siren wrote:
>>> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
>>> morning (EET). There are still some issues marked as fix for 1.0 in 
>>> Jira. Neither of the two remaining _bugs_ seems too important to me, 
>>> actually I only count the issues assigned to developers as real 
>>> candidates to be included in 1.0:
>>>
>>> NUTCH-578 (kubes)
>>> NUTCH-477 (ab)
>>> NUTCH-669 (siren)
>>
>> There's one Critical issue reported, related to NekoHTML (NUTCH-700). 
>> I'm not sure what are the feature differences (pertinent to Nutch) 
>> between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course 
>> of action.
> I will take care of that.
>>
>>
>>> I am also volunteering to push all open issues to 1.1 before 
>>> starting the RC build on Tuesday. Any objections on the proposed 
>>> procedure or timing?
>>
>> Sounds good.
> great!
>
> -- 
> Sami Siren
>
>
>
What about the new scoring and new indexing? Will they be integrated as 
the primary scoring algorithm? I have a problem with LinkRank:

2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)

Another question: what about the indexing framework mentioned here:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11764.html


Having all this new scoring and indexing would be a real step forward.

Thanks,
Bartosz

Re: planning for nutch-1.0-rc1

Posted by Sami Siren <ss...@gmail.com>.

Andrzej Bialecki wrote:
> Sami Siren wrote:
>> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
>> morning (EET). There are still some issues marked as fix for 1.0 in 
>> Jira. Neither of the two remaining _bugs_ seems too important to me, 
>> actually I only count the issues assigned to developers as real 
>> candidates to be included in 1.0:
>>
>> NUTCH-578 (kubes)
>> NUTCH-477 (ab)
>> NUTCH-669 (siren)
>
> There's one Critical issue reported, related to NekoHTML (NUTCH-700). 
> I'm not sure what are the feature differences (pertinent to Nutch) 
> between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course of 
> action.
I will take care of that.
>
>
>> I am also volunteering to push all open issues to 1.1 before starting 
>> the RC build on Tuesday. Any objections on the proposed procedure or 
>> timing?
>
> Sounds good.
great!

--
 Sami Siren

Re: planning for nutch-1.0-rc1

Posted by Andrzej Bialecki <ab...@getopt.org>.

Sami Siren wrote:
> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
> morning (EET). There are still some issues marked as fix for 1.0 in 
> Jira. Neither of the two remaining _bugs_ seems too important to me, 
> actually I only count the issues assigned to developers as real 
> candidates to be included in 1.0:
> 
> NUTCH-578 (kubes)
> NUTCH-477 (ab)
> NUTCH-669 (siren)

There's one Critical issue reported, related to NekoHTML (NUTCH-700). 
I'm not sure what are the feature differences (pertinent to Nutch) 
between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course of 
action.


> I am also volunteering to push all open issues to 1.1 before starting 
> the RC build on Tuesday. Any objections on the proposed procedure or 
> timing?

Sounds good.



-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: planning for nutch-1.0-rc1

Posted by Andrzej Bialecki <ab...@getopt.org>.

Sami Siren wrote:


> NUTCH-477 (ab)

I decided to postpone this - the patch brings a lot of complexity, and 
it seems that it would be useful to few users.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: planning for nutch-1.0-rc1

Posted by Sami Siren <ss...@gmail.com>.

I am sure all of you noticed that the release planned to be cut during 
this week was delayed because of a new discovery right before the 
deadline (NUTCH-711). That has now been fixed so it's time to move on. I 
am now going to build the first RC during the weekend.

--
 Sami Siren

Sami Siren wrote:
> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
> morning (EET). There are still some issues marked as fix for 1.0 in 
> Jira. Neither of the two remaining _bugs_ seems too important to me, 
> actually I only count the issues assigned to developers as real 
> candidates to be included in 1.0:
>
> NUTCH-578 (kubes)
> NUTCH-477 (ab)
> NUTCH-669 (siren)
>
> I am also volunteering to push all open issues to 1.1 before starting 
> the RC build on Tuesday. Any objections on the proposed procedure or 
> timing?
>
> -- 
> Sami Siren
>

Re: planning for nutch-1.0-rc1

Posted by Dennis Kubes <ku...@apache.org>.

NUTCH-578 was a while back but as I remember it worked fine.  No 
objections to either including or pushing it.

Dennis

Sami Siren wrote:
> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
> morning (EET). There are still some issues marked as fix for 1.0 in 
> Jira. Neither of the two remaining _bugs_ seems too important to me, 
> actually I only count the issues assigned to developers as real 
> candidates to be included in 1.0:
> 
> NUTCH-578 (kubes)
> NUTCH-477 (ab)
> NUTCH-669 (siren)
> 
> I am also volunteering to push all open issues to 1.1 before starting 
> the RC build on Tuesday. Any objections on the proposed procedure or 
> timing?
> 
> -- 
> Sami Siren
>