Posted to user@nutch.apache.org by EM <em...@cpuedge.com> on 2005/08/30 08:50:16 UTC
Analyser error
What does it mean if the bin/nutch analyze db 7 fails with:
050830 024914 Target pages from init(): 27419
050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
Finished at Tue Aug 30 02:49:14 EDT 2005
Exception in thread "main" java.io.IOException: already exists: db\webdb.new\pagesByURL
    at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
    at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
    at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
    at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
    at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
    at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
RE: Analyser error
Posted by EM <em...@cpuedge.com>.
I did it with 0.6 (late versions) without the analyzer part and it all went
fine.
It's no big deal, I'll figure something out. Just wanted to let you know.
-----Original Message-----
From: Piotr Kosiorowski [mailto:pkosiorowski@gmail.com]
Sent: Wednesday, August 31, 2005 2:56 PM
To: nutch-user@lucene.apache.org
Subject: Re: Analyser error
I have never done it this way, creating webdb content from segments
only, so I do not know if it works. I do not have time to test it
myself at the moment, sorry.
Regards
Piotr
Re: Analyser error
Posted by Piotr Kosiorowski <pk...@gmail.com>.
I have never done it this way, creating webdb content from segments
only, so I do not know if it works. I do not have time to test it
myself at the moment, sorry.
Regards
Piotr
dedup between segments
Posted by Michael Ji <fj...@yahoo.com>.
Hi,
Is there a way to delete duplicate documents across two segments?
I see that DeleteDuplicates.java can dedup a single segment based on
each document's content MD5 and URL.
But if I have two segments fetched in two different time periods and
am not sure whether they contain duplicate documents, should I dedup
them?
Is that done in IndexMerger.java? I didn't see any dedup logic there,
or even within org.apache.lucene.index.IndexWriter.
Does that mean document duplication is OK across multiple segments?
thanks,
Michael Ji,
RE: Analyser error
Posted by EM <em...@cpuedge.com>.
The problem is still there, maybe I'm doing something wrong?
1. 'rm -r db'
2. 'mkdir db'
3. 'bin/nutch admin db -create'
4. I then run updatedb on db from a fetched segment; this should fill it up
with links?
5. 'bin/nutch analyze db 7'
And it fails here with three 'tmp<something>' directories and webdb.new
-----Original Message-----
From: Piotr Kosiorowski [mailto:pkosiorowski@gmail.com]
Sent: Tuesday, August 30, 2005 3:07 PM
To: nutch-user@lucene.apache.org
Subject: Re: Analyser error
It looks like you have temporary results from a previous run (probably
killed or terminated unsuccessfully). It should be safe to remove the
db\webdb.new directory and start again.
Regards
Piotr
Re: Analyser error
Posted by Piotr Kosiorowski <pk...@gmail.com>.
It looks like you have temporary results from a previous run (probably
killed or terminated unsuccessfully). It should be safe to remove the
db\webdb.new directory and start again.
Regards
Piotr
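
A minimal sketch of the cleanup suggested above, assuming a Unix-like shell
(the trace shows Windows-style paths, so adjust the separators as needed).
The function name clean_stale_webdb is made up for illustration and is not
part of Nutch. It only removes webdb.new; any tmp* directories are merely
reported, since this thread does not confirm that they are safe to delete.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): remove a leftover webdb.new
# directory from an interrupted run so the next analysis can recreate it.
clean_stale_webdb() {
    db_dir="${1:-db}"            # database directory; defaults to "db" as in this thread
    stale="$db_dir/webdb.new"

    if [ -d "$stale" ]; then
        echo "Removing stale directory: $stale"
        rm -r "$stale"
    else
        echo "No stale directory found at: $stale"
    fi

    # Only report leftover tmp* directories; whether they can be deleted
    # safely is an open question in this thread.
    for d in "$db_dir"/tmp*; do
        [ -d "$d" ] && echo "Leftover temporary directory: $d"
    done
    return 0
}
```

For example, calling clean_stale_webdb db and then retrying
bin/nutch analyze db 7 would follow the advice given here.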
EM wrote:
> What does it mean if the bin/nutch analyze db 7 fails with:
>
>
> 050830 024914 Target pages from init(): 27419
> 050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172
> seconds.
> 050830 024914 Processing pagesByURL: Sorted 159412.79069767444
> instructions/second
> Finished at Tue Aug 30 02:49:14 EDT 2005
> Exception in thread "main" java.io.IOException: already exists:
> db\webdb.new\pagesByURL
> at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
> at
> org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:54
> 9)
> at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
> at
> org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnal
> ysisTool.java:562)
> at
> org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
> at
> org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
>
>