Posted to user@nutch.apache.org by EM <em...@cpuedge.com> on 2005/08/30 08:50:16 UTC

Analyser error

What does it mean if the bin/nutch analyze db 7 fails with:


050830 024914 Target pages from init(): 27419
050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
Finished at Tue Aug 30 02:49:14 EDT 2005
Exception in thread "main" java.io.IOException: already exists: db\webdb.new\pagesByURL
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
        at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
        at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
        at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)

RE: Analyser error

Posted by EM <em...@cpuedge.com>.

I did it with 0.6 (a late version) without the analyze step and it all went
fine.

It's no big deal, I'll figure something out. Just wanted to let you know.

-----Original Message-----
From: Piotr Kosiorowski [mailto:pkosiorowski@gmail.com] 
Sent: Wednesday, August 31, 2005 2:56 PM
To: nutch-user@lucene.apache.org
Subject: Re: Analyser error

I have never done it this way - creating webdb content based on segments 
only - so I do not know if it works. I do not have time at the moment to 
test it myself, sorry.
Regards
Piotr

EM wrote:
> The problem is still there; maybe I'm doing something wrong?
> 
> 1. 'rm -r db'
> 2. 'mkdir db'
> 3. 'bin/nutch admin db -create'
> 4. I then run updatedb on db from a fetched segment; this should fill it
> with links, right?
> 5. 'bin/nutch analyze db 7'
> And it fails here, with three 'tmp<something>' directories and webdb.new
> present.
> 
> 
> 
> -----Original Message-----
> From: Piotr Kosiorowski [mailto:pkosiorowski@gmail.com] 
> Sent: Tuesday, August 30, 2005 3:07 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Analyser error
> 
> It looks like you have temporary results from a previous run (probably 
> killed or terminated unsuccessfully). It should be safe to remove the 
> db\webdb.new directory and start again.
> Regards
> Piotr
> EM wrote:
> 
>>What does it mean if the bin/nutch analyze db 7 fails with:
>>
>>
>>050830 024914 Target pages from init(): 27419
>>050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
>>050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
>>Finished at Tue Aug 30 02:49:14 EDT 2005
>>Exception in thread "main" java.io.IOException: already exists: db\webdb.new\pagesByURL
>>        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>>        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>>        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>>        at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
>>        at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
>>        at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
>>

Re: Analyser error

Posted by Piotr Kosiorowski <pk...@gmail.com>.

I have never done it this way - creating webdb content based on segments 
only - so I do not know if it works. I do not have time at the moment to 
test it myself, sorry.
Regards
Piotr

EM wrote:
> The problem is still there; maybe I'm doing something wrong?
> 
> 1. 'rm -r db'
> 2. 'mkdir db'
> 3. 'bin/nutch admin db -create'
> 4. I then run updatedb on db from a fetched segment; this should fill it
> with links, right?
> 5. 'bin/nutch analyze db 7'
> And it fails here, with three 'tmp<something>' directories and webdb.new
> present.
> 
> 
> 
> -----Original Message-----
> From: Piotr Kosiorowski [mailto:pkosiorowski@gmail.com] 
> Sent: Tuesday, August 30, 2005 3:07 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Analyser error
> 
> It looks like you have temporary results from a previous run (probably 
> killed or terminated unsuccessfully). It should be safe to remove the 
> db\webdb.new directory and start again.
> Regards
> Piotr
> EM wrote:
> 
>>What does it mean if the bin/nutch analyze db 7 fails with:
>>
>>
>>050830 024914 Target pages from init(): 27419
>>050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
>>050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
>>Finished at Tue Aug 30 02:49:14 EDT 2005
>>Exception in thread "main" java.io.IOException: already exists: db\webdb.new\pagesByURL
>>        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>>        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>>        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>>        at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
>>        at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
>>        at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
>>

dedup between segments

Posted by Michael Ji <fj...@yahoo.com>.

Hi,

Is there a way to delete duplicate documents across two segments?

I see that DeleteDuplicates.java can dedup a single segment based on
each document's content MD5 and URL.

But what if I have two segments fetched in two different time periods
and am not sure whether they contain duplicate documents? Should I
dedup them?

Is that done in IndexMerger.java? I didn't see any dedup logic there,
or even within org.apache.lucene.index.IndexWriter.

Does that mean duplicate documents across multiple segments are OK?

thanks,

Michael Ji,
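
To make the question concrete, here is a minimal Java sketch of what "dedup by content MD5 and URL across two segments" could mean. It is not Nutch code and does not describe how DeleteDuplicates or IndexMerger behave: the class SegmentDedupSketch, the Doc record and the dedup method are hypothetical names, documents are held in memory rather than read from segments, and the keep-the-first-seen policy is an assumption.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of Nutch.
class SegmentDedupSketch {

    // Minimal stand-in for a fetched document (assumed fields).
    static final class Doc {
        final String segment;
        final String url;
        final String content;

        Doc(String segment, String url, String content) {
            this.segment = segment;
            this.url = url;
            this.content = content;
        }
    }

    // Returns the MD5 of the content as a 32-character hex string.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps a document only if neither its URL nor its content hash
    // was seen earlier in the list (assumed policy: first one wins).
    static List<Doc> dedup(List<Doc> docs) {
        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<Doc> kept = new ArrayList<>();
        for (Doc d : docs) {
            String hash = md5Hex(d.content);
            if (seenUrls.contains(d.url) || seenHashes.contains(hash)) {
                continue; // duplicate by URL or by content
            }
            seenUrls.add(d.url);
            seenHashes.add(hash);
            kept.add(d);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Documents from two made-up segments, concatenated in fetch order.
        List<Doc> all = Arrays.asList(
            new Doc("segA", "http://example.com/a", "hello"),
            new Doc("segA", "http://example.com/b", "world"),
            new Doc("segB", "http://example.com/a", "hello, updated"), // same URL as in segA
            new Doc("segB", "http://example.com/c", "world"),          // same content as in segA
            new Doc("segB", "http://example.com/d", "new page"));
        for (Doc d : dedup(all)) {
            System.out.println(d.segment + " " + d.url);
        }
    }
}
```

Whether the older or the newer copy should win is a policy choice; reversing the input order before calling dedup would keep the most recently fetched copy instead.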


	
		

RE: Analyser error

Posted by EM <em...@cpuedge.com>.

The problem is still there; maybe I'm doing something wrong?

1. 'rm -r db'
2. 'mkdir db'
3. 'bin/nutch admin db -create'
4. I then run updatedb on db from a fetched segment; this should fill it
with links, right?
5. 'bin/nutch analyze db 7'
And it fails here, with three 'tmp<something>' directories and webdb.new
present.



-----Original Message-----
From: Piotr Kosiorowski [mailto:pkosiorowski@gmail.com] 
Sent: Tuesday, August 30, 2005 3:07 PM
To: nutch-user@lucene.apache.org
Subject: Re: Analyser error

It looks like you have temporary results from a previous run (probably 
killed or terminated unsuccessfully). It should be safe to remove the 
db\webdb.new directory and start again.
Regards
Piotr
EM wrote:
> What does it mean if the bin/nutch analyze db 7 fails with:
> 
> 
> 050830 024914 Target pages from init(): 27419
> 050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
> 050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
> Finished at Tue Aug 30 02:49:14 EDT 2005
> Exception in thread "main" java.io.IOException: already exists: db\webdb.new\pagesByURL
>         at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>         at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>         at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>         at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
>         at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
>         at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
>

Re: Analyser error

Posted by Piotr Kosiorowski <pk...@gmail.com>.

It looks like you have temporary results from a previous run (probably 
killed or terminated unsuccessfully). It should be safe to remove the 
db\webdb.new directory and start again.
Regards
Piotr
EM wrote:
> What does it mean if the bin/nutch analyze db 7 fails with:
> 
> 
> 050830 024914 Target pages from init(): 27419
> 050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
> 050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
> Finished at Tue Aug 30 02:49:14 EDT 2005
> Exception in thread "main" java.io.IOException: already exists: db\webdb.new\pagesByURL
>         at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>         at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>         at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>         at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
>         at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
>         at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
>
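
The suggestion in this reply (remove the leftover db\webdb.new directory before running the analysis again) could be sketched in Java roughly as follows. This is only an illustration and not part of Nutch: the class StaleWebDbCleanup and its deleteIfPresent method are hypothetical names, and the assumption that webdb.new is the only thing to remove comes from the reply above, not from the Nutch source.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch.
class StaleWebDbCleanup {

    // Recursively deletes dir if it exists.
    // Returns true if something was deleted, false if there was nothing to do.
    static boolean deleteIfPresent(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            entries = walk.sorted(Comparator.reverseOrder())
                          .collect(Collectors.toList());
        }
        for (Path p : entries) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // "db" is the directory name used in this thread;
        // pass a different path as the first argument if needed.
        Path db = Paths.get(args.length > 0 ? args[0] : "db");
        Path stale = db.resolve("webdb.new");
        System.out.println(deleteIfPresent(stale)
                ? "Removed " + stale
                : "Nothing to remove at " + stale);
    }
}
```

Run it only while no Nutch tool is writing to the db directory, since it deletes webdb.new unconditionally.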