Posted to user@nutch.apache.org by "Peters, Vijaya" <Vi...@sra.com> on 2009/12/04 14:18:04 UTC

How to force recrawl of everything

I am using Nutch 1.0.  I want to perform a 'clean' crawl.  

 

I see the force option in this patch: NUTCH-601v1.0.patch
<https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0.patch>

Do I have to make those code changes, or does Nutch 1.0 have another way
to do this?
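(Editor's note: one unpatched route worth knowing about is the Generator's -adddays option, which shifts the generator's clock forward so URLs whose fetch interval has not yet elapsed become due again. A sketch, assuming a crawl directory named crawl and the default 30-day fetch interval:)

```
# Advance the generator clock past the default 30-day fetch interval
# so already-fetched URLs are selected again (paths are assumptions).
bin/nutch generate crawl/crawldb crawl/segments -adddays 31
```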

 

Also, every time I do another crawl, I see the same file being fetched
over and over again. Is it appending the same URL over and over to the
fetch list?

 

Thanks,

- Vijaya

 

 

Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA  22033
Tel:  703-502-1184

www.sra.com <http://www.sra.com/> 
Named to FORTUNE's "100 Best Companies to Work For" list for 10
consecutive years

Please consider the environment before printing this e-mail

This electronic message transmission contains information from SRA
International, Inc. which may be confidential, privileged or
proprietary.  The information is intended for the use of the individual
or entity named above.  If you are not the intended recipient, be aware
that any disclosure, copying, distribution, or use of the contents of
this information is strictly prohibited.  If you have received this
electronic information in error, please notify us immediately by
telephone at 866-584-2143.

 


RE: How to force recrawl of everything

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Running:

bin/nutch readdb crawldb -url <url>

I got the following exception. Also, how do I force a recrawl in Nutch 1.0?


Exception in thread "main" java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:32)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:104)
        at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:380)
        at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:386)
        at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:511)
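(Editor's note: the divide-by-zero comes from HashPartitioner taking the key's hash modulo the number of map files found under the given crawldb path; if that path does not point at an actual crawldb, the count is zero. A sketch of an invocation to try, assuming the crawl was written into a directory named crawl:)

```
# Point readdb at the crawldb directory inside the crawl dir,
# not at the crawl dir itself (paths and URL here are assumptions).
bin/nutch readdb crawl/crawldb -url http://www.example.com/
```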

-----Original Message-----
From: reinhard schwab [mailto:reinhard.schwab@aon.at] 
Sent: Friday, December 04, 2009 8:32 AM
To: nutch-user@lucene.apache.org
Subject: Re: How to force recrawl of everything

Peters, Vijaya wrote:
> I am using Nutch 1.0.  I want to perform a 'clean' crawl.  
>
>  
>
> I see the force option in this patch:  NUTCH-601v1.0.patch
>
> <https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0.patch>
>
> Do I have to make those code changes, or does Nutch 1.0 have another way
> to do this?
>
>  
>
> Also, everytime I do another crawl, I see the same file being fetched
> over and over again. Is it appending the same url over and over to the
>   
Which file? You can check the crawl date of this file with:

reinhard@thord:> bin/nutch readdb <crawldb> -url <url>


> fetch list?
>
>  
>
> Thanks,
>
> - Vijaya
>
>  
>
>  
>


Re: How to force recrawl of everything

Posted by reinhard schwab <re...@aon.at>.
Peters, Vijaya wrote:
> I am using Nutch 1.0.  I want to perform a 'clean' crawl.  
>
>  
>
> I see the force option in this patch:  NUTCH-601v1.0.patch
> <https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0.patch>
>
> Do I have to make those code changes, or does Nutch 1.0 have another way
> to do this?
>
>  
>
> Also, everytime I do another crawl, I see the same file being fetched
> over and over again. Is it appending the same url over and over to the
>   
Which file? You can check the crawl date of this file with:

reinhard@thord:> bin/nutch readdb <crawldb> -url <url>
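(Editor's note: if the URL is present in the crawldb, the reader prints the stored CrawlDatum; the output looks roughly like the following. Field names follow Nutch 1.0's CrawlDbReader; the values shown are purely illustrative. The "Fetch time" line is the crawl date in question.)

```
URL: http://www.example.com/
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Dec 04 14:18:04 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
```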


> fetch list?
>
>  
>
> Thanks,
>
> - Vijaya
>
>  
>
>  
>