Posted to user@nutch.apache.org by raviksingh <ra...@gmail.com> on 2013/05/04 20:55:51 UTC

Nutch Crawls Again and again

Hi,
    I have written a Java program that calls the "crawl" command. It fetches
pages and stores the results in MySQL. However, when it is called again, the
same URLs are fetched again and again, which certainly slows the process. The
status for many URLs is now "2", yet they still get fetched every time. What
could be the problem? Please help.

Regards
Ravi Singh



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Crawls-Again-and-again-tp4060834.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch Crawls Again and again

Posted by Tejas Patil <te...@gmail.com>.

Just checked the code to confirm that. Protocol Status = "2" corresponds
to FAILED. Nutch will attempt to fetch those URLs again in subsequent rounds,
in the hope that it can fetch them. After the limit 'db.fetch.retry.max' is
reached, it will mark the URL as DB_GONE and won't attempt it again.
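
The retry behaviour described above can be pictured with a small toy model.
This is only a sketch under stated assumptions, not Nutch's actual code: the
class RetrySketch, its State enum, and the rule "give up once the failure
count reaches the limit" are all made up for illustration, and the exact
comparison Nutch uses against 'db.fetch.retry.max' may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of "retry until a limit, then give up"; not Nutch's actual code. */
final class RetrySketch {
    enum State { UNFETCHED, FETCHED, GONE }

    // Hypothetical stand-in for the configured 'db.fetch.retry.max' value.
    private final int retryMax;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, State> states = new HashMap<>();

    RetrySketch(int retryMax) {
        this.retryMax = retryMax;
    }

    /** Records the outcome of one fetch attempt and returns the URL's new state. */
    State record(String url, boolean success) {
        if (states.get(url) == State.GONE) {
            return State.GONE; // already given up: the outcome is ignored
        }
        State next;
        if (success) {
            failures.remove(url);
            next = State.FETCHED;
        } else {
            int count = failures.merge(url, 1, Integer::sum);
            // Assumption: give up once the failure count reaches the limit.
            next = (count >= retryMax) ? State.GONE : State.UNFETCHED;
        }
        states.put(url, next);
        return next;
    }

    /** True if the URL would be attempted in the next round (re-fetch intervals are ignored). */
    boolean isDue(String url) {
        return states.getOrDefault(url, State.UNFETCHED) == State.UNFETCHED;
    }

    public static void main(String[] args) {
        RetrySketch db = new RetrySketch(3);
        String url = "http://example.com/broken"; // made-up URL that always fails
        for (int round = 1; round <= 4; round++) {
            if (!db.isDue(url)) {
                System.out.println("round " + round + ": skipped");
                continue;
            }
            System.out.println("round " + round + ": " + db.record(url, false));
        }
    }
}
```

In this model a URL that keeps failing is retried in each round until the
limit is hit, after which it is skipped. A URL that is fetched in every round
well beyond the limit therefore points at some other cause.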

On Sat, May 4, 2013 at 12:04 PM, Tejas Patil <te...@gmail.com> wrote:

> My guess is that those URLs were not fetched successfully, so they are
> being retried in every round of the crawl.

Re: Nutch Crawls Again and again

Posted by Tejas Patil <te...@gmail.com>.

I should not have brought protocol status into this. The two codes are
unrelated:
ProtocolStatusCode = 2 means FAILED (see [0])
CrawlStatus = 2 means FETCHED (see [1])
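
To make the distinction concrete, here is a minimal sketch of how the same
integer reads differently depending on which set of codes it belongs to. Only
the two values quoted above (2 = FAILED for protocol status, 2 = FETCHED for
crawl status) come from this thread; the class StatusNamespaces, the Kind
enum, and the describe method are hypothetical names, and the real constants
live in the files linked as [0] and [1].

```java
/** Illustrates two unrelated code sets that happen to share the value 2. */
final class StatusNamespaces {
    // Values as quoted in this thread; the constant names are illustrative only.
    static final int PROTOCOL_FAILED = 2;
    static final int CRAWL_FETCHED = 2;

    enum Kind { PROTOCOL, CRAWL }

    /** Interprets a numeric code according to the code set it comes from. */
    static String describe(Kind kind, int code) {
        switch (kind) {
            case PROTOCOL:
                return code == PROTOCOL_FAILED ? "FAILED" : "other protocol status " + code;
            case CRAWL:
                return code == CRAWL_FETCHED ? "FETCHED" : "other crawl status " + code;
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    public static void main(String[] args) {
        // The same number, read against two different code sets.
        System.out.println("protocol status 2 -> " + describe(Kind.PROTOCOL, 2));
        System.out.println("crawl status 2    -> " + describe(Kind.CRAWL, 2));
    }
}
```

So a "2" seen in the database only tells you something once you know which
field it was read from.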

Get the webdb dump and share it; that will make your problem clearer.

[0] :
http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java

[1] :
http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/crawl/CrawlStatus.java

On Sat, May 4, 2013 at 12:20 PM, raviksingh <ra...@gmail.com> wrote:

> Hi,
>    This link http://nlp.solutions.asia/?p=232 says that "2" means
> "fetched".
> Is this wrong?

Re: Nutch Crawls Again and again

Posted by raviksingh <ra...@gmail.com>.

Hi, 
   This link http://nlp.solutions.asia/?p=232 says that "2" means "fetched".
Is this wrong?

Re: Nutch Crawls Again and again

Posted by Tejas Patil <te...@gmail.com>.

Get a crawldb dump [0] and check the status of the URL.

[0] : http://wiki.apache.org/nutch/bin/nutch_readdb


On Sat, May 4, 2013 at 12:17 PM, raviksingh <ra...@gmail.com> wrote:

> Hi,
>
> This is log.
> http://pastebin.com/iYNQq5gi
>
> It does not show any error.

Re: Nutch Crawls Again and again

Posted by raviksingh <ra...@gmail.com>.

Hi,

This is the log:
http://pastebin.com/iYNQq5gi

It does not show any error.

Re: Nutch Crawls Again and again

Posted by Tejas Patil <te...@gmail.com>.

My guess is that those URLs were not fetched successfully, so they are
being retried in every round of the crawl.


On Sat, May 4, 2013 at 11:55 AM, raviksingh <ra...@gmail.com> wrote:

> Hi,
>     I have written a Java program that calls the "crawl" command. It fetches
> pages and stores the results in MySQL. However, when it is called again, the
> same URLs are fetched again and again, which certainly slows the process. The
> status for many URLs is now "2", yet they still get fetched every time. What
> could be the problem? Please help.
>
> Regards
> Ravi Singh