Posted to user@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/07/13 20:12:31 UTC

Recrawling and Merging

Anuradha brought this up on nutch-dev and I also have a lot of questions regarding recrawling and merging.  Unfortunately, many of these questions are not even clearly formulated yet.  

I have started a new blog. It only has two posts so far, but this one is about recrawling and merging:
http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/

There are a couple of things I'm trying to accomplish:

How do I control the crawling? How can I set up crawl jobs so that I know how long they will take and can throttle or stop them if necessary (not supported, I think)? Should I run many separate crawls and merge them?
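
On the "how long will it take" part, one rough way to reason about it is a back-of-envelope estimate for a single fetch round. This is only a sketch under stated assumptions, not how Nutch itself schedules fetches: it assumes different hosts are fetched in parallel while requests to the same host are serialized with a fixed politeness delay. The function name and both default values are hypothetical; the delay stands in for whatever per-host delay your crawler is configured with (in Nutch I believe this is the fetcher.server.delay property, but check your own configuration).

```python
from collections import Counter
from urllib.parse import urlparse


def estimate_fetch_seconds(urls, delay_per_host=5.0, seconds_per_page=0.5):
    """Rough lower bound on the wall-clock time of one fetch round.

    Assumption (hypothetical model): hosts are fetched in parallel, but
    requests to the same host are serialized with a fixed politeness delay.
    """
    # Count how many URLs in this round belong to each host.
    pages_per_host = Counter(urlparse(u).netloc for u in urls)
    if not pages_per_host:
        return 0.0
    # The busiest host dominates: each of its pages costs one download
    # plus one politeness delay.
    busiest = max(pages_per_host.values())
    return busiest * (delay_per_host + seconds_per_page)
```

Under this model, capping the number of URLs per round (and per host) is what bounds the duration of a round, which is one argument for several smaller crawl rounds rather than one large one.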

One clear question to ask is: how do I manage web crawls? Are the few examples in the FAQ all there is, or do we have more fine-grained control?

Another thing I'm struggling to make sense of is NUTCH-230.  Do I understand correctly that recrawls carry some kind of scoring penalty, in that older pages get a higher score?  There is a whole conversation about "cash value" and "inflation" here:
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12695.html

Please advise.



----- Forwarded Message ----
From: anuradha (JIRA) <ji...@apache.org>
To: nutch-dev@lucene.apache.org
Sent: Thursday, July 12, 2007 4:40:04 AM
Subject: [jira] Created: (NUTCH-511) Recrawling

Recrawling 
-----------

                 Key: NUTCH-511
                 URL: https://issues.apache.org/jira/browse/NUTCH-511
             Project: Nutch
          Issue Type: Wish
    Affects Versions: 0.9.0
            Reporter: anuradha


Hi,

First I crawled one website.
I then added one page to the crawled site and recrawled the same website.
I copied the recrawl script from "http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03".

But the newly added page did not show up in the results.

I am using nutch 0.9.0 and jvm/java-1.5.0-sun

Please guide me on how to recrawl the site.
Thanks in advance,

Regards,
Anuradha









       

Re: Recrawling and Merging

Posted by John Reidy <jo...@reidy.com>.

Thanks, this looks like a great resource. I have posted on the blog and 
will add to it when I have more information.
As I mentioned on this list earlier, I am interested in the 
incremental merge scenario, where I add new URLs and want to merge them 
into the main index on an hourly or daily basis. I believe that v0.71 did 
what I wanted, and I will look into how v0.9 can do the same.
Regards
John Reidy.