Posted to user@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/07/13 20:12:31 UTC
Recrawling and Merging
Anuradha brought this up on nutch-dev and I also have a lot of questions regarding recrawling and merging. Unfortunately, many of these questions are not even clearly formulated yet.
I have been working on a new blog. I only have two posts on there so far but this one:
http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/
is about recrawling and merging.
There are a couple things I'm trying to accomplish:
How do I control crawling? How can I set up crawl jobs so that I know how long they will take and can throttle or stop them if necessary (which I think is not supported)? Should I run many separate crawls and merge them?
One clear question to ask is: how do I manage web crawls? Are the few examples in the FAQ all there is, or do we have more fine-grained control?
Another thing that confuses me is NUTCH-230. Do I understand correctly that recrawls carry some kind of scoring penalty, so that older pages get a higher score? There is a whole conversation about "cash value" and "inflation" here:
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12695.html
Please advise.
----- Forwarded Message ----
From: anuradha (JIRA) <ji...@apache.org>
To: nutch-dev@lucene.apache.org
Sent: Thursday, July 12, 2007 4:40:04 AM
Subject: [jira] Created: (NUTCH-511) Recrawling
Recrawling
-----------
Key: NUTCH-511
URL: https://issues.apache.org/jira/browse/NUTCH-511
Project: Nutch
Issue Type: Wish
Affects Versions: 0.9.0
Reporter: anuradha
Hi,
First I crawled one website.
Then I added one page to the crawled site and recrawled the same website.
I copied the recrawl code from "http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03".
But I didn't get any results from the newly added page.
I am using Nutch 0.9.0 and jvm/java-1.5.0-sun.
Please guide me on how to recrawl the site.
Thanks in advance,
Regards,
Anuradha
Re: Recrawling and Merging
Posted by John Reidy <jo...@reidy.com>.
Thanks, this looks like a great resource. I have posted on the blog and
will add to it when I have more information.
As I posted to this list earlier, I am interested in the
incremental merge scenario, where I add new URLs and want to merge them
with the main index on an hourly/daily basis. I believe that v0.71 did
what I wanted, and I will look to see how v0.9 can do the same.
Regards
John Reidy.
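
For readers following the incremental merge scenario above, here is a rough sketch of what one recrawl-and-merge round might look like with the Nutch 0.9 command-line tools (inject, generate, fetch, updatedb, invertlinks, index, dedup, merge). It is not the script from the IntranetRecrawl wiki page, and it only builds the command lines without running them. The function name, the directory layout (crawl/crawldb, crawl/linkdb, crawl/segments, crawl/indexes), the "indexes-<segment>" and "index-merged" names, and the default values are all assumptions made for illustration. In a real run the segment name is chosen by generate, so it would have to be looked up after that step rather than passed in up front.

```python
from typing import List


def recrawl_commands(crawl_dir: str, new_segment: str, url_dir: str,
                     top_n: int = 1000, add_days: int = 31,
                     nutch: str = "bin/nutch") -> List[List[str]]:
    """Build the command lines for one recrawl-and-merge round, in order.

    Nothing is executed. All paths below are a hypothetical layout;
    adjust them to match the actual crawl directory.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments = f"{crawl_dir}/segments"
    # Assumption: the caller already knows the segment name. In practice
    # generate creates it, and the newest directory must be looked up.
    segment = f"{segments}/{new_segment}"
    old_indexes = f"{crawl_dir}/indexes"
    new_indexes = f"{crawl_dir}/indexes-{new_segment}"
    merged_index = f"{crawl_dir}/index-merged"

    return [
        # Add any new seed URLs to the existing crawl database.
        [nutch, "inject", crawldb, url_dir],
        # Select pages to fetch; -adddays shifts the clock forward so that
        # pages not yet due for refetch are also selected (add_days is
        # assumed to exceed the configured refetch interval).
        [nutch, "generate", crawldb, segments,
         "-topN", str(top_n), "-adddays", str(add_days)],
        [nutch, "fetch", segment],
        # Fold fetch results and newly discovered links into the crawldb.
        [nutch, "updatedb", crawldb, segment],
        [nutch, "invertlinks", linkdb, segment],
        # Index only the new segment, then dedup across old and new indexes.
        [nutch, "index", new_indexes, crawldb, linkdb, segment],
        [nutch, "dedup", old_indexes, new_indexes],
        # Merge everything into a fresh index directory.
        [nutch, "merge", merged_index, old_indexes, new_indexes],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in recrawl_commands("crawl", "20070713201231", "urls"):
        print(" ".join(cmd))
```

Each list could be handed to subprocess.run once the paths are confirmed. A page added to an already-crawled site is only discovered when a page linking to it is refetched, which is why the generate step here passes -adddays. More than one generate/fetch/updatedb pass may be needed before the new page is itself fetched and indexed.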