Posted to user@nutch.apache.org by Alexandre <al...@gmail.com> on 2012/09/17 16:06:31 UTC

Absolute depth for recrawling

Hey all!

For the past few days we have been playing around a bit with Nutch. Today we
ran into the following issue.

Our very simple test "URL structure" looks like this:
index.html  ->  1.1.html  ->  1.1.1.html  ->  1.1.1.1.html  ->  1.1.1.1.1.html

We start a crawl on index.html (the only page in the seed list) with a depth
of 3. In this case the first three pages (index.html, 1.1.html and
1.1.1.html) are crawled and indexed, which is absolutely fine. Now we start a
second crawl (recrawl) with the same depth and crawl db, and this time all of
the pages (including 1.1.1.1.html and 1.1.1.1.1.html) are crawled. Nutch
seems to also use the pages indexed in the first crawl (like 1.1.1.html) as
starting points for crawling.

In our case we'd like to force Nutch to always just crawl stuff within a
depth of 3 from the real seed page, which is index.html in this case. Is
there any possible way to do this?

We have already tried the '-noAdditions' option of 'updatedb', as mentioned
in the wiki (http://wiki.apache.org/nutch/IntranetRecrawl), but with it only
the first URL (index.html) is crawled. We are also afraid that new URLs (for
example, if we now add 1.2.html as a link on index.html) would not be crawled
either.

Thanks a lot in advance!





--
View this message in context: http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-tp4008320.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Absolute depth for recrawling

Posted by Alexandre <al...@gmail.com>.
Sorry, I meant a new thread in this forum, not a Jira ticket.
See: 
http://lucene.472066.n3.nabble.com/Recrawling-and-segment-cleanup-td4008865.html






Re: Absolute depth for recrawling

Posted by Julien Nioche <li...@gmail.com>.
Ask on the user list before opening a new JIRA issue; it is not necessarily
a bug.
Thanks!

On 19 September 2012 11:55, Alexandre <al...@gmail.com> wrote:

> Salut Julien,
>
> Thanks for your reply.
> This Plugin: https://issues.apache.org/jira/browse/NUTCH-1331 is exactly
> what I needed.
>
> I've tested it and it's working very well.
>
> But I still have an issue, or a misunderstanding, with the generated
> segments and recrawling.
> I will start a new thread for that.
>
>
>
> Alex.



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Absolute depth for recrawling

Posted by Alexandre <al...@gmail.com>.
Salut Julien,

Thanks for your reply. 
This Plugin: https://issues.apache.org/jira/browse/NUTCH-1331 is exactly
what I needed.

I've tested it and it's working very well. 

But I still have an issue, or a misunderstanding, with the generated
segments and recrawling.
I will start a new thread for that.



Alex.




Re: Absolute depth for recrawling

Posted by Julien Nioche <li...@gmail.com>.
Salut Alexandre,

The use of the term 'depth' in the crawl tool is very misleading. What it
means is the number of rounds of generate/fetch/parse/update; it has nothing
to do with the actual logical depth from a start seed.
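
To make the difference concrete, here is a toy Python simulation of the
rounds. It is only a sketch and not Nutch code: the crawl() function and the
dict-based crawldb are made up for illustration, and it ignores fetch
intervals and refetching. It uses the link chain from the original question
and shows why a second run of 3 rounds against the same crawl db reaches the
deeper pages.

```python
# Toy link graph from the original question: each page links to the next one.
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}


def crawl(crawldb, seeds, rounds):
    """Run `rounds` generate/fetch/parse/update cycles on a shared crawldb.

    crawldb is a hypothetical stand-in for the Nutch crawl db: it maps
    url -> True if the page was fetched, False if it was only discovered.
    Returns the set of URLs fetched so far.
    """
    for url in seeds:                       # inject
        crawldb.setdefault(url, False)
    for _ in range(rounds):
        # generate: everything known but not yet fetched
        batch = [u for u, fetched in crawldb.items() if not fetched]
        for url in batch:
            crawldb[url] = True             # fetch
            for out in LINKS[url]:          # parse + update
                crawldb.setdefault(out, False)
    return {u for u, fetched in crawldb.items() if fetched}


crawldb = {}
first = crawl(crawldb, ["index.html"], rounds=3)
second = crawl(crawldb, ["index.html"], rounds=3)
print(sorted(first))
print(sorted(second))
```

The second call starts from whatever the first one left unfetched in the
crawl db (1.1.1.1.html here), not from the seed, so the number of rounds says
nothing about the distance from index.html.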

You can limit the depth of a crawl using the patch from
https://issues.apache.org/jira/browse/NUTCH-1331.
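
The general idea of limiting by logical depth can be sketched in Python as
below. This is an illustration of the concept only, not the code from the
patch: crawl_with_depth(), MAX_DEPTH and the crawldb layout are hypothetical
names, and the assumptions are that seeds start at depth 1, each outlink gets
its parent's depth plus one, and outlinks of pages at the maximum depth are
dropped.

```python
# Toy link graph from the original question: each page links to the next one.
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}

MAX_DEPTH = 3  # hypothetical limit, counted from the seed (seed = depth 1)


def crawl_with_depth(crawldb, seeds, rounds, max_depth=MAX_DEPTH):
    """Like a plain round-based crawl, but each entry remembers its depth.

    crawldb maps url -> {"depth": int, "fetched": bool} (illustrative layout).
    Returns the set of URLs fetched so far.
    """
    for url in seeds:                                   # inject at depth 1
        crawldb.setdefault(url, {"depth": 1, "fetched": False})
    for _ in range(rounds):
        batch = [u for u, e in crawldb.items() if not e["fetched"]]
        for url in batch:
            entry = crawldb[url]
            entry["fetched"] = True
            if entry["depth"] >= max_depth:
                continue                                # drop the outlinks
            for out in LINKS[url]:
                child_depth = entry["depth"] + 1
                child = crawldb.setdefault(
                    out, {"depth": child_depth, "fetched": False}
                )
                # keep the shortest known distance from a seed
                child["depth"] = min(child["depth"], child_depth)
    return {u for u, e in crawldb.items() if e["fetched"]}


crawldb = {}
first = crawl_with_depth(crawldb, ["index.html"], rounds=3)
second = crawl_with_depth(crawldb, ["index.html"], rounds=3)
print(sorted(first))
print(sorted(second))
```

Because the depth travels with each crawl db entry, running more rounds (or a
second crawl) never reaches 1.1.1.1.html in this sketch.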

BTW I'd use the new script in the SVN trunk instead of the all-in-one crawl
command, as it gives more control and a better understanding of what happens.

HTH

Julien
