Posted to user@nutch.apache.org by atawfik <co...@gmail.com> on 2014/08/10 00:32:10 UTC

how to get the depth of url in nutch

I am trying to crawl and index URLs based on their depth levels. In my
scenario, I am interested in two content types: HTML and images. For images,
I need to index any image URL regardless of its depth. However, for
HTML content, I only need to index pages that are provided via my seed
list (depth 1).

I am thinking of writing a custom indexFilter plugin that returns an empty
document if the parsed content meets the condition above.

However, I do not know how to get the depth of a URL. So, I looked into the
scoring-depth plugin, and it seems I can get the depth using:

String depthString = parseData.getMeta(DEPTH_KEY);

Can I do that, or is there a better way?

Thanks in advance




--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-get-the-depth-of-url-in-nutch-tp4152122.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: how to get the depth of url in nutch

Posted by atawfik <co...@gmail.com>.
Thanks Sebastian,

Using scoring-depth is indeed the way to go. I figured this out after
enabling it in the Nutch configuration. Once it was enabled, I was able to
get the depth in the indexFilter.

Regards
Ameer




Re: how to get the depth of url in nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> I am thinking of writing a custom indexFilter plugin that returns an empty
> document if the parsed content meets the condition above.

If the filter returns null, the document is skipped and not indexed.
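
To illustrate the null convention, here is a minimal sketch of how a caller
might run a chain of filters and stop as soon as one of them returns null.
It uses plain Java stand-ins (a Map for the document, UnaryOperator for a
filter), not Nutch's own classes; FilterChainSketch and applyAll are
hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a chain of indexing filters; not Nutch code.
final class FilterChainSketch {

    // Each filter receives the document (modelled here as a field map) and
    // returns it, possibly modified, or null to drop it from indexing.
    static Map<String, String> applyAll(
            List<UnaryOperator<Map<String, String>>> filters,
            Map<String, String> doc) {
        for (UnaryOperator<Map<String, String>> filter : filters) {
            doc = filter.apply(doc);
            if (doc == null) {
                // A filter rejected the document: skip the remaining filters
                // and signal "do not index" to the caller.
                return null;
            }
        }
        return doc;
    }
}
```

A caller would then index the document only when applyAll(...) returns a
non-null value.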

> However, I do not know how to get the depth of a Url. So, I looked into the
> scoring-depth plugin and it seems I can get the depth using :
>
> String depthString = parseData.getMeta(DEPTH_KEY);
>
> Can I do that or there is a better way?

The simplest solution would be to let scoring-depth track the depth,
but keep scoring.depth.max high enough that the crawl isn't stopped.
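
As a rough sketch of the check discussed in this thread (not code from Nutch
or from the poster's plugin), the keep-or-skip decision could look like the
following. DepthGate, shouldIndex and SEED_DEPTH are hypothetical names. The
depth string is assumed to be whatever parseData.getMeta(DEPTH_KEY) returned,
and treating a missing depth or any other content type as "skip" is an
assumption, not something stated in the thread.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: only the keep/skip decision.
final class DepthGate {

    // Assumption taken from the question: seed URLs are at depth 1.
    static final int SEED_DEPTH = 1;

    // Returns true if a document with this content type and depth string
    // should be indexed.
    static boolean shouldIndex(String contentType, String depthString) {
        if (contentType == null) {
            return false; // assumption: unknown type is skipped
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        if (type.startsWith("image/")) {
            return true; // images are indexed regardless of depth
        }
        if (!type.startsWith("text/html")) {
            return false; // assumption: only HTML and images are wanted
        }
        if (depthString == null) {
            return false; // assumption: no depth recorded means skip
        }
        try {
            // HTML is indexed only when it came from the seed list.
            return Integer.parseInt(depthString.trim()) == SEED_DEPTH;
        } catch (NumberFormatException e) {
            return false; // unparsable depth is treated as "skip"
        }
    }
}
```

Inside the indexFilter, returning null whenever shouldIndex(...) is false
would then skip the document, as noted above.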

Sebastian
