Posted to user@nutch.apache.org by atawfik <co...@gmail.com> on 2014/08/10 00:32:10 UTC
how to get the depth of url in nutch
I am trying to crawl and index URLs based on their depth levels. In my
scenario, I am interested in two content types: HTML and images. For images,
I need to index any image-based URL regardless of its depth. However, for
HTML content, I only need to index pages that are provided via my seed
list (depth 1).
I am thinking of writing a custom indexFilter plugin that returns an empty
document if the parsed content meets the condition above.
However, I do not know how to get the depth of a URL. So I looked into the
scoring-depth plugin, and it seems I can get the depth using:
String depthString = parseData.getMeta(DEPTH_KEY);
Can I do that, or is there a better way?
Thanks in advance
--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-get-the-depth-of-url-in-nutch-tp4152122.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: how to get the depth of url in nutch
Posted by atawfik <co...@gmail.com>.
Thanks Sebastian,
Using scoring-depth is indeed the way to go. I figured this out after
enabling it in the Nutch configuration. Once I enabled it, I was able to get
the depth in the indexFilter.
Regards
Ameer
--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-get-the-depth-of-url-in-nutch-tp4152122p4152346.html
Re: how to get the depth of url in nutch
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
> I am thinking of writing a custom indexFilter plugin that returns an empty
> document if the parsed content meets the condition above.
If the filter returns null, the document is skipped from indexing.
> However, I do not know how to get the depth of a Url. So, I looked into the
> scoring-depth plugin and it seems I can get the depth using :
>
> String depthString = parseData.getMeta(DEPTH_KEY);
>
> Can I do that or there is a better way?
The simplest solution would be to let scoring-depth track the depth,
but keep scoring.depth.max high enough that the crawl isn't stopped.
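For illustration, the rule discussed here (index images at any depth, index HTML only at seed depth) could be sketched roughly as below. This is only a sketch under stated assumptions: the class DepthIndexRule, its methods, and the SEED_DEPTH constant are hypothetical names, not part of Nutch, and the code deliberately uses no Nutch classes so it can be compiled on its own. Treating application/xhtml+xml as HTML and skipping unknown content types are also assumptions. In a real indexing filter, the depth string would come from the metadata written by scoring-depth, and the filter would return null whenever shouldIndex(...) is false.

```java
import java.util.Locale;

/**
 * Sketch of the depth-based indexing rule from this thread, kept free of
 * Nutch classes so it can be compiled and tried on its own.
 * All names here are hypothetical.
 */
public class DepthIndexRule {

    /** Assumed depth of seed URLs (the thread calls this "depth 1"). */
    public static final int SEED_DEPTH = 1;

    /**
     * Decides whether a document should be indexed.
     *
     * @param contentType MIME type of the fetched content, e.g. "text/html"; may be null
     * @param depthString depth value as read from metadata; may be null or malformed
     */
    public static boolean shouldIndex(String contentType, String depthString) {
        if (contentType == null) {
            return false; // assumption: unknown content types are not indexed
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);

        if (type.startsWith("image/")) {
            return true; // images are indexed regardless of depth
        }
        if (type.equals("text/html") || type.equals("application/xhtml+xml")) {
            // HTML is indexed only when it comes straight from the seed list.
            return parseDepth(depthString) == SEED_DEPTH;
        }
        return false; // assumption: every other content type is skipped
    }

    /** Returns the parsed depth, or -1 when the value is missing or not a number. */
    public static int parseDepth(String depthString) {
        if (depthString == null) {
            return -1;
        }
        try {
            return Integer.parseInt(depthString.trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("text/html; charset=UTF-8", "1")); // seed page
        System.out.println(shouldIndex("text/html", "3"));                // deeper page
        System.out.println(shouldIndex("image/png", "5"));                // image
    }
}
```

Keeping the decision in a plain static method makes it easy to test without a running crawl; a missing or malformed depth is treated as "not a seed", so HTML pages without depth metadata are skipped rather than indexed by accident.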
Sebastian