Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2009/12/04 23:18:23 UTC

How to drop page content at fetch stage?

Hi guys,

I'm looking at whether I can reduce the disk space occupied by my segments.
I have implemented a topical-scoring plugin, which means I know at that
step whether I should keep a page's content or not.
Is there a way to drop some pages' content after parsing, while of course
keeping the links, because I want to follow the graph?

PS: Prune is not an option for me because it only cleans up the indexes, not
the segments, and my indexer already does that clean-up very well.

-- 
-MilleBii-

Re: How to drop page content at fetch stage?

Posted by MilleBii <mi...@gmail.com>.
Thanks, that's a bit too complex for me right now; I don't yet fully
understand the MapReduce technique.
But I'll keep the idea for a future development.

2009/12/4 Dennis Kubes <ku...@apache.org>

> Sorry, segments, not indexes.
>
>
> Dennis Kubes wrote:
>
>> You would need to write a custom MapReduce job to run through the indexes
>> and keep only the ones identified by your plugin.  Be sure to update the
>> CrawlDb with the extracted URLs before you drop the content from the
>> segments.
>>
>> Dennis
>>


-- 
-MilleBii-

Re: How to drop page content at fetch stage?

Posted by Dennis Kubes <ku...@apache.org>.
Sorry, segments, not indexes.

Dennis Kubes wrote:
> You would need to write a custom MapReduce job to run through the 
> indexes and keep only the ones identified by your plugin.  Be sure to 
> update the CrawlDb with the extracted URLs before you drop the content 
> from the segments.
> 
> Dennis
> 

Re: How to drop page content at fetch stage?

Posted by Dennis Kubes <ku...@apache.org>.
You would need to write a custom MapReduce job to run through the 
indexes and keep only the ones identified by your plugin.  Be sure to 
update the CrawlDb with the extracted URLs before you drop the content 
from the segments.

Dennis
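A minimal sketch of the keep/drop decision such a job would make, written as
plain Java rather than an actual Hadoop MapReduce job. PageRecord, topicScore,
and THRESHOLD are hypothetical stand-ins for the plugin's scoring output, not
real Nutch segment structures; the point is only that outlinks are always
preserved while off-topic content is blanked.

```java
import java.util.List;

// Hypothetical stand-in for one parsed segment entry; field names are
// illustrative, not Nutch's actual segment data structures.
record PageRecord(String url, String content, List<String> outlinks, float topicScore) {}

class SegmentFilter {
    // Hypothetical score threshold below which page content is dropped.
    static final float THRESHOLD = 0.5f;

    // Map-side keep/drop decision: outlinks always survive (so the crawl
    // graph can still be followed); content is blanked for off-topic pages.
    static PageRecord filter(PageRecord in) {
        String content = in.topicScore() >= THRESHOLD ? in.content() : "";
        return new PageRecord(in.url(), content, in.outlinks(), in.topicScore());
    }

    public static void main(String[] args) {
        PageRecord onTopic  = new PageRecord("http://a", "body A", List.of("http://b"), 0.9f);
        PageRecord offTopic = new PageRecord("http://b", "body B", List.of("http://c"), 0.1f);
        System.out.println(filter(onTopic).content());            // prints "body A"
        System.out.println(filter(offTopic).content().isEmpty()); // prints "true"
        System.out.println(filter(offTopic).outlinks());          // prints "[http://c]"
    }
}
```

In a real job the same predicate would run in the map phase over the segment's
parsed data, with the CrawlDb update happening first, as described above.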
