Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/02/08 19:32:34 UTC

[Fwd: Re: deleting old segments]

My understanding was that every 30 days sites are recrawled. So if
site #1 was indexed 30 days ago, it would be recrawled and re-indexed
in a new segment with today's crawl. That leaves you with site #1 in
two segments - the current one and the 30-day-old segment. Nutch then
only uses the current segment with the most recent crawl of site #1 and
ignores the 30-day-old segment, leaving it safe to delete using that
script.

That's my understanding of how it works, I stand to be corrected by the
experts.
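The age test used in the script quoted further down this message can be tried in isolation. A minimal sketch, using a made-up segment path (segment dirs are assumed to be named with a YYYYMMDD... timestamp, and GNU date is assumed, as in the script itself):

```shell
# Made-up segment path, named with a YYYYMMDDhhmmss timestamp
dir=/home/glenn/nutch/segments/20060208193234

# pull the 8-digit date out of the dir name, as the script does
mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
echo $mydate    # -> 20060208

# compare against a 30-day threshold (GNU date, as in the script)
threshold_date=`date -d "30 days ago" +%Y%m%d`
if [ $threshold_date -ge $mydate ]
then
  echo "older than 30 days: would be deleted"
else
  echo "within 30 days: kept"
fi
```

The comparison works because zero-padded YYYYMMDD strings sort the same way numerically as they do chronologically.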

Raghavendra Prabhu wrote:

> Hi
>  
> Thanks for the info
>  
> My issue is that my list of sites changes from time to time
>  
> So the first time I index sites one and two
>  
> The next time I index sites two and three
>  
> So there will be new data (so if I use your script, both one and two will 
> be deleted)
>  
> But I still want to have one preserved
>  
> So the thing is I would like to remove some segments based on the URL of the 
> page alone
>  
> The information you gave was also useful, but I want to do the above
>  
>  
> Rgds
> Prabhu
> On 2/8/06, *Insurance Squared Inc.* <gcooke@insurancesquared.com 
> <ma...@insurancesquared.com>> wrote:
>
>     Hi Prabhu,
>
>     Below is the script we use for deleting old segments.
>
>     Regards,
>     Glenn
>
>
>     #!/bin/sh
>     # Remove old dirs from segments dir
>     # PERIOD is threshold for old dirs
>     #
>     # Created by Keren Yu Jan 31, 2006
>
>     NUTCH_DIR=/home/glenn/nutch
>     PERIOD=30
>
>     # put dirs which are older than PERIOD into dates.tmp
>     ls -d $NUTCH_DIR/segments/2* > $NUTCH_DIR/dirs.tmp
>     threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`  # GNU date
>     : > $NUTCH_DIR/dates.tmp  # start empty so grep -f below always has a file
>     count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
>     if [ $count -gt 0 ];
>     then
>       for dir in `cat $NUTCH_DIR/dirs.tmp`
>       do
>         # get the YYYYMMDD date from the dir name
>         mydate=`echo $dir | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
>         if [ $threshold_date -ge $mydate ];
>         then
>           echo $mydate >> $NUTCH_DIR/dates.tmp
>         fi
>       done
>     fi
>
>     # remove dirs which are older than PERIOD
>     ls -d $NUTCH_DIR/segments/2* | grep -f $NUTCH_DIR/dates.tmp > $NUTCH_DIR/dirs.tmp
>     count=`cat $NUTCH_DIR/dirs.tmp | wc -l`
>     if [ $count -gt 0 ];
>     then
>       for dir in `cat $NUTCH_DIR/dirs.tmp`
>       do
>         rm -fr $dir
>       done
>     fi
>
>     rm -f $NUTCH_DIR/dates.tmp $NUTCH_DIR/dirs.tmp
>
>
>     Raghavendra Prabhu wrote:
>
>     >Hi
>     >
>     >Should we manually delete the old segments in Nutch?
>     >
>     >For example, I have indexed a site on a particular day
>     >
>     >and one week after that I index the updated content
>     >
>     >Is there a way I can delete the redundant old URL contents in the
>     >old segments?
>     >
>     >How can we do this?
>     >
>     >Rgds
>     >Prabhu
>     >
>     >
>     >
>
>

Re: [Fwd: Re: deleting old segments]

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi

What you mentioned is correct, but I do it in a different way

That is exactly the part I want to avoid

I run a fresh crawl each time
In the first crawl, I indexed one and two (creating a new webdb)

and copied the segments to the searcher directory

In the second crawl, two and three (so the webdb is recreated)
and copied these segments to the searcher directory

Now two is redundant

I want to remove two from the old segments (keeping one intact)

Rgds
Prabhu
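One low-tech way to get what Prabhu describes (deleting one specific crawl's segment rather than everything older than N days) might be to record which segment directory each crawl produced, then delete by name later. A hypothetical sketch - the crawl labels and the segments layout are made up, and a temp dir stands in for the real Nutch dir:

```shell
# Hypothetical sketch: track which crawl produced which segment, so a
# specific segment can be deleted by name later. A temp dir stands in
# for the real NUTCH_DIR; the crawl labels and timestamps are made up.
NUTCH_DIR=`mktemp -d`
mkdir -p $NUTCH_DIR/segments/20060208120000 $NUTCH_DIR/segments/20060209120000

# at crawl time, log the segment each crawl produced
echo "crawl-one-two $NUTCH_DIR/segments/20060208120000" >> $NUTCH_DIR/segment_log
echo "crawl-two-three $NUTCH_DIR/segments/20060209120000" >> $NUTCH_DIR/segment_log

# later, delete only the segment belonging to the crawl we no longer want
victim=`grep "^crawl-two-three " $NUTCH_DIR/segment_log | cut -d' ' -f2`
rm -fr "$victim"

ls $NUTCH_DIR/segments    # -> 20060208120000
```

This sidesteps parsing segment contents by URL entirely; whether it fits depends on being able to identify the newly created segment right after each crawl (e.g. the newest date-stamped dir under segments/).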
