Posted to user@nutch.apache.org by Raghavendra Prabhu <rr...@gmail.com> on 2006/02/08 14:39:24 UTC

deleting old segments

Hi

Should we manually delete the old segments in Nutch?

For example, I indexed a site on a particular day, and one week later I
index the updated content.

Is there a way I can delete the redundant old URL contents in the old
segments?

How can we do this?

Rgds
Prabhu

Re: deleting old segments

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi

Thanks for the info.

My situation is that my list of sites changes every now and then.

So the first time I index sites one and two.

The next time I index sites two and three.

So there will be new data (and if I use your script, both one and two will
be deleted).

But I still want to keep site one.

So what I would like to do is remove some segments based on the URL of the
page alone.

The information you gave was also useful, but I want to do the above.


Rgds
Prabhu

Re: deleting old segments

Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
Hi Prabhu,

Below is the script we use for deleting old segments.

Regards,
Glenn


#!/bin/sh
# Remove old dirs from segments dir
# PERIOD is threshold (in days) for old dirs
#
# Created by Keren Yu Jan 31, 2006

NUTCH_DIR=/home/glenn/nutch
PERIOD=30

# cutoff date as YYYYMMDD (note: date -d is a GNU extension)
threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`

for dir in "$NUTCH_DIR"/segments/2*
do
  # skip if the glob matched nothing or the entry is not a dir
  [ -d "$dir" ] || continue
  # get date (leading 8 digits) from dir name
  mydate=`basename "$dir" | sed -n 's/^\([0-9]\{8\}\).*/\1/p'`
  # skip names that do not start with a date
  [ -n "$mydate" ] || continue
  # remove dirs which are older than PERIOD
  if [ "$threshold_date" -ge "$mydate" ]
  then
    rm -fr "$dir"
  fi
done
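
Since the script removes directories with rm -fr, it may be worth doing a dry run first. The following is only a sketch of the same date comparison that prints the matching directories instead of deleting them. The function name list_old_segments and its two arguments are made up for illustration and are not part of Nutch, and it assumes each segment dir name starts with a date in YYYYMMDD form.

```shell
#!/bin/sh
# Dry run: print segment dirs older than a cutoff instead of deleting them.
# list_old_segments is a hypothetical helper, not part of Nutch.
# Usage: list_old_segments SEGMENTS_DIR CUTOFF_YYYYMMDD
list_old_segments() {
  segments_dir=$1
  cutoff=$2
  for dir in "$segments_dir"/*
  do
    # skip plain files and an unmatched glob
    [ -d "$dir" ] || continue
    # the leading 8 digits of the dir name are taken as its date (assumption)
    seg_date=`basename "$dir" | sed -n 's/^\([0-9]\{8\}\).*/\1/p'`
    # skip names that do not start with 8 digits
    [ -n "$seg_date" ] || continue
    # same test as the removal script: cutoff >= segment date means "old"
    if [ "$cutoff" -ge "$seg_date" ]
    then
      echo "$dir"
    fi
  done
}
```

With GNU date it could be called as list_old_segments /home/glenn/nutch/segments `date -d "30 days ago" +%Y%m%d`, and the printed list checked by eye before running the real script.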

