Posted to user@nutch.apache.org by Raghavendra Prabhu <rr...@gmail.com> on 2006/02/08 14:39:24 UTC
deleting old segments
Hi
Should we manually delete the old segments in Nutch?
For example, I indexed a site on a particular day,
and one week later I indexed the updated content.
Is there a way I can delete the redundant old URL contents in the old
segments?
How can we do this?
Rgds
Prabhu
Re: deleting old segments
Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi
Thanks for the info.
My situation is that my list of sites changes every now and then.
The first time, I index sites one and two.
The next time, I index sites two and three.
So there will be new data (and if I use your script, both one and two will be
deleted),
but I still want to keep site one preserved.
So what I would like is to remove segments based on the URL of the page
alone.
The information you gave was also useful, but I want to do the above.
Rgds
Prabhu
Re: deleting old segments
Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
Hi Prabhu,
Below is the script we use for deleting old segments.
Regards,
Glenn
#!/bin/sh
# Remove old dirs from segments dir
# PERIOD is threshold (in days) for old dirs
#
# Created by Keren Yu Jan 31, 2006

NUTCH_DIR=/home/glenn/nutch
PERIOD=30

# start with an empty dates.tmp so that grep -f below always has a file to read
: > "$NUTCH_DIR/dates.tmp"

# put dates of dirs which are older than PERIOD into dates.tmp
# (date -d "N days ago" needs GNU date)
ls -d "$NUTCH_DIR"/segments/2* > "$NUTCH_DIR/dirs.tmp" 2>/dev/null
threshold_date=`date -d "$PERIOD days ago" +%Y%m%d`
count=`cat "$NUTCH_DIR/dirs.tmp" | wc -l`
if [ $count -gt 0 ];
then
    for dir in `cat "$NUTCH_DIR/dirs.tmp"`
    do
        # get date (first 8 digits of the segment name) from dir
        mydate=`echo "$dir" | sed "s/.*\/\([0-9]\{8\}\).*/\1/"`
        if [ "$threshold_date" -ge "$mydate" ];
        then
            echo "$mydate" >> "$NUTCH_DIR/dates.tmp"
        fi
    done
fi

# remove dirs which are older than PERIOD
ls -d "$NUTCH_DIR"/segments/2* 2>/dev/null | grep -f "$NUTCH_DIR/dates.tmp" > "$NUTCH_DIR/dirs.tmp"
count=`cat "$NUTCH_DIR/dirs.tmp" | wc -l`
if [ $count -gt 0 ];
then
    for dir in `cat "$NUTCH_DIR/dirs.tmp"`
    do
        rm -fr "$dir"
    done
fi

# clean up both temp files
rm -f "$NUTCH_DIR/dates.tmp" "$NUTCH_DIR/dirs.tmp"
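
Since the script above deletes with rm -fr, it may be worth previewing what it would remove first. The following is only a minimal sketch of a dry run, not part of the original script or of Nutch: list_old_segments is a hypothetical helper, and it assumes, as the script does, that each segment directory name starts with an 8-digit YYYYMMDD date. It prints the matching directories and deletes nothing.

```shell
#!/bin/sh
# Dry-run sketch: print segment dirs dated on or before a cutoff.
# list_old_segments is a hypothetical helper, not a Nutch command.

# usage: list_old_segments SEGMENTS_DIR CUTOFF_YYYYMMDD
list_old_segments() {
    segments_dir=$1
    cutoff=$2
    for dir in "$segments_dir"/*/
    do
        # skip the unexpanded pattern when there are no subdirectories
        [ -d "$dir" ] || continue
        name=`basename "$dir"`
        # assumption: the first 8 characters of the name are the date
        seg_date=`echo "$name" | cut -c1-8`
        # ignore names that do not start with exactly 8 digits
        case $seg_date in
            [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
            *) continue ;;
        esac
        # same comparison as the script: the cutoff day itself counts as old
        if [ "$seg_date" -le "$cutoff" ]
        then
            echo "$segments_dir/$name"
        fi
    done
}
```

With GNU date, it could be called as: list_old_segments /home/glenn/nutch/segments `date -d "30 days ago" +%Y%m%d`. Once the printed list looks right, the same cutoff can be trusted in the deleting script.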