Posted to user@nutch.apache.org by Scott Owens <sc...@gmail.com> on 2006/02/08 14:56:51 UTC

Re: How to add only new urls to DB

Hi All,

I wanted to check in to see if anyone has found an answer for this
issue.  I am injecting new URLs on a daily basis, and only need to
fetch/index those new ones, but obviously need to maintain a complete
webdb.

One thing I was thinking of was to use a temporary webdb for the initial
injection, then update (updatedb) my primary webdb after the fetch
or after indexing.

# prepare dirs and inject urls into a temporary webdb ($db);
# $dbmain below is the permanent webdb
rm -rf $db/*
$nutch admin -local $db -create
$nutch inject -local $db -urlfile $urlFile

echo -e "\nGenerating next segment to fetch"
$nutch generate -local $db $segmentdir $fetchLimit
s=`ls -d $segmentdir/* | tail -1`
echo -e "\nFetching next segment"
$nutch fetch $s
echo -e "\nUpdating web database"
$nutch updatedb $dbmain $s
echo -e "\nAnalyzing links"
$nutch analyze $dbmain 5

OR after the segment is indexed -- as the above method wouldn't allow
a depth greater than 1?

# prepare dirs and inject urls into the temporary webdb ($db)
rm -rf $db/*
$nutch admin -local $db -create
$nutch inject -local $db -urlfile $urlFile

for i in `seq $depth`
do
	echo -e "\nGenerating next segment to fetch"
	$nutch generate -local $db $segmentdir $fetchLimit
	s=`ls -d $segmentdir/* | tail -1`
	echo -e "\nFetching next segment"
	$nutch fetch $s
	echo -e "\nUpdating web database"
	$nutch updatedb $db $s
	echo -e "\nAnalyzing links"
	$nutch analyze $db 5
done

echo -e "\nFetch done"
echo "Indexing segments"

for s in `ls -1d $segmentdir/*`
do
	$nutch index $s
done

# update the main webdb with every fetched segment, not just the last
echo -e "\nUpdating web database"
for s in `ls -1d $segmentdir/*`
do
	$nutch updatedb $dbmain $s
done


OR maybe I have no idea what I'm talking about : ) - I'm not a
developer, just trying to figure things out.

If anyone has experience with this and some advice, I'm all ears. Thanks!

Scott

On 11/10/05, Dean Elwood <de...@gmail.com> wrote:
> Hi Lawrence,
>
> I'm stuck in the same position. I haven't yet examined the "merge" function,
> which might shed some light on it.
>
> Have you managed to discover anything so far?
>
> >>You can use the regular expression based url filter. Then only urls that
> >>match the pattern will be added to a fetch list.<<
>
> Hi Stefan. Getting the new URLs to crawl is the easy part ;-)
>
> The trick, and the question, is how to add that to an existing database
> and then re-index without doing a full re-crawl?
>
> Thanks,
>
> Dean
>
> ----- Original Message -----
> From: "Lawrence Pitcher" <lp...@redomains.com>
> To: <nu...@lucene.apache.org>
> Sent: Thursday, November 10, 2005 5:05 PM
> Subject: How to add only new urls to DB
>
>
> Hi,
>
> Thanks to all for the best search solution available.
>
> I have installed the software, indexed 15,000 websites, and tested the
> search, and it works great!
>
> I want to add only two more websites, so I made a "newurls.txt" file,
> injected it into the WebDB with "bin/nutch inject db/ -urlfile
> newurls.txt", then generated a new segment with "bin/nutch generate
> db/ segments/". I then checked for the new segment name in the
> "segments/" directory.
>
> I took that new segment name and placed it in the fetch command:
> "bin/nutch fetch segments/20051110103316/"
>
> However, it appears to re-fetch all 15,000 webpages along with the
> newurls.txt webpages.
>
> Can I not just fetch and index only the new URLs and then update the DB?
>
> Sorry for such a lame question but I have just started.
>
> Many thanks to all.
> Lawrence
>
>

Re: How to add only new urls to DB

Posted by Enrico Triolo <en...@gmail.com>.
Ok, this seems to be the correct behaviour... Let's approach the problem
from another perspective: can I remove urls that are in DB_UNFETCHED
status before injecting the new urls?

Enrico

On 2/13/06, Gal Nitzan <gn...@usa.net> wrote:
> No, since generate looks in the web db (crawldb) for links whose
> status is db_unfetched, and it doesn't know which of them were just
> injected...

Re: How to add only new urls to DB

Posted by Gal Nitzan <gn...@usa.net>.
No, since generate looks in the web db (crawldb) for links whose
status is db_unfetched, and it doesn't know which of them were just
injected...
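
You can check what generate will see by inspecting the web db first,
e.g. (the readdb options below are as in the 0.7 tutorial; check
"bin/nutch readdb" usage if your version differs):

bin/nutch readdb db -stats         # page and link counts
bin/nutch readdb db -dumppageurl   # dump every page, fetched or not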


On Mon, 2006-02-13 at 16:52 +0100, Enrico Triolo wrote:
> > ...
> > In general, if you inject a set of urls into a webdb and create a new
> > segment, the segment should only contain the new urls plus pages that
> > are older than 30 days and would be fetched anyway.
> 
> Actually it seems to me that generated segments also contain urls that
> are in DB_UNFETCHED status from the latest fetching job.
> 
> I mean, if I inject a url and set a fetching depth of 1, at the end
> of the process the webdb will contain 1 url in DB_FETCHED status and n
> urls in DB_UNFETCHED (where n is the number of outgoing links of the
> injected url).
> If I then inject another url and generate a new segment, it will
> contain the new url itself plus the n urls from the previous iteration...
> Is there a way to instruct nutch to only fetch the injected url?
> 
> Thanks,
> Enrico

Re: How to add only new urls to DB

Posted by Enrico Triolo <en...@gmail.com>.
> ...
> In general, if you inject a set of urls into a webdb and create a new
> segment, the segment should only contain the new urls plus pages that
> are older than 30 days and would be fetched anyway.

Actually it seems to me that generated segments also contain urls that
are in DB_UNFETCHED status from the latest fetching job.

I mean, if I inject a url and set a fetching depth of 1, at the end
of the process the webdb will contain 1 url in DB_FETCHED status and n
urls in DB_UNFETCHED (where n is the number of outgoing links of the
injected url).
If I then inject another url and generate a new segment, it will
contain the new url itself plus the n urls from the previous iteration...
Is there a way to instruct nutch to only fetch the injected url?
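
A hypothetical regex-urlfilter.txt along the lines of the regular
expression url filter mentioned earlier in the thread might work as a
stopgap, restricting the fetch list to the newly injected sites (the
two site patterns below are made-up placeholders):

# accept only the newly injected sites
+^http://www\.newsite1\.com/
+^http://www\.newsite2\.com/
# reject everything else, so older DB_UNFETCHED urls are skipped
-.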

Thanks,
Enrico


Re: How to add only new urls to DB

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Scott,
yes, this makes sense.
I would also create a temp web db, create the segment, and crawl the
segment.
If you don't want to add the pages linked from the new urls, then just
index the segment and add this segment to the other searchable
segments; do not update the db.
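
A minimal sketch of that flow, reusing the variables from the scripts
above ($tmpdb and $mainsegments are made-up names):

# temp webdb, one fetch round, index the segment, and no updatedb
rm -rf $tmpdb
$nutch admin -local $tmpdb -create
$nutch inject -local $tmpdb -urlfile $urlFile
$nutch generate -local $tmpdb $segmentdir $fetchLimit
s=`ls -d $segmentdir/* | tail -1`
$nutch fetch $s
$nutch index $s
# make the segment searchable next to the existing segments; since
# updatedb is never run, outlinks of the new pages are never scheduled
mv $s $mainsegments/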

In general, if you inject a set of urls into a webdb and create a new
segment, the segment should only contain the new urls plus pages that
are older than 30 days and would be fetched anyway.
Greetings,
Stefan


---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net