Posted to user@nutch.apache.org by Hasan Diwan <ha...@gmail.com> on 2006/02/28 00:16:58 UTC

nutch-extensionpoints 0.71

When I tried to deploy Nutch in intranet crawl mode, the build went fine,
but when I ran the command:

$NUTCH_HOME/bin/nutch crawl $HOME/SearchTest/urls -dir
$HOME/SearchTest/crawl -depth 2

bin/nutch produced the following log. For the sake of completeness, it is
reproduced in its entirety below:

060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
060227 150621 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060227 150621 No FS indicated, using default:local
060227 150621 crawl started in: /home/hdiwan/nutch/crawl20060227150607
060227 150621 rootUrlFile = /home/hdiwan/SpectraSearch/urls
060227 150621 threads = 10
060227 150621 depth = 2
060227 150621 Created webdb at
LocalFS,/home/hdiwan/nutch/crawl20060227150607/db
060227 150621 Starting URL processing
060227 150621 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
060227 150621 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060227 150621 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060227 150621 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
060227 150621 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
060227 150621 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
060227 150621 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
060227 150622 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.more.MoreIndexingFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.TypeQueryFilter
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060227 150622 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060227 150622 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060227 150622 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
060227 150622 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060227 150622 Added 30 pages
060227 150622 Processing pagesByURL: Sorted 30 instructions in 0.0080 seconds.
060227 150622 Processing pagesByURL: Sorted 3750.0 instructions/second
060227 150622 Processing pagesByURL: Merged to new DB containing 18 records in 0.0050 seconds
060227 150622 Processing pagesByURL: Merged 3600.0 records/second
060227 150622 Processing pagesByMD5: Sorted 18 instructions in 0.0040 seconds.
060227 150622 Processing pagesByMD5: Sorted 4500.0 instructions/second
060227 150622 Processing pagesByMD5: Merged to new DB containing 18 records in 0.0010 seconds
060227 150622 Processing pagesByMD5: Merged 18000.0 records/second
060227 150622 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
060227 150622 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
060227 150622 Processing /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622/fetchlist.unsorted: Sorted 18 entries in 0.0030 seconds.
060227 150622 Processing /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622/fetchlist.unsorted: Sorted 6000.0 entries/second
060227 150622 Overall processing: Sorted 18 entries in 0.0030 seconds.
060227 150622 Overall processing: Sorted 1.6666666666666666E-4 entries/second
060227 150622 FetchListTool completed
060227 150622 logging at INFO
060227 150622 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
060227 150622 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/16/book_search_presentation.html
060227 150622 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/20/emailcasting_revisited.html
060227 150622 http.proxy.host = null
060227 150622 http.proxy.port = 8118
060227 150622 http.timeout = 10000
060227 150622 http.content.limit = -1
060227 150622 http.agent = Spectra/200602 (Spectra; http://hasan.wits2020.net/typo/public; spectrasearch+agent@gmail.com)
060227 150622 http.auth.ntlm.username =
060227 150622 fetcher.server.delay = 1000
060227 150622 http.max.delays = 100
060227 150623 Configured Client
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/pint_search.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/16/spectrasearch_privacy_statemen.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/atms_and_googlemaps_tad_buggy.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/19/capannina.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/opera_tries_to_converge_bittor.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/transamerica.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/21/blogging_system_critiques.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/25/sorry_haters.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/15/valentines_overseas.html
060227 150623 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/18/spectrasearch_update.html
060227 150624 Updating /home/hdiwan/nutch/crawl20060227150607/db
060227 150624 Updating for /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150624 Processing document 0
060227 150624 Finishing update
060227 150626 Update finished
060227 150626 Updating /home/hdiwan/nutch/crawl20060227150607/segments from /home/hdiwan/nutch/crawl20060227150607/db
060227 150626  reading /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626  reading /home/hdiwan/nutch/crawl20060227150607/segments/20060227150624
060227 150626 Sorting pages by url...
060227 150626 Getting updated scores and anchors from db...
060227 150626 Sorting updates by segment...
060227 150626 Updating segments...
060227 150626  updating /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626 Done updating /home/hdiwan/nutch/crawl20060227150607/segments from /home/hdiwan/nutch/crawl20060227150607/db
060227 150626 indexing segment: /home/hdiwan/nutch/crawl20060227150607/segments/20060227150622
060227 150626 * Opening segment 20060227150622
060227 150626 * Indexing segment 20060227150622
060227 150626 * Optimizing index...
060227 150626 * Moving index to NFS if needed...
060227 150626 DONE indexing segment 20060227150622: total 18 records in 0.047 s (Infinity rec/s).
060227 150626 done indexing
060227 150626 done indexing
060227 150626 Reading url hashes...
060227 150626 Sorting url hashes...
060227 150626 Deleting url duplicates...
060227 150626 Deleted 0 url duplicates.
060227 150626 Reading content hashes...
060227 150626 Sorting content hashes...
060227 150626 Deleting content duplicates...
060227 150626 Deleted 0 content duplicates.
060227 150626 Duplicate deletion complete locally.  Now returning to NFS...
060227 150626 DeleteDuplicates complete
060227 150626 Merging segment indexes...
060227 150626 crawl finished: /home/hdiwan/nutch/crawl20060227150607

Now, I'm sure there are duplicates in the URL list, yet Nutch doesn't delete
anything. I'm also going to be adding new pages fairly frequently, and the
crawl parameter does not let you add new URLs without removing the last
crawl. So, how would I go about doing this? Thanks for the help! Please CC
replies to my personal address.
--
Cheers,
Hasan Diwan <ha...@gmail.com>

RE: nutch-extensionpoints 0.71

Posted by Richard Braman <rb...@bramantax.com>.
It's the same using Cygwin :).  Please share your script if you can!
I think Nutch will create more than one segment when you run the generate
command, possibly more as the number of URLs grows.
 



Re: nutch-extensionpoints 0.71

Posted by Hasan Diwan <ha...@gmail.com>.
Mr. Braman:

On 27/02/06, Richard Braman <rb...@bramantax.com> wrote:
>
> The latest segments would have a modified date of when you ran
> generate db segments.
> I don't know how to do it in a script.

ls -t | head -n 1

The 't' switch sorts by mtime, at least on Linux, according to the manpage.
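For example, assuming everything under segments/ is a segment directory,
a sketch like this should grab the newest one:

# -t sorts by mtime, newest first; -d keeps ls from listing the
# contents of each segment directory
latest=`ls -dt segments/* | head -n 1`
bin/nutch fetch $latest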
--
Cheers,
Hasan Diwan <ha...@gmail.com>

RE: nutch-extensionpoints 0.71

Posted by Richard Braman <rb...@bramantax.com>.
The latest segments would have a modified date of when you ran
generate db segments.
I don't know how to do it in a script.
Possibly the nutch generate command has a return value?
That's a question for someone with more knowledge than I, but I too
would like to know.
I don't know about the "Deleted 0 content duplicates" line either.




Re: nutch-extensionpoints 0.71

Posted by Hasan Diwan <ha...@gmail.com>.
Mr. Braman (or anyone else):

On 27/02/06, Richard Braman <rb...@bramantax.com> wrote:
>
>
> bin/nutch fetch segments/<latest_segment>


How would I determine which is the latest segment?

> I don't really know what your other question was.

I know there are duplicate URLs in urls.txt. Why would I be getting the line
below?

> 060227 150626 Deleted 0 content duplicates.
>

Thanks again for the kind assistance.

--
Cheers,
Hasan Diwan <ha...@gmail.com>

Nutch 0.8 -building WAR file

Posted by sudhendra seshachala <su...@yahoo.com>.
Hi there,
I got the nightly build, and if I try to run "ant war" I get the following error:
BUILD FAILED
C:\kool\nutch-nightly\build.xml:94: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\build.xml:9: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\clustering-carrot2\build.xml:26: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\build-plugin.xml:97: srcdir "C:\kool\nutch-nightly\src\plugin\nutch-extensionpoints\src\java" does not exist!

I guess I am missing something. Can someone point me in the right direction to find the missing pieces?
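One guess, since Ant only complains that the srcdir does not exist: would
creating the empty directory be enough to unblock the build? Something like:

mkdir C:\kool\nutch-nightly\src\plugin\nutch-extensionpoints\src\java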


  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

RE: nutch-extensionpoints 0.71

Posted by Richard Braman <rb...@bramantax.com>.
>The crawl parameter does not let you add new URLs without removing the
>last crawl. So, how would I go about doing this? Thanks for the help!

To add new urls:
#adds new urls to db
bin/nutch inject db newurls.txt
#generates segments based on new urls
bin/nutch generate db segments
#fetches the new segments
bin/nutch fetch segments/<latest_segment>
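After the fetch, you would presumably also fold the results back into the
db and the index. A sketch, assuming the 0.7 tool names (updatedb, index)
behave as in the tutorial:

#folds the fetched pages back into the webdb (0.7 tool name assumed)
bin/nutch updatedb db segments/<latest_segment>
#indexes the newly fetched segment (0.7 tool name assumed)
bin/nutch index segments/<latest_segment>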

I don't really know what your other question was.

