Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2006/10/24 18:16:03 UTC

[jira] Closed: (NUTCH-177) Default installation seems to produce working entity of nutch

     [ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]

Sami Siren closed NUTCH-177.
----------------------------


> Default installation seems to produce working entity of nutch
> -------------------------------------------------------------
>
>                 Key: NUTCH-177
>                 URL: http://issues.apache.org/jira/browse/NUTCH-177
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>         Environment: Linux SUSE 9.3
>            Reporter: Matthias Günter
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: crawl-urlfilter.txt, urllist.txt
>
>
> I downloaded 0.7.1 and installed it.
> Then I changed crawl-urlfilter.txt for apache.org.
> Then I added a urllist.txt and tried scanning.
> Apparently the URL was ignored, even though it matched the rule in crawl-urlfilter.txt:
> guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl ../../urllist.txt
> 060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
> 060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
> 060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
> 060115 141534 No FS indicated, using default:local
> 060115 141534 crawl started in: crawl-20060115141534
> 060115 141534 rootUrlFile = ../../urllist.txt
> 060115 141534 threads = 10
> 060115 141534 depth = 5
> 060115 141535 Created webdb at LocalFS,/home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141535 Starting URL processing
> 060115 141535 Plugins: looking in: /home/guenter/workspace/lucene/nutch-0.7.1/plugins
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-more
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-site/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-html/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-text/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-ext
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-pdf
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-rss
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-basic/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-more
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-js
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-ftp
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-msword
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/creativecommons
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/ontology
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-file
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-http/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/clustering-carrot2
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/language-identifier
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-prefix
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-url/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
> 060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-basic/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
> 060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-httpclient
> 060115 141535 found resource crawl-urlfilter.txt at file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-urlfilter.txt
> ..060115 141535 Added 0 pages
> 060115 141535 FetchListTool started
> 060115 141535 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141535 Overall processing: Sorted NaN entries/second
> 060115 141535 FetchListTool completed
> 060115 141536 logging at INFO
> 060115 141537 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141537 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141537 Finishing update
> 060115 141537 Update finished
> 060115 141537 FetchListTool started
> 060115 141537 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141537 Overall processing: Sorted NaN entries/second
> 060115 141537 FetchListTool completed
> 060115 141537 logging at INFO
> 060115 141538 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141538 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141538 Finishing update
> 060115 141538 Update finished
> 060115 141538 FetchListTool started
> 060115 141538 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141538 Overall processing: Sorted NaN entries/second
> 060115 141538 FetchListTool completed
> 060115 141538 logging at INFO
> 060115 141539 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141539 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141539 Finishing update
> 060115 141539 Update finished
> 060115 141539 FetchListTool started
> 060115 141540 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141540 Overall processing: Sorted NaN entries/second
> 060115 141540 FetchListTool completed
> 060115 141540 logging at INFO
> 060115 141541 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141541 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141541 Finishing update
> 060115 141541 Update finished
> 060115 141541 FetchListTool started
> 060115 141541 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060115 141541 Overall processing: Sorted NaN entries/second
> 060115 141541 FetchListTool completed
> 060115 141541 logging at INFO
> 060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141542 Finishing update
> 060115 141542 Update finished
> 060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141542 Sorting pages by url...
> 060115 141542 Getting updated scores and anchors from db...
> 060115 141542 Sorting updates by segment...
> 060115 141542 Updating segments...
> 060115 141542 Done updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141542 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
> 060115 141542 * Opening segment 20060115141535
> 060115 141542 * Indexing segment 20060115141535
> 060115 141542 * Optimizing index...
> 060115 141542 * Moving index to NFS if needed...
> 060115 141542 DONE indexing segment 20060115141535: total 0 records in 0.035 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
> 060115 141543 * Opening segment 20060115141537
> 060115 141543 * Indexing segment 20060115141537
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141537: total 0 records in 0.076 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
> 060115 141543 * Opening segment 20060115141538
> 060115 141543 * Indexing segment 20060115141538
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141538: total 0 records in 0.012 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
> 060115 141543 * Opening segment 20060115141539
> 060115 141543 * Indexing segment 20060115141539
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141539: total 0 records in 0.013 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141543 * Opening segment 20060115141541
> 060115 141543 * Indexing segment 20060115141541
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141541: total 0 records in 0.02 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 Reading url hashes...
> 060115 141543 Sorting url hashes...
> 060115 141543 Deleting url duplicates...
> 060115 141543 Deleted 0 url duplicates.
> 060115 141543 Reading content hashes...
> 060115 141543 Sorting content hashes...
> 060115 141543 Deleting content duplicates...
> 060115 141543 Deleted 0 content duplicates.
> 060115 141543 Duplicate deletion complete locally.  Now returning to NFS...
> 060115 141543 DeleteDuplicates complete
> 060115 141543 Merging segment indexes...
> 060115 141543 crawl finished: crawl-20060115141534
> guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin>   
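
For readers trying to reproduce the "Added 0 pages" result above, the following is a minimal sketch of how a first-match-wins regex URL filter behaves. It is an approximation written in Python, not the code of org.apache.nutch.net.RegexURLFilter. The function names (parse_rules, accepts) and the rule set are hypothetical, and the rules are not the contents of the attached crawl-urlfilter.txt. It assumes each rule line starts with + (accept) or - (reject), that the first matching rule decides, and that a URL matching no rule is ignored.

```python
import re


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match means ignore."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical rules for illustration only (assumption, not the reporter's file).
EXAMPLE_RULES = r"""
# skip non-http schemes
-^(file|ftp|mailto):
# skip URLs containing characters that usually mark queries
-[?*!@=]
# accept hosts under apache.org
+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.
"""

if __name__ == "__main__":
    rules = parse_rules(EXAMPLE_RULES)
    for url in ("http://www.apache.org/", "http://www.apache.org", "http://example.com/"):
        print(url, "->", "accepted" if accepts(rules, url) else "ignored")
```

Under these assumed rules, a seed URL written without the trailing slash does not match the accept pattern and falls through to the final reject rule. That is one possible way a seed list can yield zero pages; whether it applies to this report depends on the attached files.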

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira