Posted to user@nutch.apache.org by Chih How Bong <ch...@gmail.com> on 2005/12/28 05:33:53 UTC

Crawler problem in 0.7 and 0.7.1

Hi all,
  I ran into a problem when running the crawler in Nutch 0.7 and 0.7.1.
  Although I have added a number of root URLs to a plain text file, *urls*,
the crawler seems unwilling to fetch any of them. However, when I fall back
to Nutch 0.6, everything works fine.
  Has anyone else run into this problem? I am currently running Nutch 0.7.1
with JDK 1.5 update 6 on Ubuntu 5.10, and I came across the same problem on
my Apple Mac too.
  Below is the crawler log; it shows that the crawler returns 0 entries.
  Thanks in advance.


051227 212142 parsing file:/opt/nutch-0.7.1/conf/nutch-default.xml
051227 212143 parsing file:/opt/nutch-0.7.1/conf/crawl-tool.xml
051227 212143 parsing file:/opt/nutch-0.7.1/conf/nutch-site.xml
051227 212143 No FS indicated, using default:local
051227 212143 crawl started in: crawl.test
051227 212143 rootUrlFile = urls
051227 212143 threads = 10
051227 212143 depth = 3
...

...

..051227 212143 *Added 0 pages*
051227 212143 FetchListTool started
051227 212144 *Overall processing: Sorted 0 entries in 0.0 seconds.*
051227 212144 Overall processing: Sorted NaN entries/second
051227 212144 FetchListTool completed
051227 212144 logging at INFO
051227 212145 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212145 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212145 Finishing update
051227 212145 Update finished
051227 212145 FetchListTool started
*051227 212145 Overall processing: Sorted 0 entries in 0.0 seconds.*
051227 212145 Overall processing: Sorted NaN entries/second
051227 212145 FetchListTool completed
051227 212145 logging at INFO
051227 212146 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212146 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212146 Finishing update
051227 212146 Update finished
051227 212146 FetchListTool started
051227 212146 Overall processing: Sorted 0 entries in 0.0 seconds.
051227 212146 Overall processing: Sorted NaN entries/second
051227 212146 FetchListTool completed
051227 212146 logging at INFO
051227 212147 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212147 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212147 Finishing update
051227 212147 Update finished
051227 212147 Updating /opt/nutch-0.7.1/crawl.test/segments from /opt/nutch-0.7.1/crawl.test/db
051227 212147  reading /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212148  reading /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212148  reading /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212148 Sorting pages by url...
051227 212148 Getting updated scores and anchors from db...
051227 212148 Sorting updates by segment...
051227 212148 Updating segments...
051227 212148 Done updating /opt/nutch-0.7.1/crawl.test/segments from /opt/nutch-0.7.1/crawl.test/db
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212148 * Opening segment 20051227212143
051227 212148 * Indexing segment 20051227212143
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 DONE indexing segment 20051227212143: total 0 records in 0.026s (NaN rec/s).
051227 212148 done indexing
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212148 * Opening segment 20051227212145
051227 212148 * Indexing segment 20051227212145
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 DONE indexing segment 20051227212145: total 0 records in 0.075s (NaN rec/s).
051227 212148 done indexing
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212148 * Opening segment 20051227212146
051227 212148 * Indexing segment 20051227212146
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
*051227 212148 DONE indexing segment 20051227212146: total 0 records in 0.011s (NaN rec/s).*
051227 212148 done indexing
051227 212148 Reading url hashes...
051227 212148 Sorting url hashes...
051227 212148 Deleting url duplicates...
051227 212148 Deleted 0 url duplicates.
051227 212148 Reading content hashes...
051227 212148 Sorting content hashes...
051227 212148 Deleting content duplicates...
051227 212148 Deleted 0 content duplicates.
051227 212148 Duplicate deletion complete locally.   Now returning to NFS...
051227 212148 DeleteDuplicates complete
051227 212148 Merging segment indexes...
051227 212148 crawl finished: crawl.test

Rgds
Chih-How Bong

Re: nutch-0.8-dev

Posted by "R.Mayoran" <ma...@team-lab.com>.
Thank you very much for your quick response.


----- Original Message ----- 
From: "Jérôme Charron" <je...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Tuesday, January 10, 2006 7:14 PM
Subject: Re: nutch-0.8-dev


> Where can I download the latest version of nutch-0.8-dev?

You can download the nightly builds from
http://cvs.apache.org/dist/lucene/nutch/nightly/
Or check out the source code using svn
http://lucene.apache.org/nutch/version_control.html

Regards

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/



Re: nutch-0.8-dev

Posted by Jérôme Charron <je...@gmail.com>.
> Where can I download the latest version of nutch-0.8-dev?

You can download the nightly builds from
http://cvs.apache.org/dist/lucene/nutch/nightly/
Or check out the source code using svn
http://lucene.apache.org/nutch/version_control.html

Regards

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/

nutch-0.8-dev

Posted by "R.Mayoran" <ma...@team-lab.com>.
Hello,

Where can I download the latest version of nutch-0.8-dev?

Thank you.

Mayoran


Re: Crawler problem in 0.7 and 0.7.1

Posted by "Pushpesh Kr. Rajwanshi" <pu...@gmail.com>.
Hi there,

Can you check your crawl-urlfilter.txt file? I suspect there is a slight
handling problem in the code.

+^http://([a-z0-9]*\.)*google.com

works, but

+^http://([a-z0-9]*\.)*google.com/

doesn't work.

You see, the trailing slash gets in the way and prevents the URLs from being
injected. So try removing the "/" at the end of the pattern in
crawl-urlfilter.txt, and then it should work.
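
One possible reading of the behaviour above, sketched in Python rather than
in Nutch's own Java: if a seed URL in the urls file has no trailing slash
(for example http://www.google.com), a pattern that ends in "/" cannot match
it, while the shorter pattern can. The accepts helper and the seed URLs here
are hypothetical, and Python's re.search only approximates how Nutch
evaluates the rules in crawl-urlfilter.txt, so treat this as an illustration
and not as the filter's actual code.

```python
import re

# The two include rules quoted in the message above.
WITHOUT_SLASH = r"+^http://([a-z0-9]*\.)*google.com"
WITH_SLASH = r"+^http://([a-z0-9]*\.)*google.com/"


def accepts(rule: str, url: str) -> bool:
    """Hypothetical helper: True if a '+' (include) rule's regex is found in url.

    Assumption: a rule is a sign character followed by a regular expression,
    and a URL passes when the expression matches somewhere in it.
    """
    if not rule.startswith("+"):
        raise ValueError("this sketch only handles '+' (include) rules")
    return re.search(rule[1:], url) is not None


if __name__ == "__main__":
    # Hypothetical seed URLs, with and without a trailing slash.
    for seed in ("http://www.google.com", "http://www.google.com/"):
        for rule in (WITHOUT_SLASH, WITH_SLASH):
            print(f"{rule!r:45} {seed!r:28} -> {accepts(rule, seed)}")
```

If this is what is happening, an alternative to editing the pattern would be
to write the seed URLs with a trailing slash, but that is an untested guess.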

HTH
Pushpesh

