Posted to dev@nutch.apache.org by "AJ Chen (JIRA)" <ji...@apache.org> on 2006/11/08 01:28:51 UTC
[jira] Created: (NUTCH-398) map-reduce very slow when crawling on
single server
map-reduce very slow when crawling on single server
---------------------------------------------------
Key: NUTCH-398
URL: http://issues.apache.org/jira/browse/NUTCH-398
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8.1
Environment: linux and windows
Reporter: AJ Chen
This seems to be a bug, so I am creating a ticket here. I'm using Nutch 0.9-dev to crawl the web on one Linux server. With the default Hadoop configuration (local file system, no distributed crawling), the Generator and Fetcher spend a disproportionate amount of time on map-reduce operations. For example:
2006-11-01 20:32:44,074 INFO crawl.Generator - Generator: segment: crawl/segments/20061101203244
... (doing map and reduce for 2 hours)
2006-11-01 22:28:11,102 INFO fetcher.Fetcher - Fetcher: segment: crawl/segments/20061101203244
... (fetching for 12 hours)
2006-11-02 11:15:10,590 INFO mapred.LocalJobRunner - 175383 pages, 16583 errors, 3.8 pages/s, 687 kb/s,
2006-11-02 11:17:24,039 INFO mapred.LocalJobRunner - reduce > sort
... (but doing reduce > sort and reduce > reduce for 8 hours)
2006-11-02 19:13:38,882 INFO crawl.CrawlDb - CrawlDb update: segment: crawl/segments/20061101203244
Since the crawl runs on a single machine, such slow map-reduce operations are not expected.
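The rounded durations quoted in the log excerpt can be checked from the timestamps themselves. The following is only a minimal sketch: the timestamps are copied from the report, while the phase labels and function names are hypothetical choices made for this illustration and are not part of Nutch or Hadoop. The "fetch" interval ends at the last fetcher progress line shown in the report, which is assumed here to be close to the end of fetching.

```python
from datetime import datetime

# Timestamps copied from the log excerpt in the report. The phase labels are
# descriptive names chosen for this sketch, not Nutch terminology.
LOG_MARKS = [
    ("generate", "2006-11-01 20:32:44,074", "2006-11-01 22:28:11,102"),
    ("fetch", "2006-11-01 22:28:11,102", "2006-11-02 11:15:10,590"),
    ("reduce", "2006-11-02 11:17:24,039", "2006-11-02 19:13:38,882"),
]

# Layout of the timestamp at the start of each log line (comma before millis).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_hours(start, end):
    """Return the time between two log timestamps, in hours."""
    begin = datetime.strptime(start, LOG_TIME_FORMAT)
    finish = datetime.strptime(end, LOG_TIME_FORMAT)
    return (finish - begin).total_seconds() / 3600.0


def phase_durations(marks=LOG_MARKS):
    """Map each phase label to its duration in hours, rounded to 2 places."""
    return {name: round(elapsed_hours(s, e), 2) for name, s, e in marks}


if __name__ == "__main__":
    for name, hours in phase_durations().items():
        print(f"{name}: {hours:.2f} h")
```

Under these assumptions the three intervals come to roughly 1.9, 12.8 and 7.9 hours, which matches the "2 hours", "12 hours" and "8 hours" noted in the excerpt.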
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-398) map-reduce very slow when crawling on
single server
Posted by "Uros Gruber (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-398?page=comments#action_12448053 ]
Uros Gruber commented on NUTCH-398:
-----------------------------------
Has anyone tried using a single machine, not in local mode but with Nutch running as a one-node cluster? That might be a workaround until the bug is fixed.
I need to recrawl about 800k URLs, and I'll report my timings.
[jira] Commented: (NUTCH-398) map-reduce very slow when crawling on
single server
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-398?page=comments#action_12448949 ]
Sami Siren commented on NUTCH-398:
----------------------------------
>Did anyone try to use single machine but not with local mode but with nutch acting like one node? Maybe this is workaround till bug is fixed.
>I need to recrawl about 800k urls and I'll report my timing.
Could you also try the patch on NUTCH-395 and report whether it helps?
[jira] Closed: (NUTCH-398) map-reduce very slow when crawling on
single server
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-398.
-----------------------------------
Resolution: Cannot Reproduce
Fix Version/s: 1.0.0
Assignee: Andrzej Bialecki
[jira] Commented: (NUTCH-398) map-reduce very slow when crawling on
single server
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659338#action_12659338 ]
Andrzej Bialecki commented on NUTCH-398:
-----------------------------------------
This is an old bug, and refers to an old Hadoop bug that is already fixed in newer versions.
[jira] Commented: (NUTCH-398) map-reduce very slow when crawling on
single server
Posted by "nutch.newbie (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-398?page=comments#action_12448033 ]
nutch.newbie commented on NUTCH-398:
------------------------------------
FYI
It's more of a Hadoop bug:
http://issues.apache.org/jira/browse/HADOOP-206
Seems like the bug is not highly prioritized.