Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/06 16:13:16 UTC
[jira] [Created] (NUTCH-1033) Backport FetcherJob should run more
reduce tasks than default
Backport FetcherJob should run more reduce tasks than default
-------------------------------------------------------------
Key: NUTCH-1033
URL: https://issues.apache.org/jira/browse/NUTCH-1033
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.3, 1.4
Reporter: Markus Jelsma
Fix For: 1.4
Attachments: NUTCH-1033-1.4-1.patch
Andrzej wrote: "FetcherJob now performs fetching in the reduce phase. This means that in a typical Hadoop setup there will be many fewer reduce tasks than map tasks, and consequently the max. total throughput of Fetcher will be proportionally reduced. I propose that FetcherJob should set the number of reduce tasks to the number of map tasks. This way the fetching will be more granular."
This issue covers the backport of NUTCH-884 to Nutch 1.4-dev.
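For illustration only, the quoted proposal (make the reduce task count equal to the map task count) can be sketched as below. This is not the attached patch. It models the job configuration with a plain java.util.Properties object so that it runs without Hadoop on the classpath. The class and method names are hypothetical, mapred.reduce.tasks is assumed to be the reduce-side counterpart of the mapred.map.tasks property mentioned in this thread, and the fallback value of 2 is an assumption.

```java
import java.util.Properties;

public class FetchReduceTasks {
    // Property names from Hadoop's old "mapred" configuration namespace.
    // mapred.map.tasks is mentioned in this thread; mapred.reduce.tasks is
    // assumed here to be its reduce-side counterpart.
    static final String MAP_TASKS = "mapred.map.tasks";
    static final String REDUCE_TASKS = "mapred.reduce.tasks";

    /**
     * Copies the configured map task count onto the reduce task count and
     * returns the resulting number of reduce tasks.
     * The fallback of "2" is an assumed default, used only when no map task
     * count has been configured.
     */
    static int alignReduceTasksWithMapTasks(Properties conf) {
        int mapTasks = Integer.parseInt(conf.getProperty(MAP_TASKS, "2"));
        conf.setProperty(REDUCE_TASKS, Integer.toString(mapTasks));
        return mapTasks;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAP_TASKS, "16");
        conf.setProperty(REDUCE_TASKS, "4");

        int reducers = alignReduceTasksWithMapTasks(conf);
        System.out.println("reduce tasks = " + reducers);
    }
}
```

In a real Hadoop job the same idea would be applied to the job's own configuration object rather than to a Properties stand-in.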
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1033) Backport FetcherJob should run more
reduce tasks than default
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060636#comment-13060636 ]
Julien Nioche commented on NUTCH-1033:
--------------------------------------
great. +1 to commit
[jira] [Commented] (NUTCH-1033) Backport FetcherJob should run more
reduce tasks than default
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060640#comment-13060640 ]
Julien Nioche commented on NUTCH-1033:
--------------------------------------
Please ignore my previous comment about committing - it was about another issue.
I am a bit perplexed by this one: it is only in 2.0 that fetching is done as part of the reduce step. In 1.x it is still done in the map step. Setting the number of reduce tasks could be done through a parameter, as in your patch, but I can't think of a reason why we should not simply rely on the generic mapred.map.tasks for this, especially as the value specified would probably be the same across the various jobs involved in a crawl (generate, fetch, parse, etc.).
[jira] [Updated] (NUTCH-1033) Backport FetcherJob should run more
reduce tasks than default
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1033:
---------------------------------
Attachment: NUTCH-1033-1.4-2.patch
New patch also includes modified unit test, which passes!
[jira] [Updated] (NUTCH-1033) Backport FetcherJob should run more
reduce tasks than default
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-1033:
---------------------------------
Comment: was deleted
(was: great. +1 to commit)
[jira] [Closed] (NUTCH-1033) Backport FetcherJob should run more
reduce tasks than default
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1033.
--------------------------------
Resolution: Invalid
Fix Version/s: (was: 1.4)
You are correct, ignore my stupidity. Andrzej mentioned backporting 844 in an older post. In my silliness I ended up with 884! Laugh at me here =D
[jira] [Updated] (NUTCH-1033) Backport FetcherJob should run more
reduce tasks than default
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1033:
---------------------------------
Attachment: NUTCH-1033-1.4-1.patch
Patch for 1.4-dev.