Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/06 16:13:16 UTC

[jira] [Created] (NUTCH-1033) Backport FetcherJob should run more reduce tasks than default

Backport FetcherJob should run more reduce tasks than default
-------------------------------------------------------------

                 Key: NUTCH-1033
                 URL: https://issues.apache.org/jira/browse/NUTCH-1033
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.3, 1.4
            Reporter: Markus Jelsma
             Fix For: 1.4
         Attachments: NUTCH-1033-1.4-1.patch

Andrzej wrote: "FetcherJob now performs fetching in the reduce phase. This means that in a typical Hadoop setup there will be many fewer reduce tasks than map tasks, and consequently the max. total throughput of Fetcher will be proportionally reduced. I propose that FetcherJob should set the number of reduce tasks to the number of map tasks. This way the fetching will be more granular."

This issue covers the backport of NUTCH-884 to Nutch 1.4-dev.
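
A minimal sketch of the reasoning quoted above, in plain Java with no Hadoop dependency. It assumes a simplified model in which the number of concurrently fetching tasks equals the task count of whichever phase does the fetching, ignoring cluster slot limits. The class name, method names, and the numbers 40 and 4 are hypothetical and chosen only for illustration.

```java
public class FetchParallelism {
    // Number of tasks that fetch: the reduce count when fetching runs in the
    // reduce phase, otherwise the map count (simplified model, an assumption).
    static int fetchingTasks(int mapTasks, int reduceTasks, boolean fetchInReduce) {
        return fetchInReduce ? reduceTasks : mapTasks;
    }

    // The proposal in the quoted text: use as many reduce tasks as map tasks.
    static int proposedReduceTasks(int mapTasks) {
        return mapTasks;
    }

    public static void main(String[] args) {
        int maps = 40;   // illustrative value
        int reduces = 4; // illustrative value
        int before = fetchingTasks(maps, reduces, true);
        int after = fetchingTasks(maps, proposedReduceTasks(maps), true);
        System.out.println("fetching tasks before: " + before + ", after: " + after);
    }
}
```

Under this model the proposal only matters when fetching happens in the reduce phase; when it happens in the map phase, the reduce count does not affect the number of fetching tasks.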

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1033) Backport FetcherJob should run more reduce tasks than default

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060636#comment-13060636 ] 

Julien Nioche commented on NUTCH-1033:
--------------------------------------

great. +1 to commit



[jira] [Commented] (NUTCH-1033) Backport FetcherJob should run more reduce tasks than default

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060640#comment-13060640 ] 

Julien Nioche commented on NUTCH-1033:
--------------------------------------

Please ignore my previous comment about committing; it was about another issue.

I am a bit perplexed by this one: it is only in 2.0 that fetching is done as part of the reduce step. In 1.x it is still done in the map step. Setting the number of tasks to use for the reducer could be done through a parameter as per your patch, but I can't think of a reason why we should not simply rely on the generic mapred.map.tasks for this, especially as the value specified would probably be the same across the various jobs involved in a crawl (generate, fetch, parse, etc.).
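
The idea of relying on one generic setting for every job in a crawl can be sketched as follows. This is an illustration only: java.util.Properties stands in for Hadoop's configuration object, the class and method names are hypothetical, and the fallback value is a caller-supplied assumption rather than any Hadoop default. Only the key mapred.map.tasks and the job names come from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class SharedTaskCount {
    // Reads the generic task-count setting; the fallback is used when it is unset.
    static int mapTasks(Properties conf, int fallback) {
        String value = conf.getProperty("mapred.map.tasks");
        return value == null ? fallback : Integer.parseInt(value.trim());
    }

    // Assigns the same task count to every job, in the order the jobs are given.
    static Map<String, Integer> plan(Properties conf, int fallback, String... jobs) {
        Map<String, Integer> plan = new LinkedHashMap<>();
        int n = mapTasks(conf, fallback);
        for (String job : jobs) {
            plan.put(job, n);
        }
        return plan;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("mapred.map.tasks", "16"); // illustrative value
        System.out.println(plan(conf, 1, "generate", "fetch", "parse"));
    }
}
```

With a single shared key there is no per-job parameter to keep in sync, which is the trade-off the comment argues for.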



[jira] [Updated] (NUTCH-1033) Backport FetcherJob should run more reduce tasks than default

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1033:
---------------------------------

    Attachment: NUTCH-1033-1.4-2.patch

New patch also includes modified unit test, which passes!



[jira] [Updated] (NUTCH-1033) Backport FetcherJob should run more reduce tasks than default

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1033:
---------------------------------

    Comment: was deleted

(was: great. +1 to commit)



[jira] [Closed] (NUTCH-1033) Backport FetcherJob should run more reduce tasks than default

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1033.
--------------------------------

       Resolution: Invalid
    Fix Version/s:     (was: 1.4)

You are correct, ignore my stupidity. Andrzej mentioned backporting 844 in an older post. In my silliness I ended up with 884! Laugh at me here =D



[jira] [Updated] (NUTCH-1033) Backport FetcherJob should run more reduce tasks than default

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1033:
---------------------------------

    Attachment: NUTCH-1033-1.4-1.patch

Patch for 1.4-dev.

