Posted to common-issues@hadoop.apache.org by "Yang Yang (JIRA)" <ji...@apache.org> on 2012/06/11 05:23:42 UTC

[jira] [Created] (HADOOP-8503) logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat

Yang Yang created HADOOP-8503:
---------------------------------

             Summary: logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat
                 Key: HADOOP-8503
                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Yang Yang
            Priority: Minor




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HADOOP-8503) logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Yang updated HADOOP-8503:
------------------------------

    Attachment: 0001-HADOOP-8503-re-enable-mapreduces.job.maps.patch

Here is a patch to use the new mapreduce.job.maps config param.

It's in the same spirit as the old mapred.map.tasks.
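In spirit, the change derives a goal size from the configured map count, roughly like this (a sketch of the intent, not the actual diff; names are illustrative):

```java
// Sketch of the patch's intent (illustrative, not the actual diff): derive a
// goal size from mapreduce.job.maps, as the old API did from mapred.map.tasks.
public class GoalSizeSketch {
    // totalSize: sum of input file lengths; numMaps: value of mapreduce.job.maps
    static long goalSize(long totalSize, int numMaps) {
        return totalSize / Math.max(1, numMaps); // guard against numMaps == 0
    }

    public static void main(String[] args) {
        long totalSize = 10L * 1024 * 1024 * 1024; // 10 GB of input
        System.out.println(goalSize(totalSize, 400)); // target bytes per split (~25.6 MB)
    }
}
```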


I have not tested this, since the GitHub source I pulled does not build. Harsh, could you please try it?

Thanks
Yang
                
> logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8503
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Yang Yang
>            Priority: Minor
>         Attachments: 0001-HADOOP-8503-re-enable-mapreduces.job.maps.patch
>
>
> in the old mapred.FileInputFormat.getSplits(JobConf, int):
>         long splitSize = computeSplitSize(goalSize, minSize, blockSize);
> so we could control splitSize via goalSize, which is derived from mapred.map.tasks.
> in the new code, mapreduce.lib.input.FileInputFormat:
>         long splitSize = computeSplitSize(blockSize, minSize, maxSize);
> i.e. we don't have a goal size anymore. furthermore, the implementation of computeSplitSize() no longer makes sense:
>     return Math.max(minSize, Math.min(maxSize, blockSize));
> since we assume that maxSize is always bigger than minSize, the above line is equivalent to just
> return Math.min(maxSize, blockSize), so minSize is useless
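For reference, the two formulas side by side (a sketch; the old-API body is reproduced from memory of mapred.FileInputFormat, so verify it against the source):

```java
// The two split-size formulas side by side (sketch; the old-API body is an
// assumption reconstructed from the report above, not copied from Hadoop).
public class SplitSizeComparison {
    // Old API: goalSize = totalSize / mapred.map.tasks, so the requested map
    // count can shrink the split size below the block size.
    static long oldComputeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // New API: no goal size; only min/max split size and the block size matter.
    static long newComputeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long blockSize = 128 * MB;
        long goalSize = 10L * 1024 * MB / 400; // 10 GB input, mapred.map.tasks=400
        System.out.println(oldComputeSplitSize(goalSize, 1, blockSize));       // ~25.6 MB splits
        System.out.println(newComputeSplitSize(blockSize, 1, Long.MAX_VALUE)); // 128 MB splits
    }
}
```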


[jira] [Commented] (HADOOP-8503) logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat

Posted by "Harsh J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292629#comment-13292629 ] 

Harsh J commented on HADOOP-8503:
---------------------------------

Hi Yang,

Please describe the logic difference you are referring to, and which versions of Apache Hadoop are affected by it.

Thanks! :)
                
> logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8503
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Yang Yang
>            Priority: Minor
>



[jira] [Commented] (HADOOP-8503) logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292644#comment-13292644 ] 

Yang Yang commented on HADOOP-8503:
-----------------------------------

Harsh:

This is an issue in Pig, which uses the same config for multiple jobs in the same Pig script (one Pig script normally translates to several MR jobs).

Say, for the first Pig stage, I have a huge input file (10 GB). By default, Hadoop launches about 10 GB / 128 MB = 80 mappers.

If I have 400 mapper slots, I want to launch 400 mappers. With the old InputFormat code I could set min.split.size=25MB; with the new code I could also set max.split.size=25MB; both would work fine.

But the next stage in the Pig script takes an input of 100 GB. Now, with a 25 MB split size, it's going to generate 4000 mappers, which is too many for my 400 slots.
In the old code, I could set "mapred.map.tasks=400" to control the upper limit on the number of map tasks, i.e. the lower limit on split size (which takes effect inside computeSplitSize()), so I could still maintain 400 mappers. But the new code leads to 4000 mappers, which doesn't make sense anymore.
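The arithmetic above, as a quick check (sizes chosen so the division is exact; 10 GB is approximated as 10,000 MB):

```java
// Quick check of the mapper counts described above (illustrative sizes only).
public class MapperCountCheck {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        // max.split.size=25MB pins the split size to 25 MB under the new formula
        long splitSize = computeSplitSize(128 * MB, 1, 25 * MB);

        System.out.println(10_000L * MB / splitSize);  // stage 1: 400 mappers, fits 400 slots
        System.out.println(100_000L * MB / splitSize); // stage 2: 4000 mappers, 10x too many
    }
}
```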





                
> logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8503
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Yang Yang
>            Priority: Minor
>


[jira] [Commented] (HADOOP-8503) logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat

Posted by "Harsh J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292641#comment-13292641 ] 

Harsh J commented on HADOOP-8503:
---------------------------------

bq. i.e. we don't have a goal size anymore

I believe this was intentional, to remove confusion around "specify the number of maps to run" kinds of needs, which don't sit well with files as input.

For the issue with the min/max size, do you mean to report that setting min-split-size has no impact on increasing the number of input splits (i.e. mappers)? If possible, can you also attach the bug you wish to report in code form (e.g. a test case of what's expected vs. reality)?
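A minimal test of the kind requested might look like this (hypothetical; it exercises the reported formula directly rather than a real job):

```java
// Hypothetical test case: under the new formula, min-split-size can only
// *raise* the split size (fewer, larger splits); it never lowers it below the
// block size, so it cannot be used to get more mappers.
public class MinSplitSizeTest {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long blockSize = 128 * MB;

        // minSize below the block size has no effect at all:
        System.out.println(computeSplitSize(blockSize, 25 * MB, Long.MAX_VALUE));  // 128 MB
        // minSize only kicks in once it exceeds min(maxSize, blockSize):
        System.out.println(computeSplitSize(blockSize, 256 * MB, Long.MAX_VALUE)); // 256 MB
    }
}
```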
                
> logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8503
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Yang Yang
>            Priority: Minor
>


[jira] [Updated] (HADOOP-8503) logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Yang updated HADOOP-8503:
------------------------------

    Description: 
in the old mapred.FileInputFormat.getSplits(JobConf, int):

        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

so we could control splitSize via goalSize, which is derived from mapred.map.tasks.

in the new code, mapreduce.lib.input.FileInputFormat:

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

i.e. we don't have a goal size anymore. furthermore, the implementation of computeSplitSize() no longer makes sense:

    return Math.max(minSize, Math.min(maxSize, blockSize));

since we assume that maxSize is always bigger than minSize, the above line is equivalent to just
return Math.min(maxSize, blockSize), so minSize is useless
    
> logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8503
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Yang Yang
>            Priority: Minor
>


[jira] [Updated] (HADOOP-8503) logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Yang updated HADOOP-8503:
------------------------------

    Affects Version/s: 0.20.0
    
> logic difference between old mapred.FileInputFormat and mapreduce.lib.input.FileInputFormat
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8503
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Yang Yang
>            Priority: Minor
>
