Posted to common-dev@hadoop.apache.org by "Andrew McNabb (JIRA)" <ji...@apache.org> on 2007/01/30 03:25:49 UTC

[jira] Created: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Incorrect number of map tasks when there are multiple input files
-----------------------------------------------------------------

                 Key: HADOOP-960
                 URL: https://issues.apache.org/jira/browse/HADOOP-960
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.10.1
            Reporter: Andrew McNabb


This problem happens with hadoop-streaming and possibly elsewhere.  If there are 5 input files, it will create 130 map tasks, even if mapred.map.tasks=128.  The number of map tasks is incorrectly set to a multiple of the number of files.  (I wrote a much more complete bug report, but Jira lost it when it had an error, so I'm not in the mood to write it all again)
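For illustration, the 130-vs-128 overshoot is consistent with per-file rounding during split computation. The following is a hypothetical sketch (not the actual Hadoop InputFormatBase code) of how dividing each file by a global goal size and rounding up per file produces 130 splits for 5 equal files when 128 are requested:

```python
import math

def estimate_splits(file_sizes, requested_maps):
    """Hypothetical sketch of per-file split rounding (not the real
    InputFormatBase code): each file is divided by a global goal size
    and the per-file count is rounded up, so the total overshoots."""
    total = sum(file_sizes)
    goal = total / requested_maps  # desired bytes per split
    return sum(math.ceil(size / goal) for size in file_sizes)

# Five equal files, 128 requested maps: each file yields ceil(25.6) = 26
# splits, so the job gets 5 * 26 = 130 map tasks.
print(estimate_splits([1000] * 5, 128))  # -> 130
```

With a single input file the same arithmetic lands on 128 exactly, which is why the overshoot only appears with multiple files.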

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468762 ] 

Doug Cutting commented on HADOOP-960:
-------------------------------------

> have the same number of records in each split

That's a very different policy.  The base implementation does not open files, only examines their lengths.  I think adding this as a "knob" would result in convoluted code.  This sounds like a different splitting algorithm altogether, not a modification of the existing one.  So I'd suggest implementing a different InputFormat that implements this.  If you feel others might find it useful, please contribute it.




[jira] Updated: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Andrew McNabb (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew McNabb updated HADOOP-960:
---------------------------------

    Issue Type: Wish  (was: Bug)



[jira] Commented: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468718 ] 

Owen O'Malley commented on HADOOP-960:
--------------------------------------

Oops, sorry about that. To actually get 128 splits you would need to write your own InputFormat and implement getSplits() yourself. That said, it is usually better to take the extra maps and get data locality on the map input.



[jira] Commented: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468747 ] 

Doug Cutting commented on HADOOP-960:
-------------------------------------

> it seems like the need to split input evenly would be pretty common

Can you tell more about why you think this is important and useful?  It's not obvious to me.

Also, your original complaint was about the *number* of splits not matching what you expect.  Now you're complaining about the *size* of the splits not being even.  Which is it you need?  Both?  Why?  If you pass one big file and one little file and ask for six splits, should it break each file into three, or break the bigger file into four and the smaller into two?  How should file size be measured: number of records or number of bytes?  There are myriad possibilities.  The base class implements something that should work well in many cases by default, and it also has some knobs that make it somewhat flexible, but it's not well documented.



[jira] Commented: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468726 ] 

Doug Cutting commented on HADOOP-960:
-------------------------------------

> It's "usually" better to take the extra maps. Unfortunately, I've got one of those ones where it isn't better. :) 

Then subclass your InputFormat and override the getSplits() method.

> This is now a feature request instead of a bug.

The feature's there: you can precisely control splitting if you like.  E.g., Nutch does this when crawling.  Is that not sufficient for you?



[jira] Reopened: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Andrew McNabb (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew McNabb reopened HADOOP-960:
----------------------------------


This is now a feature request instead of a bug.



[jira] Commented: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Andrew McNabb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468744 ] 

Andrew McNabb commented on HADOOP-960:
--------------------------------------

From an end-user perspective, it seems like the need to split input evenly would be pretty common.  Especially for a user like myself who uses hadoop-streaming, it would be nice to set an "even splits" configuration option rather than subclassing InputFormat.  Since the rest of the custom code is non-Java, it is a little awkward to write a Java class for this.

Of course, all of this is based on my belief that this would be commonly desirable.  It seems like something that would be a nice standard feature of the Hadoop toolbox.  I hope the tone of my request is more clear now.  I really think that this would be a great thing to be available for everyone.

Thanks again for everything you do.



[jira] Commented: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Andrew McNabb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468721 ] 

Andrew McNabb commented on HADOOP-960:
--------------------------------------

It's "usually" better to take the extra maps. Unfortunately, I've got one of those ones where it isn't better. :)

I guess this is really a feature request for an InputFormat that puts together all of the files in a directory and splits them evenly.
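The arithmetic behind the requested "even splits" policy is simple: treat all the files' records as one pool and hand out counts that differ by at most one. A minimal sketch of that division (a hypothetical helper for illustration, not an existing Hadoop class):

```python
def even_split_counts(num_records, num_splits):
    """Distribute num_records across num_splits so the per-split counts
    differ by at most one record. Hypothetical sketch of the requested
    "even splits" policy, not an existing Hadoop API."""
    base, extra = divmod(num_records, num_splits)
    # The first `extra` splits each take one additional record.
    return [base + 1 if i < extra else base for i in range(num_splits)]

counts = even_split_counts(1000, 128)
print(len(counts), sum(counts), max(counts) - min(counts))  # -> 128 1000 1
```

An actual InputFormat along these lines would still have to open each file to count record boundaries, which is the cost Doug points out below.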



[jira] Updated: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-960:
--------------------------------

    Component/s: documentation
       Priority: Minor  (was: Major)
     Issue Type: Improvement  (was: Wish)

I think this is really a documentation problem.  InputFormatBase should better document the splitting algorithm that it uses.  I don't think we want to keep adding more options to its algorithm, but we should describe more precisely its split policy.



[jira] Commented: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Andrew McNabb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468751 ] 

Andrew McNabb commented on HADOOP-960:
--------------------------------------

That's a great question.  I actually care about both the number and the size.  I think I switched to talking about size because if you make the size even, the number issue will get fixed automatically.

When I say "make the size of splits even," I mean "have the same number of records in each split."  The reason is that there is a relatively small number of records but they each take a long time to run.  Without a specific attempt at making the splits even, load balancing suffers.  I think that you'd run into this issue with most MapReduce programs that aren't text processors.

I currently have jobs with 1,000 records which take 2 minutes each to map.  If there are 1,000 map tasks, there is too much overhead from distributing the jobs.  I've tried doing 256 map tasks on 256 processors.  In this case, if the number of reduce tasks isn't a power of 2, it creates more tasks than the number of processors, and it takes a long time to run.

Again, I agree that the current behavior works well in many cases by default.  I also think it would be nice if there were a few more knobs.  I don't expect that this would be the highest-priority feature request, but I think it would be generally useful.

Thanks.



[jira] Resolved: (HADOOP-960) Incorrect number of map tasks when there are multiple input files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HADOOP-960.
----------------------------------

    Resolution: Won't Fix

This may not be documented well enough, but it is not a bug. The rules for picking input split sizes are complicated, but usually a split is 1 dfs block. The number of maps requested is a hint to the InputFormat, nothing else. To get the behavior that you want, you can set mapred.min.split.size to a large value. It will force the splits to contain the entire contents of a single file.
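To see why a large mapred.min.split.size collapses each file into a single split, here is a hedged sketch of the clamp that the split-size computation performs (illustrative only; the constants and exact rules in the real code differ):

```python
import math

def num_splits(file_sizes, requested_maps, min_split_size,
               block_size=64 * 1024 * 1024):
    """Illustrative sketch (not the actual Hadoop implementation):
    the split size is clamped between mapred.min.split.size and the
    goal size derived from the requested map count, capped by the
    dfs block size."""
    goal = max(sum(file_sizes) // requested_maps, 1)
    split_size = max(min_split_size, min(goal, block_size))
    # Every file yields at least one split, sized by split_size.
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)

# A huge minimum split size forces one split (one map task) per file:
print(num_splits([1000] * 5, 128, min_split_size=10**12))  # -> 5
```

With a tiny minimum the same sketch reproduces the 130-split behavior from the original report, since the clamp falls back to the goal size and per-file rounding takes over.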
