Posted to mapreduce-dev@hadoop.apache.org by Jim Donofrio <do...@gmail.com> on 2012/06/14 14:41:11 UTC

mapreduce.job.max.split.locations just a warning in hadoop 1.0.3 but not in 2.0.1-alpha?

I didn't hear anything from common-user about this; maybe that was the
wrong list, because this is more of a development issue.

         final int max_loc = conf.getInt(MAX_SPLIT_LOCATIONS, 10);
         if (locations.length > max_loc) {
           LOG.warn("Max block location exceeded for split: "
               + split + " splitsize: " + locations.length +
               " maxsize: " + max_loc);
           locations = Arrays.copyOf(locations, max_loc);
         }

I was wondering about the above code in JobSplitWriter in Hadoop 1.0.3.
The commit message below is somewhat vague. I saw MAPREDUCE-1943 about
setting limits to save memory on the JobTracker. I wanted to confirm
that the above fix only serves as a warning and saves memory on the
JobTracker, and does not cap the input at all, since most InputFormats
seem to ignore the locations?
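For anyone reading this without the 1.0.3 source handy, the warn-and-truncate path above can be reduced to a standalone sketch. This is plain Java with no Hadoop classes; the method and constant names only mirror the JobSplitWriter snippet and are illustrative, not the real API:

```java
import java.util.Arrays;

public class SplitLocationCap {
    // Mirrors the fallback of conf.getInt(MAX_SPLIT_LOCATIONS, 10) in 1.0.3.
    static final int DEFAULT_MAX_LOC = 10;

    // Advisory cap: log a warning and truncate; never fail the job.
    static String[] capLocations(String[] locations, int maxLoc) {
        if (locations.length > maxLoc) {
            System.err.println("Max block location exceeded:"
                + " splitsize: " + locations.length
                + " maxsize: " + maxLoc);
            return Arrays.copyOf(locations, maxLoc);
        }
        return locations;
    }

    public static void main(String[] args) {
        String[] many = new String[15];
        Arrays.fill(many, "host");
        // 15 locations get silently truncated to the cap of 10.
        System.out.println(capLocations(many, DEFAULT_MAX_LOC).length); // 10
        // 3 locations (the common replica count) pass through untouched.
        String[] few = {"a", "b", "c"};
        System.out.println(capLocations(few, DEFAULT_MAX_LOC).length);  // 3
    }
}
```

The key point for the question above: the returned split data shrinks, but no exception is thrown, so job submission proceeds either way.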

I also wanted to know why the recent MAPREDUCE-4146 added this cap to
2.0.1-alpha, but with the original capping behavior of failing the job
by throwing an IOException instead of just warning the user as the
current code does?


commit 51be5c3d61cbc7960174493428fbaa41d5fbe84d
Author: Chris Douglas <cd...@apache.org>
Date:   Fri Oct 1 01:49:51 2010 -0700

    Change client-side enforcement of limit on locations per split
    to be advisory. Truncate on client, optionally fail job at
    JobTracker if exceeded. Added mapreduce.job.max.split.locations
    property.

    +++ b/YAHOO-CHANGES.txt
    +    Change client-side enforcement of limit on locations per split
    +    to be advisory. Truncate on client, optionally fail job at
    +    JobTracker if exceeded. Added mapreduce.job.max.split.locations
    +    property. (cdouglas)
    +


Re: mapreduce.job.max.split.locations just a warning in hadoop 1.0.3 but not in 2.0.1-alpha?

Posted by Harsh J <ha...@cloudera.com>.
Hey Jim,

These are limits on the locations of a single split (for a regular
file, the locations are the hosts where the split's blocks reside).
They do not control or cap inputs; they only cap the maximum number of
locations shippable per InputSplit object. For a 'regular' job on a
'regular' cluster, an input split would have at most 3 locations
(the 3 replica locations to data-localize to). It's very rare to hit
this limit, and it exists only to prevent abuse of memory (a custom
InputFormat could ship a huge number of string values in the locations
array, say). It also helps catch mistakes in custom location
generation, as an added bonus I guess.

I personally don't know why MAPREDUCE-4146 decided to throw an
IOException, but I think that's a better idea for a user-set limit
than just logging a warning and capping the size. Please raise it on
the JIRA to get the author's/reporter's notice! It may have just been
an oversight, but I think I like it this way (it gives a proper
'limit' enforcement feel) :)
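The behavioral difference Jim is asking about boils down to advisory truncation (1.0.3) versus hard enforcement (MAPREDUCE-4146). A minimal side-by-side sketch, again in plain Java with illustrative names rather than the actual Hadoop code:

```java
import java.io.IOException;
import java.util.Arrays;

public class SplitLimitModes {
    // 1.0.3 behavior: advisory -- truncate the array, job continues.
    static String[] truncateAdvisory(String[] locations, int maxLoc) {
        return locations.length > maxLoc
            ? Arrays.copyOf(locations, maxLoc)
            : locations;
    }

    // MAPREDUCE-4146 behavior: hard limit -- job submission fails.
    static String[] enforceHard(String[] locations, int maxLoc)
            throws IOException {
        if (locations.length > maxLoc) {
            throw new IOException("Max block location exceeded for split: "
                + locations.length + " maxsize: " + maxLoc);
        }
        return locations;
    }

    public static void main(String[] args) {
        String[] locs = new String[12];
        Arrays.fill(locs, "node");
        // Advisory mode: silently shrinks 12 locations down to 10.
        System.out.println(truncateAdvisory(locs, 10).length); // 10
        // Hard mode: the same input is rejected outright.
        try {
            enforceHard(locs, 10);
        } catch (IOException e) {
            System.out.println("job fails: " + e.getMessage());
        }
    }
}
```

Under the advisory mode a bad custom InputFormat degrades quietly; under the hard mode the user is forced to notice and fix it, which is the 'proper limit enforcement feel' described above.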




-- 
Harsh J