Posted to mapreduce-user@hadoop.apache.org by Florin P <fl...@yahoo.com> on 2011/09/21 21:59:43 UTC

Computing the number of mappers when CombineFileInputFormat is used (Reloaded)

 
Hello!
    I would like to know how Hadoop computes the number of mappers when
 CombineFileInputFormat is used. I have read the API documentation for
 CombineFileInputFormat (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html),
 but unfortunately I could not work out how the input splits are
 computed.
 We have a cluster with 10 data nodes, with data files (MapFiles)
 spread across them. We use this InputFormat in our M/R jobs. Reading
 the documentation, we have to distinguish between three scenarios:
   1. The mapred.max.split.size property is not specified: then we get
 a single mapper. This behaviour is consistent with the documentation:
 "If maxSplitSize is not specified, then blocks from the same rack are
 combined in a single split; no attempt is made to create node-local
 splits."
   2. mapred.max.split.size is specified and equal to the block size
 (64 MB in our case). According to the documentation: "If the
 maxSplitSize is equal to the block size, then this class is similar to
 the default splitting behaviour in Hadoop: each block is a locally
 processed split."
 Question: Do I understand correctly that in this case the number of
 splits is calculated the same way as when CombineFileInputFormat is
 not used?
   3. According to the documentation: "If a maxSplitSize is specified,
 then blocks on the same node are combined to form a single split.
 Blocks that are left over are then combined with other blocks in the
 same rack."
    Question 1: From the above I understood that the number of splits
 equals the number of nodes ("blocks on the same node are combined to
 form a single split"). I have observed that this is not the case. So
 how is the number of splits computed?
    Question 2: Is it best practice to set mapred.max.split.size
 (maxSplitSize) greater than or equal to the block size?
    (In my opinion, I would use the same value as the block size so as
 not to lose data locality, but please correct me if I'm wrong. A rough
 sketch of what I mean follows right after this list.)
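
 For reference, here is roughly what I mean for scenario 2 in our job
 driver (old mapred API). This is only a sketch: the class names and
 paths are placeholders, and MyCombineInputFormat stands for our own
 CombineFileInputFormat subclass (a skeleton of it appears further
 down, next to my question about pools).

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CombineSplitDriver {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(CombineSplitDriver.class);
        conf.setJobName("combine-split-size-example");

        // Cap each combined split at one HDFS block (64 MB on our cluster),
        // so that a split does not span block boundaries and can stay
        // node-local.
        conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);

        // Our CombineFileInputFormat subclass (placeholder name).
        conf.setInputFormat(MyCombineInputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}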
 
 The documentation also states that "A split cannot have files from
 different pools". What does "pool" mean here? A datanode?
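 For context, the only place I can find "pools" in the API is the
 protected createPool(...) methods on CombineFileInputFormat, so my
 current reading is the skeleton below (the class name and the filter
 are made up, and the record reader is omitted); please correct me if I
 am misreading it.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;

public class MyCombineInputFormat extends CombineFileInputFormat<Text, Text> {

    @Override
    public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
        // Each createPool(...) call appears to define one "pool" of input
        // files, selected by PathFilters; as I read the docs, a combined
        // split never mixes files from different pools.
        createPool(conf, new PathFilter() {
            public boolean accept(Path path) {
                // Illustrative filter only.
                return path.getName().endsWith(".data");
            }
        });
        return super.getSplits(conf, numSplits);
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf conf,
                                                    Reporter reporter) throws IOException {
        // Record reader omitted; not needed to illustrate pools.
        throw new UnsupportedOperationException("sketch only");
    }
}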
 
 I look forward to your answers.
 Thank you,
   Florin