Posted to mapreduce-issues@hadoop.apache.org by "Fabrice Huet (JIRA)" <ji...@apache.org> on 2010/07/30 18:10:19 UTC

[jira] Created: (MAPREDUCE-1987) No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException

No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException
-------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1987
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.20.2
         Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
            Reporter: Fabrice Huet


If I understand correctly, the partition file should contain distinct values in increasing order.
In InputSampler.writePartitionFile(...), if the sample size is lower than the number of reduce tasks, the k index might keep the same value from one iteration to the next. As a side effect of the while loop, values will be interleaved.

Example: taking 100 samples on a 120-reducer job will produce the following values of k and last after the while loop:
    while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
      ++k;
    }
    // display values here

                 k 68
                 last 67    // correct

                 k 69
                 last 68    // correct

                 k 68
                 last 69    // incorrect, samples[68] has already been written

                 k 69
                 last 68    // incorrect, samples[69] has already been written
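The behaviour can be reproduced outside Hadoop with a minimal sketch. It assumes the surrounding loop computes k as Math.round(stepSize * i) with stepSize = samples.length / (float) numPartitions, which is how I read the 0.20.2 source; only the while loop itself is quoted above. The class and method names (SplitPointRepro, chosenIndices, firstOutOfOrder) are made up for illustration, and plain ints stand in for the sampled keys.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPointRepro {
    // Returns the sample indices that would be written, in write order.
    // Assumption: this mirrors the loop around the quoted while statement.
    static List<Integer> chosenIndices(int[] sortedSamples, int numPartitions) {
        List<Integer> written = new ArrayList<>();
        float stepSize = sortedSamples.length / (float) numPartitions;
        int last = -1;
        for (int i = 1; i < numPartitions; ++i) {
            int k = Math.round(stepSize * i);
            // Same condition as in the report, with == standing in for the comparator.
            while (last >= k && sortedSamples[last] == sortedSamples[k]) {
                ++k;
            }
            written.add(k);
            last = k;
        }
        return written;
    }

    // Position of the first index that is not strictly greater than its
    // predecessor, or -1 if the sequence is strictly increasing.
    static int firstOutOfOrder(List<Integer> indices) {
        for (int i = 1; i < indices.size(); ++i) {
            if (indices.get(i) <= indices.get(i - 1)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] samples = new int[100];
        for (int i = 0; i < samples.length; ++i) {
            samples[i] = i; // 100 distinct, already sorted values
        }
        List<Integer> written = chosenIndices(samples, 120);
        System.out.println("indices written: " + written);
        System.out.println("first out-of-order position: " + firstOutOfOrder(written));
    }
}
```

With fewer samples than partitions, stepSize is below 1, so the rounded k repeats and the while loop pushes it past an index that a later iteration then writes again.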

The partition file will then be considered corrupted when it is read by the TotalOrderPartitioner:
   throw new IOException("Split points are out of order");

It seems to me that the number of partitions should be min(samples.length, job.getNumReduceTasks(), number of distinct values in the sample).
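A rough sketch of that idea, again standalone and with ints in place of real keys. This is not a patch against InputSampler; SafeSplitPoints and its methods are hypothetical names, and picking indices with integer arithmetic is my own choice to keep them strictly increasing.

```java
import java.util.Arrays;

class SafeSplitPoints {
    // Sorted samples with duplicates removed.
    static int[] distinctSorted(int[] samples) {
        return Arrays.stream(samples).sorted().distinct().toArray();
    }

    // The proposed bound: min(sample count, requested reducers, distinct values).
    static int effectivePartitions(int sampleCount, int numReduceTasks, int distinctCount) {
        return Math.min(sampleCount, Math.min(numReduceTasks, distinctCount));
    }

    // Picks numPartitions - 1 split points from the distinct values.
    // Because numPartitions <= distinct.length, the index i * length / numPartitions
    // grows by at least 1 per step, so the split points are strictly increasing.
    static int[] splitPoints(int[] samples, int numReduceTasks) {
        int[] distinct = distinctSorted(samples);
        int numPartitions = effectivePartitions(samples.length, numReduceTasks, distinct.length);
        int[] splits = new int[Math.max(0, numPartitions - 1)];
        for (int i = 1; i < numPartitions; ++i) {
            int k = (int) ((long) i * distinct.length / numPartitions);
            splits[i - 1] = distinct[k];
        }
        return splits;
    }
}
```

If the partition count is reduced this way, the job's number of reduce tasks would presumably have to be lowered to match, since the partitioner expects one split point fewer than the number of reducers.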


