Posted to mapreduce-issues@hadoop.apache.org by "Fabrice Huet (JIRA)" <ji...@apache.org> on 2010/07/30 18:10:19 UTC
[jira] Created: (MAPREDUCE-1987) No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException
No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException
-------------------------------------------------------------------------------------------------------------------
Key: MAPREDUCE-1987
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.20.2
Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
Reporter: Fabrice Huet
If I understand correctly, the partition file should contain distinct values in increasing order.
In InputSampler.writePartitionFile(...), if the sample size is lower than the number of reduce tasks, the k index can take the same value on two consecutive iterations. As a side effect of the while loop, the written values end up interleaved.
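For reference, here is the surrounding loop as I read it in the 0.20.x source (trimmed; writer appends each chosen sample to the partition file as a split point). With fewer samples than reducers, stepSize is below 1, so Math.round can yield the same k for consecutive values of i, and nothing forces k past the previous last when the compared keys differ:

float stepSize = samples.length / (float) numPartitions; // below 1.0 in this case
int last = -1;
for (int i = 1; i < numPartitions; ++i) {
  int k = Math.round(stepSize * i);     // may repeat an earlier value of k
  while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
    ++k;                                // skip duplicates of the last written key
  }
  writer.append(samples[k], nullValue); // k can still be <= last if the keys differ
  last = k;
}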
Example: taking 100 samples on a job with 120 reducers produces the following values of k and last after the while loop:
while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
  ++k;
}
// values of k and last displayed here
k=68  last=67   // correct
k=69  last=68   // correct
k=68  last=69   // incorrect: samples[68] has already been written
k=69  last=68   // incorrect: samples[69] has already been written
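This index behaviour can be reproduced without Hadoop. Below is a hypothetical standalone driver (class name and sample data are mine, not from the job above): samples[i] = i/2 stands in for a sorted sample in which every key appears twice, and the bounds check is added only so the demo stops cleanly instead of overrunning the array:

public class SplitPointRepro {
  public static void main(String[] args) {
    int[] samples = new int[100];               // 100 samples, sorted
    for (int i = 0; i < samples.length; ++i) {
      samples[i] = i / 2;                       // every key appears twice
    }
    int numPartitions = 120;                    // more reducers than samples
    float stepSize = samples.length / (float) numPartitions;
    int last = -1;
    for (int i = 1; i < numPartitions; ++i) {
      int k = Math.round(stepSize * i);
      while (last >= k && samples[last] == samples[k]) {
        ++k;                                    // same duplicate skipping as InputSampler
      }
      if (k >= samples.length) {                // guard added for the demo only
        System.out.println("i=" + i + ": k=" + k + " overruns samples[]");
        break;
      }
      if (k <= last) {                          // this split point is out of order
        System.out.println("i=" + i + ": k=" + k + " <= last=" + last);
      }
      last = k;
    }
  }
}

Running it prints a series of k <= last pairs analogous to the trace above, and on my run of the arithmetic the overrun guard eventually fires as well, so with enough duplicate keys k can even run past the end of the sample array.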
The partition file will then be considered corrupted when it is read by the TotalOrderPartitioner:
throw new IOException("Split points are out of order");
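The check that produces this message sits in TotalOrderPartitioner when the split points are loaded; paraphrased from the 0.20.x source (splitPoints holds the keys read back from the partition file, comparator is the job's output key comparator). Note the >= 0: duplicated split points are rejected just like decreasing ones, so a repeated sample trips the same exception:

for (int i = 0; i < splitPoints.length - 1; ++i) {
  // ">= 0" rejects duplicates as well as keys that decrease
  if (comparator.compare(splitPoints[i], splitPoints[i + 1]) >= 0) {
    throw new IOException("Split points are out of order");
  }
}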
It seems to me that the number of partitions should be min(samples.length, job.getNumReduceTasks(), number of distinct values in the sample).
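As a sketch of that suggestion (hypothetical, not an actual patch), writePartitionFile could count the distinct keys after sorting the sample and cap the partition count before computing stepSize:

// samples is already sorted with comparator at this point
int distinct = (samples.length == 0) ? 0 : 1;
for (int i = 1; i < samples.length; ++i) {
  if (comparator.compare(samples[i - 1], samples[i]) != 0) {
    ++distinct;
  }
}
// never ask for more split points than there are distinct sample keys
int numPartitions = Math.min(job.getNumReduceTasks(),
                             Math.min(samples.length, distinct));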
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.