Posted to mapreduce-user@hadoop.apache.org by David Ginzburg <gi...@hotmail.com> on 2012/01/12 13:36:01 UTC
Inconsistent result when fairscheduler preemption is on
Hi,
I am running a 70-node CDH3u2 cluster.
A week ago an analyst ran a Hive aggregation query over a year of data and compared the results to what we have in our relational
data warehouse. The results deviated by about 2.5% from the data warehouse figures.
The data in the data warehouse was generated daily, so those jobs are much smaller.
I ran the same query over the same data several times and got slightly different results each time!
After further investigation I found that I had submitted the job to a low-priority pool with preemption enabled.
I also found a correlation between the size of the deviation and the number of killed reduce tasks.
The only time I got the correct results was when I turned off preemption and submitted the query.
This issue is difficult to reproduce. At first we thought it was a Hive issue (http://mail-archives.apache.org/mod_mbox/hive-user/201201.mbox/%3CSNT135-W29D26007E9692D5BE07458B7990@phx.gbl%3E), but now we suspect it is a MapReduce issue.
The query produced a 3-stage MR job, the largest stage with 678 reducers.
That is as far as we have been able to narrow it down so far. I suspect this issue has not come up before because it is rare to have
a reference against which to check the results of large data processing jobs, and the phenomenon does not occur for small jobs.
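
For anyone trying to reproduce the run-to-run deviation described above, here is a minimal sketch of how the outputs of two runs could be compared key by key. It is not part of the original investigation. It assumes each run's result has been exported as tab-separated text with one or more grouping columns followed by a single numeric aggregate in the last column; the function names load_totals and diff_runs are hypothetical.

```python
import csv


def load_totals(lines, delimiter="\t"):
    """Parse delimited lines into {group_key_tuple: numeric_value}.

    Assumption: the last column is the aggregate, all earlier
    columns together form the grouping key.
    """
    totals = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key = tuple(row[:-1])
        # Sum in case the same key appears in more than one output file.
        totals[key] = totals.get(key, 0.0) + float(row[-1])
    return totals


def diff_runs(run_a, run_b, tolerance=0.0):
    """Return {key: (value_a, value_b)} for keys that differ between runs.

    A key missing from one run is reported with None on that side.
    """
    diffs = {}
    for key in sorted(set(run_a) | set(run_b)):
        value_a = run_a.get(key)
        value_b = run_b.get(key)
        if value_a is None or value_b is None:
            diffs[key] = (value_a, value_b)
        elif abs(value_a - value_b) > tolerance:
            diffs[key] = (value_a, value_b)
    return diffs
```

Passing two open result files to load_totals and the resulting dictionaries to diff_runs would list the groups whose aggregates differ, which could then be checked against the reduce attempts that were killed in each run.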