Posted to mapreduce-user@hadoop.apache.org by David Ginzburg <gi...@hotmail.com> on 2012/01/12 13:36:01 UTC

Inconsistent result when fairscheduler preemption is on

Hi,

I am running a 70-node CDH3u2 cluster.


A week ago the analyst ran a Hive aggregation query over a year of data and compared the results to what we have in our relational
data warehouse. The results deviated by about 2.5% from what was in the data warehouse.
The data in the data warehouse was generated daily, so those jobs are much smaller.

I ran the same query over the same data several times and got slightly different results each time!

After further investigation I found that I had submitted the job to a low-priority pool with preemption enabled.

I found a correlation between the deviation and the number of killed reduce tasks.
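
For anyone who wants to check the same correlation on their own cluster, here is a minimal sketch in Python. It is not what I ran; the function names and the input layout (one pair of killed-reduce-task count and query result per run) are assumptions made for illustration. The per-run numbers would have to be collected by hand, for example from the JobTracker UI.

```python
from math import sqrt


def relative_deviation(observed, reference):
    """Relative deviation of one run's aggregate from the reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(observed - reference) / abs(reference)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def killed_vs_deviation(runs, reference):
    """Correlate killed reduce tasks with deviation from the reference.

    runs: list of (killed_reduce_tasks, query_result) pairs, one per run
          (hypothetical layout, filled in by hand).
    reference: the value from the data warehouse for the same aggregate.
    """
    killed = [k for k, _ in runs]
    deviations = [relative_deviation(r, reference) for _, r in runs]
    return pearson(killed, deviations)
```

A value close to 1 would mean that runs with more killed reducers drift further from the reference. With only a handful of runs this is a hint, not proof.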

The only time I got the correct results was when I turned off preemption and submitted the query.

It is difficult to reproduce this issue. At first we thought it was a Hive issue (http://mail-archives.apache.org/mod_mbox/hive-user/201201.mbox/%3CSNT135-W29D26007E9692D5BE07458B7990@phx.gbl%3E), but now we suspect it is a MapReduce issue.

The query produced a 3-stage MR job, the largest stage with 678 reducers.

That is as far as we have been able to narrow it down. I suspect this issue has never come up before, since a reference result
for large data processing rarely exists, and the phenomenon doesn't occur for small data jobs.