You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Shaun Clowes <sc...@atlassian.com> on 2013/06/08 04:33:52 UTC

Very slow insert performance on Amazon EMR when using Dynamic Partitions

Hi All,

I'm trying to track down the root cause of some terrible insert performance
I'm seeing on Hive 0.8.1 in Amazon EMR. When I run an identical insert with
a static partition I get around 15 times the the throughput (across the
cluster) as I do when I use dynamic partitions. There are very few
partitions generated during the insert, definitely less than 15 (the table
is partitioned by a string of the form 'YYYY-MM').

During my experimentation I found that using the "hive.task.progress"
configuration variable would cause the performance of the static partition
version to go very slowly as well. This seems to explain the dynamic
partition performance because it is forced on in SemanticAnalyzer.java:

          // turn on hive.task.progress to update # of partitions created
to the JT
          HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVEJOBPROGRESS,
true);

Does anyone on the list know why this has to be forced on? I can't seem to
find where in the code this '# of partitions' is generated, sent to the JT
or even read by any other part of the Hive code.

I'm also not 100% sure why this would be causing this tupe of performance
problem. It might be System.currentTimeMillis() being called at least 30
times for each record because of preProcessCounter() and
postProcessCounter()?

1076   /**
1077    * this is called after operator process to buffer some counters.
1078    */
1079   private void postProcessCounter() {
1080     if (counterNameToEnum != null) {
1081       totalTime += (System.currentTimeMillis() - beginTime);
1082     }
1083   }

Thanks,
Shaun