Posted to issues@spark.apache.org by "Mitesh (Jira)" <ji...@apache.org> on 2022/08/04 04:13:00 UTC

[jira] [Commented] (SPARK-10970) Executors overload Hive metastore by making massive connections at execution time

    [ https://issues.apache.org/jira/browse/SPARK-10970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575014#comment-17575014 ] 

Mitesh commented on SPARK-10970:
--------------------------------

[~cheolsoo] I'm seeing this on Spark 2.4; here is my call stack: https://gist.github.com/MasterDDT/ac3b2e73bd4a79226d12ef2c78848537. What makes it worse for me is that my Hive metastore is backed by AWS Glue, so the `reloadFunctions()` calls cause throttling issues.


> Executors overload Hive metastore by making massive connections at execution time
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-10970
>                 URL: https://issues.apache.org/jira/browse/SPARK-10970
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>         Environment: Hive 1.2, Spark on YARN
>            Reporter: Cheolsoo Park
>            Priority: Critical
>
> This is a regression in Spark 1.5, introduced when the Hive dependency was upgraded to 1.2.
> HIVE-2573 introduced a new feature that allows users to register functions in a session. The problem is that it added a [static code block|https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L164-L170] to Hive.java:
> {code}
> // register all permanent functions. need improvement
> static {
>   try {
>     reloadFunctions();
>   } catch (Exception e) {
>     LOG.warn("Failed to access metastore. This class should not accessed in runtime.",e);
>   }
> }
> {code}
> This code block is executed by every Spark executor in the cluster when HadoopRDD tries to access JobConf. So if a Spark job has high parallelism (e.g. 1000+), the executors will hammer the HCat server, causing it to go down in the worst case.
> Here is the stack trace that I captured in an executor when it connects to the Hive metastore:
> {code}
> 15/10/06 19:26:05 WARN conf.HiveConf: HiveConf of name hive.optimize.s3.query does not exist
> 15/10/06 19:26:05 INFO hive.metastore: XXX: java.lang.Thread.getStackTrace(Thread.java:1589)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:803)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:782)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:347)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.sql.hive.HadoopTableReader$anonfun$17.apply(TableReader.scala:322)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.sql.hive.HadoopTableReader$anonfun$17.apply(TableReader.scala:322)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: scala.Option.map(Option.scala:145)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:179)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:231)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:227)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:103)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:97)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.scheduler.Task.run(Task.scala:88)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: java.lang.Thread.run(Thread.java:745)
> 15/10/06 19:26:05 INFO hive.metastore: Trying to connect to metastore with URI thrift://admin.gateway.dataeng.netflix.net:11002
> {code}
> As can be seen, HadoopRDD tries to get JobConf in the executor, which in turn invokes the {{reloadFunctions()}} function in Hive.java.
> What's worse, due to HIVE-10319, a single {{reloadFunctions()}} call ends up making hundreds of thrift calls to the Hive metastore if it contains a large number of databases. So any Spark job can easily take down the HCat server in production.
> As a workaround, I forked Databricks' [Hive 1.2 repo|https://github.com/pwendell/hive/commits/release-1.2.1-spark], removed the static code block from Hive.java, and rebuilt Spark with this forked version of Hive. I don't know if there is a better way of fixing this problem.
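
For readers unfamiliar with why a static block has this effect, here is a minimal, self-contained Java sketch of the mechanism described in the quoted report. It is not Hive or Spark code: `FakeHive`, `metastoreCalls`, the figure of 200 databases, and `estimatedCalls` are all hypothetical stand-ins, and the model of one call to list databases plus one call per database is a simplifying assumption. It only illustrates that a static initializer runs once per JVM on first use of the class, so the total load grows with the number of executor JVMs.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StaticInitDemo {
    // Hypothetical counter standing in for thrift calls received by the metastore.
    static final AtomicInteger metastoreCalls = new AtomicInteger();

    // Hypothetical stand-in for a class with a static initializer like the one in Hive.java.
    static class FakeHive {
        static final int DATABASES = 200; // assumed number of databases, for illustration only

        // Runs once per JVM, the first time the class is initialized.
        static {
            reloadFunctions();
        }

        // Simplified model: one call to list databases, then one call per database.
        static void reloadFunctions() {
            metastoreCalls.incrementAndGet();
            for (int i = 0; i < DATABASES; i++) {
                metastoreCalls.incrementAndGet();
            }
        }

        // Any static method call triggers class initialization on first use.
        static void touch() {
        }
    }

    // Rough estimate under the simplified model: each executor JVM pays the cost once.
    static long estimatedCalls(int executors, int databases) {
        return (long) executors * (databases + 1);
    }

    public static void main(String[] args) {
        System.out.println(metastoreCalls.get());      // 0: FakeHive is not initialized yet
        FakeHive.touch();
        FakeHive.touch();                              // second use does not re-run the static block
        System.out.println(metastoreCalls.get());      // 201
        System.out.println(estimatedCalls(1000, 200)); // 201000
    }
}
```

Under these assumptions, a job spread across 1000 executor JVMs would issue on the order of 200,000 metastore calls just from class initialization, which is consistent with the kind of overload (or, with a rate-limited backend, throttling) reported in this issue.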


