Posted to user@spark.apache.org by Nirav Patel <np...@xactlycorp.com> on 2020/01/30 18:02:44 UTC

Spark 2.4 and Hive 2.3 - Performance issue with concurrent hive DDL queries

Hi,

I am trying to run thousands of parquet-partition update operations on
different Hive tables in parallel from my client application. I am using
Spark SQL in local mode, with Hive support enabled, to submit the Hive
queries. Spark is run in local mode because all of these operations are
simple DDL queries, so we don't want to use cluster resources for them.

The Spark Hive config property is:

hive.metastore.uris=thrift://hivebox:9083
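
For context, the session is created roughly like this (a simplified sketch;
the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

// Rough sketch of the local-mode session with Hive support enabled.
// "partition-updater" is an illustrative app name.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partition-updater")
  .config("hive.metastore.uris", "thrift://hivebox:9083")
  .enableHiveSupport()
  .getOrCreate()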

An example SQL query that we want to execute in parallel:

spark.sql(" ALTER TABLE mytable PARTITION (a=3, b=3) SET LOCATION
'/newdata/mytable/a=3/b=3/part.parquet")
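
Roughly how these statements are submitted in parallel (a simplified sketch;
the table/partition values and pool size are illustrative):

import java.util.concurrent.ForkJoinPool
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Fork-join-backed execution context; pool size is illustrative.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(new ForkJoinPool(32))

// (table, a, b) triples to update; values here are illustrative.
val updates = Seq(("mytable", 3, 3), ("mytable2", 1, 2))

// One ALTER TABLE per partition, each submitted from a pool thread.
val futures = updates.map { case (table, a, b) =>
  Future {
    spark.sql(
      s"ALTER TABLE $table PARTITION (a=$a, b=$b) " +
      s"SET LOCATION '/newdata/$table/a=$a/b=$b/part.parquet'")
  }
}

Await.result(Future.sequence(futures), Duration.Inf)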


I can see that the queries are submitted via different threads from my
fork-join pool, but I couldn't scale this operation no matter how I tweaked
the thread pool. I then started watching the Hive metastore logs, and I see
that only one thread is doing all the work:

2020-01-29T16:27:15,638  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable1
2020-01-29T16:27:15,638  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable1
2020-01-29T16:27:15,653  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,653  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,655  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable2
2020-01-29T16:27:15,656  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable2
2020-01-29T16:27:15,670  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,670  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,672  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable3
2020-01-29T16:27:15,672  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable3

All actions are performed by a single thread, pool-6-thread-163. I have
scanned hundreds of lines and it is always the same thread. I don't see much
in the hiveserver.log file either.

Is the work bound to the client IP, since every log record shows
source:10.250.70.14? That would make sense, as I am submitting all the jobs
from a single machine. If that's the case, how do I scale this? Am I missing
some configuration, or is there an issue with how Hive handles connections
from the Spark client?

I know a workaround could be to run my application on a cluster, in which
case the queries would be submitted by different client machines (worker
nodes), but we really want to keep Spark in local mode.

Thanks,

Nirav
