Posted to user@hive.apache.org by Nirav Patel <np...@xactlycorp.com> on 2020/01/30 01:04:45 UTC
Performance issue with hive metastore
Hi,
I am trying to do thousands of update-parquet-partition operations on
different Hive tables in parallel from my client application. I am using
Spark SQL with Hive support enabled in my application to submit the Hive
queries.
spark.sql("ALTER TABLE mytable PARTITION (a=3, b=3) SET LOCATION
'/newdata/mytable/a=3/b=3/part.parquet'")
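For context, the client-side fan-out looks roughly like this. This is a
minimal sketch, not our real code: `submit` stands in for `spark.sql`, and
the table name and partition values are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def alter_statement(table, a, b, location):
    # Build one ALTER TABLE ... SET LOCATION statement per partition.
    return (f"ALTER TABLE {table} PARTITION (a={a}, b={b}) "
            f"SET LOCATION '{location}'")

def run_in_parallel(statements, submit, max_workers=16):
    # Fan the statements out over a thread pool; in the real application
    # `submit` would be spark.sql.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(submit, statements))

statements = [
    alter_statement("mytable", a, b, f"/newdata/mytable/a={a}/b={b}/part.parquet")
    for a in range(10) for b in range(10)
]
# Stub submit just echoes the statement so the sketch is self-contained.
results = run_in_parallel(statements, submit=lambda s: s)
```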
I can see that all the queries are submitted via different threads from my
fork-join pool, but I couldn't scale this operation no matter how I tweaked
the thread pool. Then I started observing the Hive metastore logs, and I see
that only one thread is making all the writes.
2020-01-29T16:27:15,638 INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable1
2020-01-29T16:27:15,638 INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable1
2020-01-29T16:27:15,653 INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,653 INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,655 INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable2
2020-01-29T16:27:15,656 INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable2
2020-01-29T16:27:15,670 INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,670 INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,672 INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable3
2020-01-29T16:27:15,672 INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable3
All actions are performed by only one thread, pool-6-thread-163. I have
scanned hundreds of lines and it is always the same thread. I don't see much
logging in the hiveserver log file either.
I see the following default values in the Hive documentation:
hive.metastore.server.min.threads  Default Value: 200
hive.metastore.server.max.threads  Default Value: 100000
which should be more than enough, so why is just one thread doing all the
work? Is it bound to the client IP? That would make sense, since I am
submitting all the jobs from a single machine.
Am I missing any configuration, or is there an issue with this approach on
my application side?
Thanks,
Nirav
Re: Performance issue with hive metastore
Posted by Peter Vary <pv...@cloudera.com>.
Hi Nirav,
I am not sure how Spark uses Hive.
If the ALTER TABLE SQL is issued through Hive, then Spark is not connecting directly to the HMS; it connects to HS2 instead. If it is using only the HMS URI, then the SQL is translated inside Spark, and only metastore calls are sent to the HMS. This is Spark internals, and I would suggest asking on the Spark mailing list (sadly, I do not know enough about Spark to answer your question here).
If you are using JDBC to connect to Hive, then every JDBC connection has its own session, so you would be able to alter the partitions in parallel.
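To illustrate the connection-per-session point, one possible client-side
pattern is to give each worker thread its own connection. This is a
hypothetical sketch: `connect` is a placeholder factory for whatever client
is used (a JDBC or Thrift connection), not a real Hive API, and
`StubConnection` exists only to make the example self-contained.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class PerThreadConnection:
    """Lazily open one connection per worker thread, so statements from
    different threads do not serialize on a single shared client."""

    def __init__(self, connect):
        self._connect = connect          # placeholder connection factory
        self._local = threading.local()  # one connection slot per thread

    def execute(self, statement):
        if not hasattr(self._local, "conn"):
            self._local.conn = self._connect()
        return self._local.conn.execute(statement)

class StubConnection:
    # Records which connection object executed each statement.
    def execute(self, statement):
        return (id(self), statement)

client = PerThreadConnection(StubConnection)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(client.execute, [f"stmt-{i}" for i in range(8)]))
```

With a real client the factory would open a fresh session per thread, which
is what lets the server fan the work out instead of funneling everything
through one session.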
Thanks,
Peter
> On Jan 30, 2020, at 18:56, Nirav Patel <np...@xactlycorp.com> wrote:
Re: Performance issue with hive metastore
Posted by Nirav Patel <np...@xactlycorp.com>.
Thanks for responding, Peter.
It indeed seems like one session per client (we can see source:10.250.70.14
in every log record). I don't create a session with the Hive thrift server;
Spark basically requires the property "hive.metastore.uris" in the Spark
config, which we set to "thrift://hivebox:9083". So we don't have direct
control over how many sessions (connections) we create.
In our current use case, we run Spark in local mode, so there is basically
only one client IP. If we ran Spark in cluster mode, we could potentially
have multiple cluster nodes submitting queries to Hive, and hence it might
scale, but the latter is a workaround; we don't want to run Spark in cluster
mode in this particular case.
I think Spark and Hive should scale by using multiple threads even when only
one client machine and one server are involved. Do you know of any other
Spark config that can help with this?
Do you think that if I use Hive JDBC instead of Spark to submit these
queries in parallel, they will be executed by Hive using multiple threads?
Thanks,
Nirav
On Thu, Jan 30, 2020 at 1:07 AM Peter Vary <pv...@cloudera.com> wrote:
Re: Performance issue with hive metastore
Posted by Peter Vary <pv...@cloudera.com>.
Hi Nirav,
There are several configurations which could affect the number of parallel queries running in your environment, depending on your Hive version.
The Thrift client is not thread safe, and this causes a bottleneck in both the client-to-HS2 and the HS2-to-HMS communication.
Hive solves this by creating its own connections at the session level.
I am not sure what spark.sql does exactly, but my guess is that it reuses the HS2 connection, and with it the session. You might be able to increase your throughput by creating multiple connections.
Thanks,
Peter
> On Jan 30, 2020, at 02:04, Nirav Patel <np...@xactlycorp.com> wrote: