Posted to user@hive.apache.org by Prabhakar Reddy <pr...@gmail.com> on 2018/08/26 10:43:11 UTC

Improve performance of Analyze table compute statistics

Hello,

Are there any properties that I can set to improve the performance of the
ANALYZE TABLE ... COMPUTE STATISTICS statement? My data sits in S3, and I see
it's taking one second per file to read the schema of each file from S3.

2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
rows from s3://file_1
2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) -
Reader schema not provided -- using file schema

2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
rows from s3://file_2
2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) -
Reader schema not provided -- using file schema

It takes around 80 seconds for 76 files with total size of 23 GB.


2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: exec.Task (SessionState.java:printInfo(1111)) - Table
dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76,
numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing
command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0);
Time taken: 81.169 seconds
2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: ql.Driver (SessionState.java:printInfo(1111)) - OK
2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: CliDriver (SessionState.java:printInfo(1111)) - Time taken:
81.992 seconds

If I run the same command on only a few columns, the query runs 60%
faster. Is there any property that I can modify to reduce the time taken for
this read?

Regards
Prabhakar Reddy

Re: Improve performance of Analyze table compute statistics

Posted by Prabhakar Reddy <pr...@gmail.com>.
Thank you, Gopal, for this information. Currently I am using EMR to run this
query. As this operation is CPU-intensive, could you please let me know
whether increasing the RAM/cores can speed up this process?

On Tue, Aug 28, 2018 at 8:56 PM Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
> > Will it be referring to the ORC metadata, or will it be loading the whole
> > file and then counting the rows?
>
> Depends on the partial-scan setting, or on whether it is computing full
> column stats (full column stats compute nDV, the number of distinct
> values, which requires reading all rows).
>
> hive> analyze table compute statistics ... partialscan;
>
> https://issues.apache.org/jira/browse/HIVE-4177
>
> AFAIK, this got removed in Hive 3.x (because we really want autogather
> column stats on insert, not just basic stats from this).
>
> > Is there any place to cache this information so that I don't need to
> > scan all the files every time?
>
> https://cwiki.apache.org/confluence/display/Hive/LLAP
>
> Cheers,
> Gopal

Re: Improve performance of Analyze table compute statistics

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Will it be referring to the ORC metadata, or will it be loading the whole file and then counting the rows?

Depends on the partial-scan setting, or on whether it is computing full column stats (full column stats compute nDV, the number of distinct values, which requires reading all rows).

hive> analyze table compute statistics ... partialscan;

https://issues.apache.org/jira/browse/HIVE-4177

AFAIK, this got removed in Hive 3.x (because we really want autogather column stats on insert, not just basic stats from this).
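For reference, the statement variants and autogather properties discussed above look roughly like this (a sketch only; `PARTIALSCAN`/`NOSCAN` availability and the exact property names vary by Hive release, so please verify against your version's documentation):

```sql
-- Basic stats without scanning row data (numFiles/totalSize only):
ANALYZE TABLE dept_data_services.lc_credit_2018_08_20_temp COMPUTE STATISTICS NOSCAN;

-- Partial scan, reads ORC metadata rather than all rows (removed in Hive 3.x):
ANALYZE TABLE dept_data_services.lc_credit_2018_08_20_temp COMPUTE STATISTICS PARTIALSCAN;

-- Or let Hive gather stats automatically on INSERT instead of a separate ANALYZE:
SET hive.stats.autogather=true;          -- basic stats on insert
SET hive.stats.column.autogather=true;   -- column stats on insert (Hive 2.1+)
```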

> Is there any place to cache this information so that I don't need to scan all the files every time?

https://cwiki.apache.org/confluence/display/Hive/LLAP
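With LLAP, it is the IO layer's cache (including ORC footers/metadata) that avoids re-reading files from S3. Enabling it is roughly the following (a sketch, assuming LLAP daemons are already deployed; check the property names for your version):

```sql
SET hive.llap.io.enabled=true;    -- route reads through the LLAP IO layer and its cache
SET hive.execution.mode=llap;     -- run query fragments inside the LLAP daemons
```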

Cheers,
Gopal



Re: Improve performance of Analyze table compute statistics

Posted by Prabhakar Reddy <pr...@gmail.com>.
Yeah, partition-level statistics are good. I see the Hive ORC reader reading
rows from S3 for each file in the HiveServer log. Will it be referring to the
ORC metadata, or will it be loading the whole file and then counting the
rows? Is there any place to cache this information so that I don't need to
scan all the files every time?

On Sun, Aug 26, 2018, 4:24 PM Jörn Franke <jo...@gmail.com> wrote:

> You can partition it and only compute statistics for new partitions...
>
> On 26. Aug 2018, at 12:43, Prabhakar Reddy <pr...@gmail.com> wrote:
>
> Hello,
>
> Are there any properties that I can set to improve the performance of the
> ANALYZE TABLE ... COMPUTE STATISTICS statement? My data sits in S3, and I see
> it's taking one second per file to read the schema of each file from S3.
>
> 2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
> rows from s3://file_1
> 2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) -
> Reader schema not provided -- using file schema
>
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
> rows from s3://file_2
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) -
> Reader schema not provided -- using file schema
>
> It takes around 80 seconds for 76 files with total size of 23 GB.
>
>
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: exec.Task (SessionState.java:printInfo(1111)) - Table
> dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76,
> numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing
> command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0);
> Time taken: 81.169 seconds
> 2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: ql.Driver (SessionState.java:printInfo(1111)) - OK
> 2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
> main([])]: CliDriver (SessionState.java:printInfo(1111)) - Time taken:
> 81.992 seconds
>
> If I run the same command on only a few columns, the query runs 60%
> faster. Is there any property that I can modify to reduce the time taken for
> this read?
>
> Regards
> Prabhakar Reddy
>

Re: Improve performance of Analyze table compute statistics

Posted by Jörn Franke <jo...@gmail.com>.
You can partition it and only compute statistics for new partitions...
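Per-partition stats gathering might look like this (a sketch; the table name, the partition column `ds`, and its value are placeholders for whatever the table is actually partitioned by):

```sql
-- Compute basic stats for just the newly loaded partition:
ANALYZE TABLE lc_credit PARTITION (ds='2018-08-20') COMPUTE STATISTICS;

-- Optionally, column stats for that partition as well:
ANALYZE TABLE lc_credit PARTITION (ds='2018-08-20') COMPUTE STATISTICS FOR COLUMNS;
```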

> On 26. Aug 2018, at 12:43, Prabhakar Reddy <pr...@gmail.com> wrote:
> 
> Hello,
> 
> Are there any properties that I can set to improve the performance of the ANALYZE TABLE ... COMPUTE STATISTICS statement? My data sits in S3, and I see it's taking one second per file to read the schema of each file from S3.
> 
> 2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC rows from s3://file_1
> 2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) - Reader schema not provided -- using file schema 
> 
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC rows from s3://file_2
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) - Reader schema not provided -- using file schema
> 
> It takes around 80 seconds for 76 files with total size of 23 GB.
> 
> 
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: exec.Task (SessionState.java:printInfo(1111)) - Table dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76, numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0); Time taken: 81.169 seconds
> 2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: ql.Driver (SessionState.java:printInfo(1111)) - OK
> 2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 main([])]: CliDriver (SessionState.java:printInfo(1111)) - Time taken: 81.992 seconds
> 
> If I run the same command on only a few columns, the query runs 60% faster. Is there any property that I can modify to reduce the time taken for this read?
> 
> Regards
> Prabhakar Reddy