You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sowjanya Kakarala <so...@agrible.com> on 2018/07/23 15:01:26 UTC

Performance in hive fetch

Hi Guys,

I am trying to fetch the data from hive through python code based on dates
and id.

For fetch of 20days till current day for 7tables together it is taking
30seconds.
for fetching for an year worth data for 7tables together it is taking
3minutes26seconds.
My tables are stored as orc and transactional set to true.

so our goal is to make the fetch for an year data within a second or
2seconds.

I have tried it two ways:
1.  cursor.execute("set hive.support.concurrency=true")
     cursor.execute("set hive.exec.dynamic.partition.mode=nonstrict")
     cursor.execute("SET hive.exec.parallel=true")
     cursor.execute("set tez.grouping.split-count=85")

2.  cursor.execute("set hive.fetch.task.conversion=more")

either way is performing the same, is there any better way for reaching our
goal?

Any help is appreciable.


Thanks
Sowjanya

Re: Performance in hive fetch

Posted by Sowjanya Kakarala <so...@agrible.com>.
Hi Shawn,

How long is it taking to run the actual query if you create a temp table or
something with the result?
    the time I mentioned was only for the actual queries run.

How many rows are returned?
    as I mentioned depends on the dates we give if its 20days then 20
records based on the id. If one year then 365/366 days as per id for each
table(we have 7tables and fetch all together).

Need to narrow down if it’s the fetch taken a while or the actual query.
     so to narrow down the actual query, its lil complicated as our data is
weather data and based on dates available in actual tables we have to fetch
from actual table and to fetch the rest we have forecast table to fetch
from.

The queries are queried as follows:

These are seven tables until the dates available in actual tables:


           SELECT data FROM db.tbl1  WHERE cell_id = 637448

                 and time_stamp between  '2018-07-01' and '2018-07-20'



               &

            SELECT data FROM db.tbl2_deg WHERE cell_id = 2466

                 and time_stamp between  '2018-07-01' and '2018-07-20'



               &

            SELECT data FROM db.tbl3 WHERE cell_id = 2466

                 and time_stamp between  '2018-07-01' and '2018-07-20'



               &

            SELECT data FROM db.tbl4 WHERE cell_id = 24360

                 and time_stamp between  '2018-07-01' and '2018-07-20'



               &

            SELECT data FROM db.tbl5  WHERE cell_id = 24360

                 and time_stamp between  '2018-07-01' and '2018-07-20'



               &

            SELECT data FROM db.tbl6  WHERE cell_id = 24360

                 and time_stamp between  '2018-07-01' and '2018-07-20'



               &

            SELECT data FROM db.tbl7  WHERE cell_id = 24360

                 and time_stamp between  '2018-07-01' and '2018-07-20'



 lets say the dates available from actual tables are:

{'tbl6': '2018-07-15', 'tbl5': '2018-07-15', 'tbl4': '2018-07-15', 'tbl3':
'2018-07-15', 'tbl2': '2018-07-20', 'tbl7': '2018-07-22', 'tbl1':
'2018-07-20'}


hence the code generates forecast queries based on the dates above:



           SELECT a[0] FROM db.forecast WHERE cell_id = 9341 AND

           run_date between  '2018-07-23' and '2018-07-20'  and init_hour =
0

        &

           SELECT b[0] FROM db.forecast WHERE cell_id = 9341 AND

           run_date between '2018-07-21' and '2018-07-20' and init_hour = 0

        &

           SELECT c[0] FROM db.forecast WHERE cell_id = 9341 AND

           run_date between '2018-07-21' and '2018-07-20' and init_hour = 0

        &

           SELECT d[0] FROM db.forecast WHERE cell_id = 9341 AND

           run_date between  '2018-07-16' and '2018-07-20'  and init_hour =
0

        &

           SELECT e[0] FROM db.forecast WHERE cell_id = 9341 AND

           run_date between  '2018-07-16' and '2018-07-20'  and init_hour =
0

        &

           SELECT f[0] FROM db.forecast WHERE cell_id = 9341 AND

           run_date between  '2018-07-16' and '2018-07-20'  and init_hour =
0

        &

           SELECT g [0] FROM db.forecast WHERE cell_id = 9341 AND

           run_date between  '2018-07-16' and '2018-07-20'  and init_hour =
0



On Mon, Jul 23, 2018 at 10:07 AM, Shawn Weeks <sw...@weeksconsulting.us>
wrote:

> How long is it taking to run the actual query if you create a temp table
> or something with the result? How many rows are returned? Need to narrow
> down if it’s the fetch taken a while or the actual query.
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From:* Sowjanya Kakarala <so...@agrible.com>
> *Sent:* Monday, July 23, 2018 10:01 AM
> *To:* user@hive.apache.org
> *Subject:* Performance in hive fetch
>
>
>
> Hi Guys,
>
>
>
> I am trying to fetch the data from hive through python code based on dates
> and id.
>
>
>
> For fetch of 20days till current day for 7tables together it is taking
> 30seconds.
>
> for fetching for an year worth data for 7tables together it is taking
> 3minutes26seconds.
>
> My tables are stored as orc and transactional set to true.
>
>
>
> so our goal is to make the fetch for an year data within a second or
> 2seconds.
>
>
>
> I have tried it two ways:
>
> 1.  cursor.execute("set hive.support.concurrency=true")
>
>      cursor.execute("set hive.exec.dynamic.partition.mode=nonstrict")
>
>      cursor.execute("SET hive.exec.parallel=true")
>
>      cursor.execute("set tez.grouping.split-count=85")
>
>
>
> 2.  cursor.execute("set hive.fetch.task.conversion=more")
>
>
>
> either way is performing the same, is there any better way for reaching
> our goal?
>
>
>
> Any help is appreciable.
>
>
>
>
>
> Thanks
>
> Sowjanya
>
>
>

RE: Performance in hive fetch

Posted by Shawn Weeks <sw...@weeksconsulting.us>.
How long is it taking to run the actual query if you create a temp table or something with the result? How many rows are returned? Need to narrow down if it’s the fetch taken a while or the actual query.

Thanks
Shawn

From: Sowjanya Kakarala <so...@agrible.com>
Sent: Monday, July 23, 2018 10:01 AM
To: user@hive.apache.org
Subject: Performance in hive fetch

Hi Guys,

I am trying to fetch the data from hive through python code based on dates and id.

For fetch of 20days till current day for 7tables together it is taking 30seconds.
for fetching for an year worth data for 7tables together it is taking 3minutes26seconds.
My tables are stored as orc and transactional set to true.

so our goal is to make the fetch for an year data within a second or 2seconds.

I have tried it two ways:
1.  cursor.execute("set hive.support.concurrency=true")
     cursor.execute("set hive.exec.dynamic.partition.mode=nonstrict")
     cursor.execute("SET hive.exec.parallel=true")
     cursor.execute("set tez.grouping.split-count=85")

2.  cursor.execute("set hive.fetch.task.conversion=more")

either way is performing the same, is there any better way for reaching our goal?

Any help is appreciable.


Thanks
Sowjanya