You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Tale Firefly <ta...@gmail.com> on 2016/03/21 15:26:36 UTC

Mechanism when doing a select *

Hello guys !

I'm trying to understand the mechanism for a simple query select * from
my_table when using HiveServer2.

I'm using the hortonworks ODBC Driver for HiveServer2.
I just do a select * from my_table.
my_table is an ORC table based on files divised into blocks located on all
my datanodes.
I have 50 datanodes.

My question is the following :
Does all the data go from the datanodes to the node hosting the hiveserver2
before coming back to my client ?
Or does all the data go directly from the datanodes to my client ?

Hope you can help me o/

Thank you

Tale

Re: Mechanism when doing a select *

Posted by Tale Firefly <ta...@gmail.com>.

Hello guys !

Just a quick thank you again for your answer on this topic !

I noticed that when a job is performed (if the table is bigger than
hive.fetch.task.conversion.threshold), then it seems that temporary files
are created in HDFS (in /tmp). If I understood well, the select * is
divided into subsets and each of this subset is performed inside a mapper,
right ?

And then, once all these subsets are completed successfully, the global
result is sent to the client through the HiveServer2 but where it is stored
? In the RAM of HS2 ? If I perform a select * on a table far bigger than
the treshhold, is there a risk my HS2 crash because the result is too big ?

Best regards.

Tale

On Tue, Mar 22, 2016 at 10:21 AM, Tale Firefly <ta...@gmail.com> wrote:

> Hello everyone.
>
> Thanks for your answers.
>
> I'm gonna test this.
>
> Best regards.
>
> Tale
>
>
>
> On Mon, Mar 21, 2016 at 10:06 PM, Prasanth Jayachandran <
> pjayachandran@hortonworks.com> wrote:
>
>> Hi
>>
>> Simple select * query launches a job when the input size is >1Gb by
>> default. Two configs that determines if a job has to be launched
>>
>> hive.fetch.task.conversion
>> hive.fetch.task.conversion.threshold
>>
>> Is your table size >1GB (hive.fetch.task.conversion.threshold)? You can
>> see that from “describe formatted tablename”.
>>
>> Thanks
>> Prasanth
>>
>> On Mar 21, 2016, at 11:16 AM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> You are correct. it  should not. There is nothing to optimise here.
>>
>> 0: jdbc:hive2://rhes564:10010/default>
>> *select * from countries; *OK
>> INFO  : Compiling
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
>> select * from countries
>> INFO  : Semantic Analysis Completed
>> INFO  : Returning Hive schema:
>> Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double,
>> comment:null), FieldSchema(name:countries.country_iso_code, type:string,
>> comment:null), FieldSchema(name:countries.country_name, type:string,
>> comment:null), FieldSchema(name:countries.country_subregion, type:string,
>> comment:null), FieldSchema(name:countries.country_subregion_id,
>> type:double, comment:null), FieldSchema(name:countries.country_region,
>> type:string, comment:null), FieldSchema(name:countries.country_region_id,
>> type:double, comment:null), FieldSchema(name:countries.country_total,
>> type:string, comment:null), FieldSchema(name:countries.country_total_id,
>> type:double, comment:null), FieldSchema(name:countries.country_name_hist,
>> type:string, comment:null)], properties:null)
>> INFO  : Completed compiling
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
>> Time taken: 0.047 seconds
>> INFO  : Executing
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
>> select * from countries
>> INFO  : Completed executing
>> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
>> Time taken: 0.001 seconds
>> INFO  : OK
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 15:56, Tale Firefly <ta...@gmail.com> wrote:
>>
>>> Hm, I need to check if statistics are enabled for this table and
>>> up-to-date.
>>> I'm going to check this.
>>>
>>> I don't know if I was clear in my previous statement, but I am surprised
>>> that a job is launched just by doing a select * from my_table.
>>> I thought a select * from my_table was not running any MR jobs.
>>>
>>> Best regards.
>>>
>>> Tale.
>>>
>>> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Well I use Spark as engine.
>>>>
>>>> Now the question is have you updated statistics on ORC table?
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 March 2016 at 15:32, Tale Firefly <ta...@gmail.com> wrote:
>>>>
>>>>> Re.
>>>>>
>>>>> Ty ty for your answer.
>>>>>
>>>>> I'm using Tez as execution engine for this query.
>>>>> And it launches a job to yarn.
>>>>>
>>>>> Do you know why it launches a job just for a select when I use Tez as
>>>>> execution engine ?
>>>>>
>>>>> BR.
>>>>>
>>>>> Tale
>>>>>
>>>>>
>>>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Your query is a table level query  that covers all rows in the table.
>>>>>>
>>>>>> Using ODBC you are connecting to Hive server 2 that runs on a given
>>>>>> port.
>>>>>>
>>>>>> Depending on the version of Hive you are running Hive under the
>>>>>> bonnet is most likely using Map-Reduce as the execution engine.
>>>>>>
>>>>>> Data has to be collected from all blocks that hold data for this
>>>>>> table. The underlying ORC stats can only act at table level as there is no
>>>>>> predicate push down and data has to be sent to ODBC driver through the
>>>>>> network.
>>>>>>
>>>>>> The ODBC driver can only communicate with Hive server 2 so there is
>>>>>> no connectivity to individual nodes from your client.
>>>>>>
>>>>>> So in summary Hive server 2 collects data from all blocks and
>>>>>> forwards it to the client. The actual collection and filtering of result
>>>>>> set in SQL query will depend on many factors.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello guys !
>>>>>>>
>>>>>>> I'm trying to understand the mechanism for a simple query select *
>>>>>>> from my_table when using HiveServer2.
>>>>>>>
>>>>>>> I'm using the hortonworks ODBC Driver for HiveServer2.
>>>>>>> I just do a select * from my_table.
>>>>>>> my_table is an ORC table based on files divised into blocks located
>>>>>>> on all my datanodes.
>>>>>>> I have 50 datanodes.
>>>>>>>
>>>>>>> My question is the following :
>>>>>>> Does all the data go from the datanodes to the node hosting the
>>>>>>> hiveserver2 before coming back to my client ?
>>>>>>> Or does all the data go directly from the datanodes to my client ?
>>>>>>>
>>>>>>> Hope you can help me o/
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Tale
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Re: Mechanism when doing a select *

Posted by Tale Firefly <ta...@gmail.com>.

Hello everyone.

Thanks for your answers.

I'm gonna test this.

Best regards.

Tale



On Mon, Mar 21, 2016 at 10:06 PM, Prasanth Jayachandran <
pjayachandran@hortonworks.com> wrote:

> Hi
>
> Simple select * query launches a job when the input size is >1Gb by
> default. Two configs that determines if a job has to be launched
>
> hive.fetch.task.conversion
> hive.fetch.task.conversion.threshold
>
> Is your table size >1GB (hive.fetch.task.conversion.threshold)? You can
> see that from “describe formatted tablename”.
>
> Thanks
> Prasanth
>
> On Mar 21, 2016, at 11:16 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> You are correct. it  should not. There is nothing to optimise here.
>
> 0: jdbc:hive2://rhes564:10010/default>
> *select * from countries; *OK
> INFO  : Compiling
> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
> select * from countries
> INFO  : Semantic Analysis Completed
> INFO  : Returning Hive schema:
> Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double,
> comment:null), FieldSchema(name:countries.country_iso_code, type:string,
> comment:null), FieldSchema(name:countries.country_name, type:string,
> comment:null), FieldSchema(name:countries.country_subregion, type:string,
> comment:null), FieldSchema(name:countries.country_subregion_id,
> type:double, comment:null), FieldSchema(name:countries.country_region,
> type:string, comment:null), FieldSchema(name:countries.country_region_id,
> type:double, comment:null), FieldSchema(name:countries.country_total,
> type:string, comment:null), FieldSchema(name:countries.country_total_id,
> type:double, comment:null), FieldSchema(name:countries.country_name_hist,
> type:string, comment:null)], properties:null)
> INFO  : Completed compiling
> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
> Time taken: 0.047 seconds
> INFO  : Executing
> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
> select * from countries
> INFO  : Completed executing
> command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
> Time taken: 0.001 seconds
> INFO  : OK
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 March 2016 at 15:56, Tale Firefly <ta...@gmail.com> wrote:
>
>> Hm, I need to check if statistics are enabled for this table and
>> up-to-date.
>> I'm going to check this.
>>
>> I don't know if I was clear in my previous statement, but I am surprised
>> that a job is launched just by doing a select * from my_table.
>> I thought a select * from my_table was not running any MR jobs.
>>
>> Best regards.
>>
>> Tale.
>>
>> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Well I use Spark as engine.
>>>
>>> Now the question is have you updated statistics on ORC table?
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 21 March 2016 at 15:32, Tale Firefly <ta...@gmail.com> wrote:
>>>
>>>> Re.
>>>>
>>>> Ty ty for your answer.
>>>>
>>>> I'm using Tez as execution engine for this query.
>>>> And it launches a job to yarn.
>>>>
>>>> Do you know why it launches a job just for a select when I use Tez as
>>>> execution engine ?
>>>>
>>>> BR.
>>>>
>>>> Tale
>>>>
>>>>
>>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Your query is a table level query  that covers all rows in the table.
>>>>>
>>>>> Using ODBC you are connecting to Hive server 2 that runs on a given
>>>>> port.
>>>>>
>>>>> Depending on the version of Hive you are running Hive under the
>>>>> bonnet is most likely using Map-Reduce as the execution engine.
>>>>>
>>>>> Data has to be collected from all blocks that hold data for this
>>>>> table. The underlying ORC stats can only act at table level as there is no
>>>>> predicate push down and data has to be sent to ODBC driver through the
>>>>> network.
>>>>>
>>>>> The ODBC driver can only communicate with Hive server 2 so there is no
>>>>> connectivity to individual nodes from your client.
>>>>>
>>>>> So in summary Hive server 2 collects data from all blocks and forwards
>>>>> it to the client. The actual collection and filtering of result set in SQL
>>>>> query will depend on many factors.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:
>>>>>
>>>>>> Hello guys !
>>>>>>
>>>>>> I'm trying to understand the mechanism for a simple query select *
>>>>>> from my_table when using HiveServer2.
>>>>>>
>>>>>> I'm using the hortonworks ODBC Driver for HiveServer2.
>>>>>> I just do a select * from my_table.
>>>>>> my_table is an ORC table based on files divised into blocks located
>>>>>> on all my datanodes.
>>>>>> I have 50 datanodes.
>>>>>>
>>>>>> My question is the following :
>>>>>> Does all the data go from the datanodes to the node hosting the
>>>>>> hiveserver2 before coming back to my client ?
>>>>>> Or does all the data go directly from the datanodes to my client ?
>>>>>>
>>>>>> Hope you can help me o/
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Tale
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>

Re: Mechanism when doing a select *

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.

Simple select * query launches a job when the input size is >1Gb by default. Two configs that determines if a job has to be launched

hive.fetch.task.conversion
hive.fetch.task.conversion.threshold

Is your table size >1GB (hive.fetch.task.conversion.threshold)? You can see that from “describe formatted tablename”.

Thanks
Prasanth

On Mar 21, 2016, at 11:16 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

You are correct. it should not. There is nothing to optimise here.

0: jdbc:hive2://rhes564:10010/default> select * from countries;
OK
INFO : Compiling command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318): select * from countries
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double, comment:null), FieldSchema(name:countries.country_iso_code, type:string, comment:null), FieldSchema(name:countries.country_name, type:string, comment:null), FieldSchema(name:countries.country_subregion, type:string, comment:null), FieldSchema(name:countries.country_subregion_id, type:double, comment:null), FieldSchema(name:countries.country_region, type:string, comment:null), FieldSchema(name:countries.country_region_id, type:double, comment:null), FieldSchema(name:countries.country_total, type:string, comment:null), FieldSchema(name:countries.country_total_id, type:double, comment:null), FieldSchema(name:countries.country_name_hist, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318); Time taken: 0.047 seconds
INFO : Executing command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318): select * from countries
INFO : Completed executing command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318); Time taken: 0.001 seconds
INFO : OK

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

On 21 March 2016 at 15:56, Tale Firefly <ta...@gmail.com>> wrote:
Hm, I need to check if statistics are enabled for this table and up-to-date.
I'm going to check this.

I don't know if I was clear in my previous statement, but I am surprised that a job is launched just by doing a select * from my_table.
I thought a select * from my_table was not running any MR jobs.

Best regards.

Tale.

On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:
Well I use Spark as engine.

Now the question is have you updated statistics on ORC table?

HTH

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

On 21 March 2016 at 15:32, Tale Firefly <ta...@gmail.com>> wrote:
Re.

Ty ty for your answer.

I'm using Tez as execution engine for this query.
And it launches a job to yarn.

Do you know why it launches a job just for a select when I use Tez as execution engine ?

BR.

Tale

On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:
Hi,

Your query is a table level query that covers all rows in the table.

Using ODBC you are connecting to Hive server 2 that runs on a given port.

Depending on the version of Hive you are running Hive under the bonnet is most likely using Map-Reduce as the execution engine.

Data has to be collected from all blocks that hold data for this table. The underlying ORC stats can only act at table level as there is no predicate push down and data has to be sent to ODBC driver through the network.

The ODBC driver can only communicate with Hive server 2 so there is no connectivity to individual nodes from your client.

So in summary Hive server 2 collects data from all blocks and forwards it to the client. The actual collection and filtering of result set in SQL query will depend on many factors.

HTH

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com>> wrote:
Hello guys !

I'm trying to understand the mechanism for a simple query select * from my_table when using HiveServer2.

I'm using the hortonworks ODBC Driver for HiveServer2.
I just do a select * from my_table.
my_table is an ORC table based on files divised into blocks located on all my datanodes.
I have 50 datanodes.

My question is the following :
Does all the data go from the datanodes to the node hosting the hiveserver2 before coming back to my client ?
Or does all the data go directly from the datanodes to my client ?

Hope you can help me o/

Thank you

Tale

Re: Mechanism when doing a select *

Posted by Mich Talebzadeh <mi...@gmail.com>.

You are correct. it  should not. There is nothing to optimise here.

0: jdbc:hive2://rhes564:10010/default>
*select * from countries;*OK
INFO  : Compiling
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
select * from countries
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema:
Schema(fieldSchemas:[FieldSchema(name:countries.country_id, type:double,
comment:null), FieldSchema(name:countries.country_iso_code, type:string,
comment:null), FieldSchema(name:countries.country_name, type:string,
comment:null), FieldSchema(name:countries.country_subregion, type:string,
comment:null), FieldSchema(name:countries.country_subregion_id,
type:double, comment:null), FieldSchema(name:countries.country_region,
type:string, comment:null), FieldSchema(name:countries.country_region_id,
type:double, comment:null), FieldSchema(name:countries.country_total,
type:string, comment:null), FieldSchema(name:countries.country_total_id,
type:double, comment:null), FieldSchema(name:countries.country_name_hist,
type:string, comment:null)], properties:null)
INFO  : Completed compiling
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
Time taken: 0.047 seconds
INFO  : Executing
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318):
select * from countries
INFO  : Completed executing
command(queryId=hduser_20160321162726_7efeecbb-46ee-431f-9095-f67e0602b318);
Time taken: 0.001 seconds
INFO  : OK

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 21 March 2016 at 15:56, Tale Firefly <ta...@gmail.com> wrote:

> Hm, I need to check if statistics are enabled for this table and
> up-to-date.
> I'm going to check this.
>
> I don't know if I was clear in my previous statement, but I am surprised
> that a job is launched just by doing a select * from my_table.
> I thought a select * from my_table was not running any MR jobs.
>
> Best regards.
>
> Tale.
>
> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Well I use Spark as engine.
>>
>> Now the question is have you updated statistics on ORC table?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 15:32, Tale Firefly <ta...@gmail.com> wrote:
>>
>>> Re.
>>>
>>> Ty ty for your answer.
>>>
>>> I'm using Tez as execution engine for this query.
>>> And it launches a job to yarn.
>>>
>>> Do you know why it launches a job just for a select when I use Tez as
>>> execution engine ?
>>>
>>> BR.
>>>
>>> Tale
>>>
>>>
>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Your query is a table level query  that covers all rows in the table.
>>>>
>>>> Using ODBC you are connecting to Hive server 2 that runs on a given
>>>> port.
>>>>
>>>> Depending on the version of Hive you are running Hive under the
>>>> bonnet is most likely using Map-Reduce as the execution engine.
>>>>
>>>> Data has to be collected from all blocks that hold data for this table.
>>>> The underlying ORC stats can only act at table level as there is no
>>>> predicate push down and data has to be sent to ODBC driver through the
>>>> network.
>>>>
>>>> The ODBC driver can only communicate with Hive server 2 so there is no
>>>> connectivity to individual nodes from your client.
>>>>
>>>> So in summary Hive server 2 collects data from all blocks and forwards
>>>> it to the client. The actual collection and filtering of result set in SQL
>>>> query will depend on many factors.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:
>>>>
>>>>> Hello guys !
>>>>>
>>>>> I'm trying to understand the mechanism for a simple query select *
>>>>> from my_table when using HiveServer2.
>>>>>
>>>>> I'm using the hortonworks ODBC Driver for HiveServer2.
>>>>> I just do a select * from my_table.
>>>>> my_table is an ORC table based on files divised into blocks located on
>>>>> all my datanodes.
>>>>> I have 50 datanodes.
>>>>>
>>>>> My question is the following :
>>>>> Does all the data go from the datanodes to the node hosting the
>>>>> hiveserver2 before coming back to my client ?
>>>>> Or does all the data go directly from the datanodes to my client ?
>>>>>
>>>>> Hope you can help me o/
>>>>>
>>>>> Thank you
>>>>>
>>>>> Tale
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Mechanism when doing a select *

Posted by Tale Firefly <ta...@gmail.com>.

Oh my bad, even with the execution engine set to MR, my query turns into a
MR job.

I'm gonna make more tests with Hive CLI and beeline, and excel to check if
this behaviour is linked to the ODBC driver.

BR.

Tale.

On Mon, Mar 21, 2016 at 4:56 PM, Tale Firefly <ta...@gmail.com> wrote:

> Hm, I need to check if statistics are enabled for this table and
> up-to-date.
> I'm going to check this.
>
> I don't know if I was clear in my previous statement, but I am surprised
> that a job is launched just by doing a select * from my_table.
> I thought a select * from my_table was not running any MR jobs.
>
> Best regards.
>
> Tale.
>
> On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Well I use Spark as engine.
>>
>> Now the question is have you updated statistics on ORC table?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 15:32, Tale Firefly <ta...@gmail.com> wrote:
>>
>>> Re.
>>>
>>> Ty ty for your answer.
>>>
>>> I'm using Tez as execution engine for this query.
>>> And it launches a job to yarn.
>>>
>>> Do you know why it launches a job just for a select when I use Tez as
>>> execution engine ?
>>>
>>> BR.
>>>
>>> Tale
>>>
>>>
>>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Your query is a table level query  that covers all rows in the table.
>>>>
>>>> Using ODBC you are connecting to Hive server 2 that runs on a given
>>>> port.
>>>>
>>>> Depending on the version of Hive you are running Hive under the
>>>> bonnet is most likely using Map-Reduce as the execution engine.
>>>>
>>>> Data has to be collected from all blocks that hold data for this table.
>>>> The underlying ORC stats can only act at table level as there is no
>>>> predicate push down and data has to be sent to ODBC driver through the
>>>> network.
>>>>
>>>> The ODBC driver can only communicate with Hive server 2 so there is no
>>>> connectivity to individual nodes from your client.
>>>>
>>>> So in summary Hive server 2 collects data from all blocks and forwards
>>>> it to the client. The actual collection and filtering of result set in SQL
>>>> query will depend on many factors.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:
>>>>
>>>>> Hello guys !
>>>>>
>>>>> I'm trying to understand the mechanism for a simple query select *
>>>>> from my_table when using HiveServer2.
>>>>>
>>>>> I'm using the hortonworks ODBC Driver for HiveServer2.
>>>>> I just do a select * from my_table.
>>>>> my_table is an ORC table based on files divised into blocks located on
>>>>> all my datanodes.
>>>>> I have 50 datanodes.
>>>>>
>>>>> My question is the following :
>>>>> Does all the data go from the datanodes to the node hosting the
>>>>> hiveserver2 before coming back to my client ?
>>>>> Or does all the data go directly from the datanodes to my client ?
>>>>>
>>>>> Hope you can help me o/
>>>>>
>>>>> Thank you
>>>>>
>>>>> Tale
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Mechanism when doing a select *

Posted by Tale Firefly <ta...@gmail.com>.

Hm, I need to check if statistics are enabled for this table and up-to-date.
I'm going to check this.

I don't know if I was clear in my previous statement, but I am surprised
that a job is launched just by doing a select * from my_table.
I thought a select * from my_table was not running any MR jobs.

Best regards.

Tale.

On Mon, Mar 21, 2016 at 4:48 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Well I use Spark as engine.
>
> Now the question is have you updated statistics on ORC table?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 March 2016 at 15:32, Tale Firefly <ta...@gmail.com> wrote:
>
>> Re.
>>
>> Ty ty for your answer.
>>
>> I'm using Tez as execution engine for this query.
>> And it launches a job to yarn.
>>
>> Do you know why it launches a job just for a select when I use Tez as
>> execution engine ?
>>
>> BR.
>>
>> Tale
>>
>>
>> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Your query is a table level query  that covers all rows in the table.
>>>
>>> Using ODBC you are connecting to Hive server 2 that runs on a given port.
>>>
>>> Depending on the version of Hive you are running Hive under the
>>> bonnet is most likely using Map-Reduce as the execution engine.
>>>
>>> Data has to be collected from all blocks that hold data for this table.
>>> The underlying ORC stats can only act at table level as there is no
>>> predicate push down and data has to be sent to ODBC driver through the
>>> network.
>>>
>>> The ODBC driver can only communicate with Hive server 2 so there is no
>>> connectivity to individual nodes from your client.
>>>
>>> So in summary Hive server 2 collects data from all blocks and forwards
>>> it to the client. The actual collection and filtering of result set in SQL
>>> query will depend on many factors.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:
>>>
>>>> Hello guys !
>>>>
>>>> I'm trying to understand the mechanism for a simple query select * from
>>>> my_table when using HiveServer2.
>>>>
>>>> I'm using the hortonworks ODBC Driver for HiveServer2.
>>>> I just do a select * from my_table.
>>>> my_table is an ORC table based on files divised into blocks located on
>>>> all my datanodes.
>>>> I have 50 datanodes.
>>>>
>>>> My question is the following :
>>>> Does all the data go from the datanodes to the node hosting the
>>>> hiveserver2 before coming back to my client ?
>>>> Or does all the data go directly from the datanodes to my client ?
>>>>
>>>> Hope you can help me o/
>>>>
>>>> Thank you
>>>>
>>>> Tale
>>>>
>>>
>>>
>>
>

Re: Mechanism when doing a select *

Posted by Mich Talebzadeh <mi...@gmail.com>.

Well I use Spark as engine.

Now the question is have you updated statistics on ORC table?

HTH



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 21 March 2016 at 15:32, Tale Firefly <ta...@gmail.com> wrote:

> Re.
>
> Ty ty for your answer.
>
> I'm using Tez as execution engine for this query.
> And it launches a job to yarn.
>
> Do you know why it launches a job just for a select when I use Tez as
> execution engine ?
>
> BR.
>
> Tale
>
>
> On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Hi,
>>
>> Your query is a table level query  that covers all rows in the table.
>>
>> Using ODBC you are connecting to Hive server 2 that runs on a given port.
>>
>> Depending on the version of Hive you are running Hive under the bonnet is
>> most likely using Map-Reduce as the execution engine.
>>
>> Data has to be collected from all blocks that hold data for this table.
>> The underlying ORC stats can only act at table level as there is no
>> predicate push down and data has to be sent to ODBC driver through the
>> network.
>>
>> The ODBC driver can only communicate with Hive server 2 so there is no
>> connectivity to individual nodes from your client.
>>
>> So in summary Hive server 2 collects data from all blocks and forwards it
>> to the client. The actual collection and filtering of result set in SQL
>> query will depend on many factors.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:
>>
>>> Hello guys !
>>>
>>> I'm trying to understand the mechanism for a simple query select * from
>>> my_table when using HiveServer2.
>>>
>>> I'm using the hortonworks ODBC Driver for HiveServer2.
>>> I just do a select * from my_table.
>>> my_table is an ORC table based on files divised into blocks located on
>>> all my datanodes.
>>> I have 50 datanodes.
>>>
>>> My question is the following :
>>> Does all the data go from the datanodes to the node hosting the
>>> hiveserver2 before coming back to my client ?
>>> Or does all the data go directly from the datanodes to my client ?
>>>
>>> Hope you can help me o/
>>>
>>> Thank you
>>>
>>> Tale
>>>
>>
>>
>

Re: Mechanism when doing a select *

Posted by Tale Firefly <ta...@gmail.com>.

Re.

Ty ty for your answer.

I'm using Tez as execution engine for this query.
And it launches a job to yarn.

Do you know why it launches a job just for a select when I use Tez as
execution engine ?

BR.

Tale


On Mon, Mar 21, 2016 at 4:17 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> Your query is a table level query  that covers all rows in the table.
>
> Using ODBC you are connecting to Hive server 2 that runs on a given port.
>
> Depending on the version of Hive you are running Hive under the bonnet is
> most likely using Map-Reduce as the execution engine.
>
> Data has to be collected from all blocks that hold data for this table.
> The underlying ORC stats can only act at table level as there is no
> predicate push down and data has to be sent to ODBC driver through the
> network.
>
> The ODBC driver can only communicate with Hive server 2 so there is no
> connectivity to individual nodes from your client.
>
> So in summary Hive server 2 collects data from all blocks and forwards it
> to the client. The actual collection and filtering of result set in SQL
> query will depend on many factors.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:
>
>> Hello guys !
>>
>> I'm trying to understand the mechanism for a simple query select * from
>> my_table when using HiveServer2.
>>
>> I'm using the hortonworks ODBC Driver for HiveServer2.
>> I just do a select * from my_table.
>> my_table is an ORC table based on files divised into blocks located on
>> all my datanodes.
>> I have 50 datanodes.
>>
>> My question is the following :
>> Does all the data go from the datanodes to the node hosting the
>> hiveserver2 before coming back to my client ?
>> Or does all the data go directly from the datanodes to my client ?
>>
>> Hope you can help me o/
>>
>> Thank you
>>
>> Tale
>>
>
>

Re: Mechanism when doing a select *

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

Your query is a table level query  that covers all rows in the table.

Using ODBC you are connecting to Hive server 2 that runs on a given port.

Depending on the version of Hive you are running Hive under the bonnet is
most likely using Map-Reduce as the execution engine.

Data has to be collected from all blocks that hold data for this table. The
underlying ORC stats can only act at table level as there is no predicate
push down and data has to be sent to ODBC driver through the network.

The ODBC driver can only communicate with Hive server 2 so there is no
connectivity to individual nodes from your client.

So in summary Hive server 2 collects data from all blocks and forwards it
to the client. The actual collection and filtering of result set in SQL
query will depend on many factors.

HTH

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

http://talebzadehmich.wordpress.com

On 21 March 2016 at 14:26, Tale Firefly <ta...@gmail.com> wrote:

> Hello guys !
>
> I'm trying to understand the mechanism for a simple query select * from
> my_table when using HiveServer2.
>
> I'm using the hortonworks ODBC Driver for HiveServer2.
> I just do a select * from my_table.
> my_table is an ORC table based on files divised into blocks located on all
> my datanodes.
> I have 50 datanodes.
>
> My question is the following :
> Does all the data go from the datanodes to the node hosting the
> hiveserver2 before coming back to my client ?
> Or does all the data go directly from the datanodes to my client ?
>
> Hope you can help me o/
>
> Thank you
>
> Tale
>

Re: Mechanism when doing a select *

Posted by Gopal Vijayaraghavan <go...@apache.org>.

>> Or does all the data go directly from the datanodes to my client ?

Not yet.

https://issues.apache.org/jira/browse/HIVE-11527


Cheers,
Gopal

Re: Mechanism when doing a select *

Posted by Gopal Vijayaraghavan <go...@hortonworks.com>.

> Or does all the data go directly from the datanodes to my client ?

Not yet.

https://issues.apache.org/jira/browse/HIVE-11527


Cheers,
Gopal