Posted to user@hive.apache.org by John B <jo...@gmail.com> on 2012/02/15 16:59:09 UTC

parallel inserts ?

Other SQL databases typically can parallelize selects but are unable to
automatically parallelize inserts.

With the most recent stable HiveQL, will the following statement have
the --insert-- automatically parallelized?

 INSERT OVERWRITE TABLE pv_gender
 SELECT pv_users.gender
 FROM pv_users


I understand there is now 'insert into ..select from' syntax. Is the
insert part of that statement automatically parallelized?

What is the highest insert speed anybody has seen? I am not
talking about imports; I mean inserts from one table to another.

Re: parallel inserts ?

Posted by Edward Capriolo <ed...@gmail.com>.

Wow. This thread has some shelf life.

Inserts in Hive are not like inserts in most relational databases.
"INSERTING" into a table typically involves moving files into a
directory. So in this case the bulk of the work is in the selecting half
of the query. Hive knows the source and the destination metadata, so
the results are formatted for the destination as the job is being
processed.

So the answer to your question is "YES, THIS OPERATION IS
PARALLELIZED", because the query that creates the data to be moved is
parallelized.
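
One way to check this on your own install is to ask Hive for the plan
instead of running the statement. This is only a sketch: it assumes the
pv_users and pv_gender tables from the original question already exist,
and the exact stage names and numbering in the output vary between Hive
versions.

```sql
-- Print the execution plan without running the job.
-- Assumes pv_users and pv_gender exist, as in the original question.
EXPLAIN
INSERT OVERWRITE TABLE pv_gender
SELECT pv_users.gender
FROM pv_users;
```

If the plan shows a MapReduce stage that does the select and writes the
output files, followed by a move stage that places those files in the
table's directory, that matches the description above: the writing
happens inside the parallel tasks and the "insert" itself is a file move.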



On Wed, Apr 18, 2012 at 1:00 PM, John B <jo...@gmail.com> wrote:
> Thanks.
>
> I would like to use the Cloudera demo (VMware) VM to test the actual
> performance of this.
> https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM
>
> It seems to have only 2 vcores. What setup would get the best performance on
> such a Hive query, possibly with a more complicated select, maybe a join in
> the select?
> Should I set up multiple such demo VMs and connect them, or can I somehow
> increase the number of cores for that VM, and perhaps change other Hadoop
> settings? I'd prefer the second, so the parallel processes can communicate
> through shared memory on my 16-core machine rather than over likely slower
> vNICs.

Re: parallel inserts ?

Posted by John B <jo...@gmail.com>.

Thanks.

I would like to use the Cloudera demo (VMware) VM to test the actual
performance of this.
https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM

It seems to have only 2 vcores. What setup would get the best performance on
such a Hive query, possibly with a more complicated select, maybe a join in
the select?
Should I set up multiple such demo VMs and connect them, or can I somehow
increase the number of cores for that VM, and perhaps change other Hadoop
settings? I'd prefer the second, so the parallel processes can communicate
through shared memory on my 16-core machine rather than over likely slower
vNICs.

On Wed, Feb 15, 2012 at 11:19 AM, <be...@yahoo.com> wrote:

> Hi John
> Yes, inserts are parallel by default in Hive. HiveQL gets transformed into
> MapReduce jobs, so the work is definitely parallel. The only case where it
> is not parallel is when you have just 1 reducer. The MapReduce jobs read
> and process the input files in parallel from the source table's data dir
> and write the desired output files to the destination table's dir.
> Hive is just an abstraction over MapReduce and can't be compared against
> a db in terms of features. Almost every data processing operation is just
> some MapReduce jobs.
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.

Re: parallel inserts ?

Posted by be...@yahoo.com.

Hi John
Yes, inserts are parallel by default in Hive. HiveQL gets transformed into
MapReduce jobs, so the work is definitely parallel. The only case where it
is not parallel is when you have just 1 reducer. The MapReduce jobs read
and process the input files in parallel from the source table's data dir
and write the desired output files to the destination table's dir.
Hive is just an abstraction over MapReduce and can't be compared against
a db in terms of features. Almost every data processing operation is just
some MapReduce jobs.
Regards
Bejoy K S

From handheld, Please excuse typos.
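
For the single-reducer case mentioned above, here is a minimal sketch of
how the reducer count might be influenced for one session. The property
names are the Hadoop 1.x-era ones and may be named differently on newer
releases. The table pv_gender_sum, its two columns, and the values 4 and
256000000 are hypothetical, chosen only for illustration.

```sql
-- Option 1: fix the number of reducers for this session.
SET mapred.reduce.tasks=4;

-- Option 2: leave the count to Hive (-1) and tune its size-based estimate.
-- SET mapred.reduce.tasks=-1;
-- SET hive.exec.reducers.bytes.per.reducer=256000000;

-- A GROUP BY adds a reduce phase, so the reducer setting now matters.
-- Assumes a hypothetical table pv_gender_sum (gender STRING, cnt BIGINT).
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(1)
FROM pv_users
GROUP BY pv_users.gender;
```

The plain SELECT in the original question has no aggregation, join or
sort, so it would normally run as a map-only job and the reducer setting
would not apply to it. Queries that use a global ORDER BY are the usual
way to end up with exactly one reducer.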
