Posted to user@hive.apache.org by Hadi Moshayedi <ha...@moshayedi.net> on 2012/10/06 15:55:47 UTC
Compression of Intermediate Data
I wanted to look into improving performance of my Hive cluster, and from
what I read turning on compression of intermediate data could help. As I
understand, this would help because it would reduce the amount of data
written to disk in between jobs.
I looked at the documentation and applied the following settings:
SET hive.exec.compress.intermediate=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
I ran some queries to see how compression affects performance, but it
usually made query times worse. Even a query whose intermediate data was
close in size to its input data ran slower with compression enabled.
Question 1: Are the above settings the correct ones for enabling compression
of intermediate data?
Question 2: Are there use cases in which compression of intermediate data
would not help performance? Why would someone not always keep it turned on?
Thanks
Re: Compression of Intermediate Data
Posted by Hadi Moshayedi <ha...@moshayedi.net>.
Hi Bejoy,
Thanks.
Following your instructions, I also enabled map output compression.
I tried different queries, but I couldn't see any benefit from compression
in any of them. I also tried writing queries that produce large
intermediate data, but compression didn't improve their performance either.
I should also note that our Hadoop cluster is set up on a few Amazon EC2
m2.2xlarge instances.
The question is: in which scenarios can compression improve
performance?
Thanks,
-- Hadi
On Sat, Oct 6, 2012 at 6:32 PM, Bejoy KS <be...@yahoo.com> wrote:
> Hi Hadi
>
> The properties you specified don't enable compression of map output. To
> enable map output compression, you need to set the following properties:
>
> SET hive.exec.compress.output=true;
>
> SET mapred.map.output.compression=true;
> SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
>
>
> The property 'hive.exec.compress.intermediate' is used to enable
> compression of the data passed between the multiple MapReduce jobs
> generated by a Hive query.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
Re: Compression of Intermediate Data
Posted by Bejoy KS <be...@yahoo.com>.
Hi Hadi
The properties you specified don't enable compression of map output. To enable map output compression, you need to set the following properties:
SET hive.exec.compress.output=true;
SET mapred.map.output.compression=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
The property 'hive.exec.compress.intermediate' is used to enable compression of the data passed between the multiple MapReduce jobs generated by a Hive query.
Regards
Bejoy KS
Sent from handheld, please excuse typos.
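The distinction drawn in this reply, between data passed between jobs and map output within a job, can be sketched as a single settings block grouped by the stage each property affects. This is only an illustrative sketch, assuming a Hadoop 1.x (classic MapReduce) cluster with the Snappy native libraries installed. The property name mapred.compress.map.output is taken from the Hadoop 1.x defaults and differs from the name given in the message above, so treat it as an assumption and check it against your cluster's mapred-default.xml before relying on it.

```sql
-- Stage 1: data written between the MapReduce jobs of one Hive query.
SET hive.exec.compress.intermediate=true;

-- Stage 2: map output (shuffle data) inside each MapReduce job.
-- Assumption: Hadoop 1.x property name; verify against mapred-default.xml.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Stage 3: the final output files of the query.
SET hive.exec.compress.output=true;
-- Only relevant for SequenceFile output (NONE, RECORD, or BLOCK).
SET mapred.output.compression.type=BLOCK;
```

Enabling the stages one at a time and timing the same query after each change is one way to see which of them, if any, helps on a given cluster.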