Posted to user@hive.apache.org by Viraj Bhat <vi...@yahoo-inc.com> on 2010/06/11 23:59:23 UTC
Zebra, RC and Text size comparison
Hi all,
I have around 9 TB of data in Zebra, which I first converted to plain text using
TextOutputFormat in M/R; this resulted in around 43.07 TB. (I think I used
no compression here.)
I then converted this data to RC on the Hive console as:
CREATE TABLE LARGERC
ROW FORMAT SERDE
"org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFile
LOCATION '/user/viraj/huge' AS
SELECT * FROM PLAINTEXT;
(PLAINTEXT is the external table which is 43.07 TB in size)
The overall size of these files was around 41.65 TB, so I suspect that
compression was not being applied.
I read the following documentation:
http://hadoop.apache.org/hive/docs/r0.4.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
which says: "The actual compression algorithm used to compress key and/or
values can be specified by using the appropriate CompressionCodec"
a) What is the default Codec that is being used?
b) Any thoughts on how I can reduce the size?
Viraj
Re: Zebra, RC and Text size comparison
Posted by yongqiang he <he...@gmail.com>.
There is a config, hive.exec.compress.output (please double-check), that
controls whether to compress the final data or not.
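If I remember right, it defaults to false, so the RCFile data gets written
uncompressed; and when it is on but no codec is chosen, Hadoop falls back to
its DefaultCodec (zlib), which I believe answers your question (a). A minimal
sketch of what you could set in the Hive session before the CTAS (GzipCodec
here is just an example; any CompressionCodec installed on your cluster
should work):

-- turn on compression of the final output (off by default, I believe)
set hive.exec.compress.output=true;
-- example codec choice; swap in whatever your cluster supports
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

and then re-run your CREATE TABLE ... AS SELECT.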
Maybe you can just convert the data directly from Zebra; I sent out some
code to do that. Have you tried?
Also, I think it's good to first test on some small data before trying on
such a large dataset.
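For the small test, a hypothetical table along these lines might do (SMALLRC
and the LIMIT value are just placeholders):

CREATE TABLE SMALLRC
ROW FORMAT SERDE
"org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFile AS
SELECT * FROM PLAINTEXT LIMIT 100000;

Then compare its size on HDFS with and without the compression settings above.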