Posted to user@kudu.apache.org by Jason Heo <ja...@gmail.com> on 2017/03/11 03:16:48 UTC

Apache Kudu Table is 6.6 times larger than Parquet File.

Hello, I'm new to Apache Kudu. I was really impressed by the concept of
Kudu and its benchmark results, and I'm considering using Impala + Kudu
for my team's project.

One of the issues I have is that the Kudu table is much larger than the
Parquet file:

- Parquet file: 1.3 TB
- Kudu table: 8.6 TB

(Both tables are configured with a replication factor of 3.)

I'm using Kudu with CDH 5.10, and most of the configuration is unchanged
(I've only changed `memory_limit_hard_bytes` and `block_cache_capacity_mb`
to prevent bulk load errors).

When I changed the `ENCODING` for some fields, the size only decreased by
5%. I suspect there are optimization techniques that could reduce the Kudu
table size.
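
For reference, this is roughly what one of those encoding experiments
looked like; the table name and the `PREFIX_ENCODING` choice here are just
illustrative, not necessarily what I actually used:

```
-- Illustrative sketch only: in Impala, a Kudu column's encoding is set
-- in the column spec at CREATE TABLE time. kudu_table_test and the
-- PREFIX_ENCODING choice are made up for this example; host1,host2 are
-- the master addresses from the schema shown further below.
CREATE TABLE kudu_table_test (
  a STRING NOT NULL ENCODING PREFIX_ENCODING,
  b STRING NOT NULL,
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='host1,host2');
```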

I would really appreciate any advice.

Thanks in advance.

`parquet_table` has 38 STRING columns and 6 billion rows.

The schema of `parquet_table` looks like this:

```
> SHOW CREATE TABLE parquet_table;
+----------------------------------------------------+
| result                                             |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE default.parquet_table (      |
|   a STRING,                                        |
|   b STRING,                                        |
|   c STRING,                                        |
|   d STRING,                                        |
|   ...                                              |
| )                                                  |
| PARTITIONED BY (                                   |
|   ymd STRING                                       |
| )                                                  |
| WITH SERDEPROPERTIES ('serialization.format'='1')  |
| STORED AS PARQUET                                  |
| LOCATION 'hdfs://hostname/path/to/parquet'         |
+----------------------------------------------------+
```

I created `kudu_table` and bulk loaded it using `INSERT INTO kudu_table
SELECT * FROM parquet_table`.

```
> SHOW CREATE TABLE kudu_table;
+------------------------------------------------------------------------------+
| result                                                                       |
+------------------------------------------------------------------------------+
| CREATE TABLE default.kudu_table (                                            |
|   a STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,  |
|   b STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,  |
|   c STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,      |
|   d STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,      |
|   ...                                                                        |
|   PRIMARY KEY (a, b)                                                         |
| )                                                                            |
| PARTITION BY HASH (a) PARTITIONS 40                                          |
| STORED AS KUDU                                                               |
| TBLPROPERTIES ('kudu.master_addresses'='host1,host2',                        |
|                'kudu.table_name'='impala::kudu_table')                       |
+------------------------------------------------------------------------------+
```

Re: Apache Kudu Table is 6.6 times larger than Parquet File.

Posted by Jason Heo <ja...@gmail.com>.
Hi, Janne

As I mentioned, I'm using CDH 5.10. I checked it using Cloudera Manager,
under "Kudu -> Chart Library".

I'm not sure if there is another way.
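
(A note for later readers: the Kudu 1.2 that ships with CDH 5.10 has no
CLI for this, so Cloudera Manager or each tablet server's web UI is about
it; considerably newer Kudu releases do add a `kudu table statistics`
subcommand, sketched below.)

```
# Hedged sketch: `kudu table statistics` only exists in much newer Kudu
# releases than the 1.2 discussed here. host1,host2 are the master
# addresses from the table's TBLPROPERTIES earlier in the thread.
kudu table statistics host1,host2 impala::kudu_table
```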

Thanks.

2017-03-13 17:46 GMT+09:00 Janne Keskitalo <ja...@paf.com>:

> Hi
>
> How do you check the physical size of a kudu table?
>

Re: Apache Kudu Table is 6.6 times larger than Parquet File.

Posted by Janne Keskitalo <ja...@paf.com>.
Hi

How do you check the physical size of a kudu table?


Re: Apache Kudu Table is 6.6 times larger than Parquet File.

Posted by Jason Heo <ja...@gmail.com>.
Hi Todd

The schema I posted was generated by `CREATE TABLE kudu_table AS SELECT *
FROM parquet_table`.

Last weekend I tested different combinations of encodings and
compressions. The size has now fallen by 70%, but it is still about twice
the size of the Parquet table. I'm still testing which encoding is best
for specific columns. I hope it gets closer to Parquet ;)

Thanks

Jason

2017-03-13 15:30 GMT+09:00 Todd Lipcon <to...@cloudera.com>:

> Hi Jason,
>
> The first thing that jumps out to me is that you aren't using dictionary
> encoding on your string columns. I would recommend using DICT_ENCODING for
> all string fields and BIT_SHUFFLE for all int/double/float fields. If you
> have any string fields which are not repetitive (i.e., high cardinality), then I
> would also recommend enabling LZ4 compression on them (Parquet uses lz4 by
> default on all strings).
>
> That should get you close to Parquet sizes (and those are the new defaults
> in the upcoming 1.3 release). If you still see a 6x blowup after making
> these changes, please report back.
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Apache Kudu Table is 6.6 times larger than Parquet File.

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jason,

The first thing that jumps out to me is that you aren't using dictionary
encoding on your string columns. I would recommend using DICT_ENCODING for
all string fields and BIT_SHUFFLE for all int/double/float fields. If you
have any string fields which are not repetitive (i.e., high cardinality), then I
would also recommend enabling LZ4 compression on them (Parquet uses lz4 by
default on all strings).
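
To make that concrete, here is a rough sketch reusing the schema from your
mail; the table name is made up, and which of your columns are repetitive
vs. high-cardinality is an assumption you'd check against your data:

```
-- Hedged sketch of the advice above, mirroring the thread's schema:
-- DICT_ENCODING for repetitive strings, LZ4 on high-cardinality ones.
CREATE TABLE kudu_table_reencoded (
  a STRING NOT NULL ENCODING DICT_ENCODING,
  b STRING NOT NULL ENCODING DICT_ENCODING,
  c STRING NULL ENCODING DICT_ENCODING,
  d STRING NULL ENCODING PLAIN_ENCODING COMPRESSION LZ4,  -- assumed high cardinality
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='host1,host2');

-- Reload from the Parquet table, listing the columns explicitly:
INSERT INTO kudu_table_reencoded (a, b, c, d)
SELECT a, b, c, d FROM parquet_table;
```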

That should get you close to Parquet sizes (and those are the new defaults
in the upcoming 1.3 release). If you still see a 6x blowup after making
these changes, please report back.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera