Posted to dev@parquet.apache.org by Chin Wei Low <lo...@gmail.com> on 2015/04/10 04:26:04 UTC

Re: Parquet's memory management

Hi Ryan,

I am interested in the memory consumption when reading a Parquet
file. Is it also determined by the row group size, or by the page size?

Regards,
Chin Wei

On Wed, Mar 25, 2015 at 12:57 AM, Ryan Blue <bl...@cloudera.com> wrote:

> On 03/24/2015 07:20 AM, Stephen Carman wrote:
>
>> Hello,
>>
>> I'm looking for guidance on tuning Parquet's memory usage, as well as
>> how it generates partitions of data. Can anyone point me in the right
>> direction on how to tune these, or to programmatic methods of
>> generating partitions?
>>
>> Thanks,
>> Steve Carman
>>
>>
> Hi Steve,
>
> I recently wrote a blog post on the Parquet row group size with the
> basics. It's here:
>
>   http://ingest.tips/2015/01/31/parquet-row-group-size/
>
> For partitioning, that's mostly outside the scope of the format itself
> because it requires you to separate data into partitions in your processing.
>
> There are a couple of off-the-shelf ways to partition your data. The
> most popular is Hive, where you specify in its SQL-like language how to
> derive partition values in your insert statements. Another option is
> Kite (kitesdk.org), a library you can include in your application,
> which will partition the data for you based on a config file.
>
> rb
>
> (By the way, I work on Kite as well)
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: Parquet's memory management

Posted by Chin Wei Low <lo...@gmail.com>.
Thanks for the quick response.


Re: Parquet's memory management

Posted by Ryan Blue <bl...@cloudera.com>.
Memory consumption is on the order of the row group size, but can vary 
significantly depending on the columns that you select. Parquet is based 
on the idea of being able to skip large sections of data at read time. 
If you read all the columns, then you'll consume approximately the row 
group size. If you read half of the large columns, then maybe you'll use 
just half of the row group size, but it really depends on the data.

To estimate, I recommend writing some data and seeing what kind of size 
distribution you get for the column chunks. Then look at what columns 
you're likely to select to get an estimate of what percentage of the row 
group you would actually read.

rb
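Ryan's estimate above is straightforward to compute once you have per-column chunk sizes from the file footer (for example, as reported by the parquet-tools `meta` command). A small sketch with made-up numbers; the column names and sizes are hypothetical:

```python
def estimate_read_bytes(chunk_sizes, selected):
    """Approximate read-time memory: the sum of the column chunks
    actually read from one row group."""
    return sum(chunk_sizes[c] for c in selected)

# Hypothetical size distribution for one 128 MiB row group.
MiB = 1024 ** 2
chunks = {
    "id": 4 * MiB,
    "timestamp": 8 * MiB,
    "tags": 20 * MiB,
    "payload": 96 * MiB,
}
row_group = sum(chunks.values())                 # 128 MiB total
full_scan = estimate_read_bytes(chunks, chunks)  # all columns: ~row group size
narrow = estimate_read_bytes(chunks, ["id", "timestamp"])  # small fraction
```

In practice, actual memory use also includes decompression and decoding buffers, so treat this as an order-of-magnitude figure, as Ryan says.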



-- 
Ryan Blue
Software Engineer
Cloudera, Inc.