Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2013/03/12 16:45:25 UTC

Introducing Parquet: efficient columnar storage for Hadoop.

Fellow Hadoopers,

We'd like to introduce a joint project between Twitter and Cloudera
engineers -- a new columnar storage format for Hadoop called Parquet (
http://parquet.github.com).

We created Parquet to make the advantages of compressed, efficient columnar
data representation available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model, or
programming language.

Parquet is built from the ground up with complex nested data structures in
mind. We adopted the repetition/definition level approach to encoding such
data structures, as described in Google's Dremel paper; we have found this
to be a very efficient method of encoding data in non-trivial object
schemas.
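
To make the repetition/definition level idea concrete, here is a hand-worked
sketch in Java. The toy schema, the record shapes, and the ColumnEntry type
are invented purely for illustration and are not Parquet's API.

import java.util.ArrayList;
import java.util.List;

/**
 * Hand-worked Dremel-style repetition/definition levels for one column,
 * "links.backward", in a toy schema (illustration only, not Parquet code):
 *
 *   message Doc {
 *     optional group links {
 *       repeated int64 backward;
 *     }
 *   }
 */
public class RepDefLevelsSketch {

    /** One flattened column entry: value (null allowed), r-level, d-level. */
    record ColumnEntry(Long value, int repetitionLevel, int definitionLevel) {}

    public static void main(String[] args) {
        List<ColumnEntry> column = new ArrayList<>();

        // Record 1: { links: { backward: [10, 20] } }
        column.add(new ColumnEntry(10L, 0, 2)); // first value of a new record
        column.add(new ColumnEntry(20L, 1, 2)); // repeats at the "backward" level

        // Record 2: { links: {} } -- links present, backward empty
        column.add(new ColumnEntry(null, 0, 1));

        // Record 3: {} -- links absent entirely
        column.add(new ColumnEntry(null, 0, 0));

        // Max definition level is 2 (one optional + one repeated field in the
        // path); nulls are reconstructed from levels, never stored as values.
        column.forEach(e -> System.out.printf("value=%s r=%d d=%d%n",
                e.value(), e.repetitionLevel(), e.definitionLevel()));
    }
}

A reader can rebuild the fully nested records from these triples alone, which
is what makes the columnar layout lossless for nested data.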

Parquet is built to support very efficient compression and encoding
schemes. Parquet allows compression schemes to be specified at the
per-column level, and is future-proofed to allow adding more encodings as
they are invented and implemented. We separate the concepts of encoding and
compression, allowing Parquet consumers to implement operators that work
directly on encoded data without paying a decompression and decoding
penalty when possible.
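
As a rough illustration of why keeping the encoding visible to the engine
matters, the sketch below evaluates an equality predicate directly against
run-length-encoded runs instead of expanding them. The Run type and
countEquals helper are inventions for this example, not Parquet classes.

import java.util.List;

/** Toy example of operating on encoded data without decoding it first. */
public class RleScanSketch {

    /** A run of identical values: `value` repeated `count` times. */
    record Run(long value, long count) {}

    /** Count rows equal to `target` by inspecting runs, not individual rows. */
    static long countEquals(List<Run> runs, long target) {
        long matches = 0;
        for (Run run : runs) {
            if (run.value() == target) {
                matches += run.count(); // an entire run matches at once
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Run> encodedColumn = List.of(
                new Run(7, 1_000_000), // a million 7s stored as a single run
                new Run(3, 5),
                new Run(7, 250));
        System.out.println(countEquals(encodedColumn, 7)); // prints 1000250
    }
}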

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
data processing frameworks, and we are not interested in playing favorites.
We believe that an efficient, well-implemented columnar storage substrate
should be useful to all frameworks without the cost of extensive and
difficult-to-set-up dependencies.

The initial code, available at https://github.com/Parquet, defines the file
format, provides Java building blocks for processing columnar data, and
implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
of a complex integration -- Input/Output formats that can convert
Parquet-stored data directly to and from Thrift objects.
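
For orientation, here is a minimal sketch of how a Hadoop OutputFormat for
Parquet might plug into a plain MapReduce job. The
parquet.hadoop.ParquetOutputFormat name is an assumption made for this
sketch, and the format-specific configuration (record schema, write support)
is deliberately elided; consult https://github.com/Parquet for the actual
classes and setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Assumed class name -- take the real one from https://github.com/Parquet.
import parquet.hadoop.ParquetOutputFormat;

public class WriteParquetJobSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "write-parquet-sketch");
        job.setJarByClass(WriteParquetJobSketch.class);

        // Mapper/reducer and Parquet-specific setup (record schema, write
        // support) omitted; the point is only that the columnar format is
        // reached through Hadoop's standard OutputFormat contract.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(ParquetOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}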

A preview version of Parquet support will be available in Cloudera's Impala
0.7.

Twitter is starting to convert some of its major data sources to Parquet in
order to take advantage of the compression and deserialization savings.

Parquet is currently under heavy development. Parquet's near-term roadmap
includes:
* Hive SerDes (Criteo)
* Cascading Taps (Criteo)
* Support for dictionary encoding, zigzag encoding, and RLE encoding of
data (Cloudera and Twitter)
* Further improvements to Pig support (Twitter)

Company names in parentheses indicate whose engineers signed up to do the
work -- others can feel free to jump in too, of course.
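
Since the roadmap above mentions zigzag and RLE encodings, here is a rough,
generic sketch of both techniques. It uses the common protobuf-style zigzag
formula and a naive run-length encoder, and does not reflect how Parquet
will implement either.

import java.util.ArrayList;
import java.util.List;

/** Generic zigzag and run-length encoding sketches (not Parquet's code). */
public class EncodingSketches {

    /** Map a signed int so that small magnitudes become small unsigned values. */
    static int zigzagEncode(int n) {
        return (n << 1) ^ (n >> 31);
    }

    static int zigzagDecode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    /** Collapse consecutive equal values into (value, runLength) pairs. */
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int j = i;
            while (j < values.length && values[j] == values[i]) {
                j++;
            }
            runs.add(new int[] {values[i], j - i});
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(zigzagEncode(-1));                // 1
        System.out.println(zigzagDecode(zigzagEncode(-1)));  // -1
        for (int[] run : runLengthEncode(new int[] {5, 5, 5, 9})) {
            System.out.println(run[0] + " x" + run[1]);      // 5 x3, then 9 x1
        }
    }
}

Zigzag is typically paired with a variable-length or bit-packed integer
encoding, since it maps small negative and positive values alike to small
non-negative ones.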

We've also heard requests to provide an Avro container layer, similar to
what we do with Thrift. Seeking volunteers!

We welcome all feedback, patches, and ideas; to foster community
development, we plan to contribute Parquet to the Apache Incubator when the
development is farther along.

Regards,
Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
Jonathan Coveney, and friends.

Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Rob Weltman <ro...@cloudera.com>.
On 3/13/13 10:40 AM, Abhishek Kashyap wrote:
> The blog indicates Trevni is giving way to Parquet, and there will be no 
> need for Trevni any more. Let us know if that is an incorrect 
> interpretation.

Trevni is part of the Apache open source community and people are free to continue using and contributing to it if they see it as a better fit for their use case. For the reasons described on the list and in the Cloudera blog post, Twitter and Cloudera have decided to invest in the Parquet format and contribute it to the community.


Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Abhishek Kashyap <ak...@vmware.com>.
The blog indicates Trevni is giving way to Parquet, and there will be no need for Trevni any more. Let us know if that is an incorrect interpretation. 

----- Original Message -----

From: "Dmitriy Ryaboy" <dv...@gmail.com> 
To: "pig-user@hadoop.apache.org" <us...@hadoop.apache.org> 
Sent: Wednesday, March 13, 2013 10:25:04 AM 
Subject: Re: Introducing Parquet: efficient columnar storage for Hadoop. 

Hi folks, 
Thanks for your interest. The Cloudera blog post has a few additional bullet points about the difference between Trevni and Parquet: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/ 

D 


On Tue, Mar 12, 2013 at 3:40 PM, Luke Lu < llu@apache.org > wrote: 



IMO, it'll be enlightening to Hadoop users to compare Parquet with Trevni and ORCFile, all of which are columnar formats for Hadoop that are relatively new. Do we really need 3 columnar formats? 


On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy < dvryaboy@gmail.com > wrote: 

Fellow Hadoopers, 

We'd like to introduce a joint project between Twitter and Cloudera 
engineers -- a new columnar storage format for Hadoop called Parquet ( 
http://parquet.github.com ). 

We created Parquet to make the advantages of compressed, efficient columnar 
data representation available to any project in the Hadoop ecosystem, 
regardless of the choice of data processing framework, data model, or 
programming language. 

Parquet is built from the ground up with complex nested data structures in 
mind. We adopted the repetition/definition level approach to encoding such 
data structures, as described in Google's Dremel paper; we have found this 
to be a very efficient method of encoding data in non-trivial object 
schemas. 

Parquet is built to support very efficient compression and encoding 
schemes. Parquet allows compression schemes to be specified on a per-column 
level, and is future-proofed to allow adding more encodings as they are 
invented and implemented. We separate the concepts of encoding and 
compression, allowing parquet consumers to implement operators that work 
directly on encoded data without paying decompression and decoding penalty 
when possible. 

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with 
data processing frameworks, and we are not interested in playing favorites. 
We believe that an efficient, well-implemented columnar storage substrate 
should be useful to all frameworks without the cost of extensive and 
difficult to set up dependencies. 

The initial code, available at https://github.com/Parquet , defines the file 
format, provides Java building blocks for processing columnar data, and 
implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example 
of a complex integration -- Input/Output formats that can convert 
Parquet-stored data directly to and from Thrift objects. 

A preview version of Parquet support will be available in Cloudera's Impala 
0.7. 

Twitter is starting to convert some of its major data source to Parquet in 
order to take advantage of the compression and deserialization savings. 

Parquet is currently under heavy development. Parquet's near-term roadmap 
includes: 
* Hive SerDes (Criteo) 
* Cascading Taps (Criteo) 
* Support for dictionary encoding, zigzag encoding, and RLE encoding of 
data (Cloudera and Twitter) 
* Further improvements to Pig support (Twitter) 

Company names in parenthesis indicate whose engineers signed up to do the 
work -- others can feel free to jump in too, of course. 

We've also heard requests to provide an Avro container layer, similar to 
what we do with Thrift. Seeking volunteers! 

We welcome all feedback, patches, and ideas; to foster community 
development, we plan to contribute Parquet to the Apache Incubator when the 
development is farther along. 

Regards, 
Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy, 
Jonathan Coveney, and friends. 





Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Hi folks,
Thanks for your interest. The Cloudera blog post has a few additional
bullet points about the difference between Trevni and Parquet:
http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/

D


On Tue, Mar 12, 2013 at 3:40 PM, Luke Lu <ll...@apache.org> wrote:

> IMO, it'll be enlightening to Hadoop users to compare Parquet with Trevni
> and ORCFile, all of which are columnar formats for Hadoop that are
> relatively new. Do we really need 3 columnar formats?
>
>
> On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <dv...@gmail.com>wrote:
>
>> Fellow Hadoopers,
>>
>> We'd like to introduce a joint project between Twitter and Cloudera
>> engineers -- a new columnar storage format for Hadoop called Parquet (
>> http://parquet.github.com).
>>
>> We created Parquet to make the advantages of compressed, efficient
>> columnar
>> data representation available to any project in the Hadoop ecosystem,
>> regardless of the choice of data processing framework, data model, or
>> programming language.
>>
>> Parquet is built from the ground up with complex nested data structures in
>> mind. We adopted the repetition/definition level approach to encoding such
>> data structures, as described in Google's Dremel paper; we have found this
>> to be a very efficient method of encoding data in non-trivial object
>> schemas.
>>
>> Parquet is built to support very efficient compression and encoding
>> schemes. Parquet allows compression schemes to be specified on a
>> per-column
>> level, and is future-proofed to allow adding more encodings as they are
>> invented and implemented. We separate the concepts of encoding and
>> compression, allowing parquet consumers to implement operators that work
>> directly on encoded data without paying decompression and decoding penalty
>> when possible.
>>
>> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
>> data processing frameworks, and we are not interested in playing
>> favorites.
>> We believe that an efficient, well-implemented columnar storage substrate
>> should be useful to all frameworks without the cost of extensive and
>> difficult to set up dependencies.
>>
>> The initial code, available at https://github.com/Parquet, defines the
>> file
>> format, provides Java building blocks for processing columnar data, and
>> implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an
>> example
>> of a complex integration -- Input/Output formats that can convert
>> Parquet-stored data directly to and from Thrift objects.
>>
>> A preview version of Parquet support will be available in Cloudera's
>> Impala
>> 0.7.
>>
>> Twitter is starting to convert some of its major data source to Parquet in
>> order to take advantage of the compression and deserialization savings.
>>
>> Parquet is currently under heavy development. Parquet's near-term roadmap
>> includes:
>> * Hive SerDes (Criteo)
>> * Cascading Taps (Criteo)
>> * Support for dictionary encoding, zigzag encoding, and RLE encoding of
>> data (Cloudera and Twitter)
>> * Further improvements to Pig support (Twitter)
>>
>> Company names in parenthesis indicate whose engineers signed up to do the
>> work -- others can feel free to jump in too, of course.
>>
>> We've also heard requests to provide an Avro container layer, similar to
>> what we do with Thrift. Seeking volunteers!
>>
>> We welcome all feedback, patches, and ideas; to foster community
>> development, we plan to contribute Parquet to the Apache Incubator when
>> the
>> development is farther along.
>>
>> Regards,
>> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
>> Jonathan Coveney, and friends.
>>
>
>

Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Luke Lu <ll...@apache.org>.
IMO, it'll be enlightening to Hadoop users to compare Parquet with Trevni
and ORCFile, all of which are columnar formats for Hadoop that are
relatively new. Do we really need 3 columnar formats?


On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Fellow Hadoopers,
>
> We'd like to introduce a joint project between Twitter and Cloudera
> engineers -- a new columnar storage format for Hadoop called Parquet (
> http://parquet.github.com).
>
> We created Parquet to make the advantages of compressed, efficient columnar
> data representation available to any project in the Hadoop ecosystem,
> regardless of the choice of data processing framework, data model, or
> programming language.
>
> Parquet is built from the ground up with complex nested data structures in
> mind. We adopted the repetition/definition level approach to encoding such
> data structures, as described in Google's Dremel paper; we have found this
> to be a very efficient method of encoding data in non-trivial object
> schemas.
>
> Parquet is built to support very efficient compression and encoding
> schemes. Parquet allows compression schemes to be specified on a per-column
> level, and is future-proofed to allow adding more encodings as they are
> invented and implemented. We separate the concepts of encoding and
> compression, allowing parquet consumers to implement operators that work
> directly on encoded data without paying decompression and decoding penalty
> when possible.
>
> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> data processing frameworks, and we are not interested in playing favorites.
> We believe that an efficient, well-implemented columnar storage substrate
> should be useful to all frameworks without the cost of extensive and
> difficult to set up dependencies.
>
> The initial code, available at https://github.com/Parquet, defines the
> file
> format, provides Java building blocks for processing columnar data, and
> implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
> of a complex integration -- Input/Output formats that can convert
> Parquet-stored data directly to and from Thrift objects.
>
> A preview version of Parquet support will be available in Cloudera's Impala
> 0.7.
>
> Twitter is starting to convert some of its major data source to Parquet in
> order to take advantage of the compression and deserialization savings.
>
> Parquet is currently under heavy development. Parquet's near-term roadmap
> includes:
> * Hive SerDes (Criteo)
> * Cascading Taps (Criteo)
> * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> data (Cloudera and Twitter)
> * Further improvements to Pig support (Twitter)
>
> Company names in parenthesis indicate whose engineers signed up to do the
> work -- others can feel free to jump in too, of course.
>
> We've also heard requests to provide an Avro container layer, similar to
> what we do with Thrift. Seeking volunteers!
>
> We welcome all feedback, patches, and ideas; to foster community
> development, we plan to contribute Parquet to the Apache Incubator when the
> development is farther along.
>
> Regards,
> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> Jonathan Coveney, and friends.
>

Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Jonathan Coveney <jc...@gmail.com>.
Super excited that this is finally public. The benefits are huge, and
having an (eventually) battle-tested columnar storage format developed for
a diverse set of needs will be awesome.


2013/3/12 Kevin Olson <ko...@marinsoftware.com>

> Second on that. Parquet looks compelling, but I'm curious to understand why
> Cloudera suddenly switched from espousing future support for Trevni to
> teaming with Twitter on Parquet.
>
> On Tue, Mar 12, 2013 at 11:01 AM, Stan Rosenberg
> <st...@gmail.com>wrote:
>
> > Dmitriy,
> >
> > Please excuse my ignorance.  What is/was wrong with trevni
> > (https://github.com/cutting/trevni) ?
> >
> > Thanks,
> >
> > stan
> >
> > On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <dv...@gmail.com>
> > wrote:
> > > Fellow Hadoopers,
> > >
> > > We'd like to introduce a joint project between Twitter and Cloudera
> > > engineers -- a new columnar storage format for Hadoop called Parquet (
> > > http://parquet.github.com).
> > >
> > > We created Parquet to make the advantages of compressed, efficient
> > columnar
> > > data representation available to any project in the Hadoop ecosystem,
> > > regardless of the choice of data processing framework, data model, or
> > > programming language.
> > >
> > > Parquet is built from the ground up with complex nested data structures
> > in
> > > mind. We adopted the repetition/definition level approach to encoding
> > such
> > > data structures, as described in Google's Dremel paper; we have found
> > this
> > > to be a very efficient method of encoding data in non-trivial object
> > > schemas.
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a
> > per-column
> > > level, and is future-proofed to allow adding more encodings as they are
> > > invented and implemented. We separate the concepts of encoding and
> > > compression, allowing parquet consumers to implement operators that
> work
> > > directly on encoded data without paying decompression and decoding
> > penalty
> > > when possible.
> > >
> > > Parquet is built to be used by anyone. The Hadoop ecosystem is rich
> with
> > > data processing frameworks, and we are not interested in playing
> > favorites.
> > > We believe that an efficient, well-implemented columnar storage
> substrate
> > > should be useful to all frameworks without the cost of extensive and
> > > difficult to set up dependencies.
> > >
> > > The initial code, available at https://github.com/Parquet, defines the
> > file
> > > format, provides Java building blocks for processing columnar data, and
> > > implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an
> > example
> > > of a complex integration -- Input/Output formats that can convert
> > > Parquet-stored data directly to and from Thrift objects.
> > >
> > > A preview version of Parquet support will be available in Cloudera's
> > Impala
> > > 0.7.
> > >
> > > Twitter is starting to convert some of its major data source to Parquet
> > in
> > > order to take advantage of the compression and deserialization savings.
> > >
> > > Parquet is currently under heavy development. Parquet's near-term
> roadmap
> > > includes:
> > > * Hive SerDes (Criteo)
> > > * Cascading Taps (Criteo)
> > > * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> > > data (Cloudera and Twitter)
> > > * Further improvements to Pig support (Twitter)
> > >
> > > Company names in parenthesis indicate whose engineers signed up to do
> the
> > > work -- others can feel free to jump in too, of course.
> > >
> > > We've also heard requests to provide an Avro container layer, similar
> to
> > > what we do with Thrift. Seeking volunteers!
> > >
> > > We welcome all feedback, patches, and ideas; to foster community
> > > development, we plan to contribute Parquet to the Apache Incubator when
> > the
> > > development is farther along.
> > >
> > > Regards,
> > > Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> > > Jonathan Coveney, and friends.
> >
>

Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Cloudera has published a blog post [1] about Parquet that seems to answer most of the questions. I would encourage everyone to read that article. It specifically talks about the relationship with Trevni:

Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, and learning from, the initial work done toward this goal in Trevni, Parquet includes the following enhancements:

* Efficiently encode nested structures and sparsely populated data based on the Google Dremel definition/repetition levels
* Provide extensible support for per-column encodings (e.g. delta, run length, etc)
* Provide extensibility of storing multiple types of data in column data (e.g. indexes, bloom filters, statistics)
* Offer better write performance by storing metadata at the end of the file

Jarcec

Links:
1: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/
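
To make the last enhancement above (metadata stored at the end of the file)
concrete, here is a toy, hand-rolled sketch of the write-data-first,
footer-last idea. It does not reflect Parquet's actual on-disk layout; it
only shows why a trailing footer lets writers stream forward while readers
seek straight to the tail for offsets.

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Conceptual footer-at-the-end writer (not the Parquet file format). */
public class FooterLastWriterSketch {

    public static void main(String[] args) throws IOException {
        List<Long> chunkOffsets = new ArrayList<>();
        try (DataOutputStream out =
                 new DataOutputStream(new FileOutputStream("toy-columnar.bin"))) {

            // Stream out two toy "column chunks", recording where each starts.
            for (long[] chunk : new long[][] {{1, 2, 3}, {10, 20, 30}}) {
                chunkOffsets.add((long) out.size());
                for (long v : chunk) {
                    out.writeLong(v);
                }
            }

            // Footer written last: chunk offsets, then the footer's own length,
            // so a reader can locate it by reading backwards from the end.
            long footerStart = out.size();
            out.writeInt(chunkOffsets.size());
            for (long offset : chunkOffsets) {
                out.writeLong(offset);
            }
            out.writeLong(out.size() + Long.BYTES - footerStart); // length incl. this field
        }
    }
}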

On Tue, Mar 12, 2013 at 01:06:04PM -0700, Kevin Olson wrote:
> Second on that. Parquet looks compelling, but I'm curious to understand why
> Cloudera suddenly switched from espousing future support for Trevni to
> teaming with Twitter on Parquet.
> 
> On Tue, Mar 12, 2013 at 11:01 AM, Stan Rosenberg
> <st...@gmail.com>wrote:
> 
> > Dmitriy,
> >
> > Please excuse my ignorance.  What is/was wrong with trevni
> > (https://github.com/cutting/trevni) ?
> >
> > Thanks,
> >
> > stan
> >
> > On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <dv...@gmail.com>
> > wrote:
> > > Fellow Hadoopers,
> > >
> > > We'd like to introduce a joint project between Twitter and Cloudera
> > > engineers -- a new columnar storage format for Hadoop called Parquet (
> > > http://parquet.github.com).
> > >
> > > We created Parquet to make the advantages of compressed, efficient
> > columnar
> > > data representation available to any project in the Hadoop ecosystem,
> > > regardless of the choice of data processing framework, data model, or
> > > programming language.
> > >
> > > Parquet is built from the ground up with complex nested data structures
> > in
> > > mind. We adopted the repetition/definition level approach to encoding
> > such
> > > data structures, as described in Google's Dremel paper; we have found
> > this
> > > to be a very efficient method of encoding data in non-trivial object
> > > schemas.
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a
> > per-column
> > > level, and is future-proofed to allow adding more encodings as they are
> > > invented and implemented. We separate the concepts of encoding and
> > > compression, allowing parquet consumers to implement operators that work
> > > directly on encoded data without paying decompression and decoding
> > penalty
> > > when possible.
> > >
> > > Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> > > data processing frameworks, and we are not interested in playing
> > favorites.
> > > We believe that an efficient, well-implemented columnar storage substrate
> > > should be useful to all frameworks without the cost of extensive and
> > > difficult to set up dependencies.
> > >
> > > The initial code, available at https://github.com/Parquet, defines the
> > file
> > > format, provides Java building blocks for processing columnar data, and
> > > implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an
> > example
> > > of a complex integration -- Input/Output formats that can convert
> > > Parquet-stored data directly to and from Thrift objects.
> > >
> > > A preview version of Parquet support will be available in Cloudera's
> > Impala
> > > 0.7.
> > >
> > > Twitter is starting to convert some of its major data source to Parquet
> > in
> > > order to take advantage of the compression and deserialization savings.
> > >
> > > Parquet is currently under heavy development. Parquet's near-term roadmap
> > > includes:
> > > * Hive SerDes (Criteo)
> > > * Cascading Taps (Criteo)
> > > * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> > > data (Cloudera and Twitter)
> > > * Further improvements to Pig support (Twitter)
> > >
> > > Company names in parenthesis indicate whose engineers signed up to do the
> > > work -- others can feel free to jump in too, of course.
> > >
> > > We've also heard requests to provide an Avro container layer, similar to
> > > what we do with Thrift. Seeking volunteers!
> > >
> > > We welcome all feedback, patches, and ideas; to foster community
> > > development, we plan to contribute Parquet to the Apache Incubator when
> > the
> > > development is farther along.
> > >
> > > Regards,
> > > Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> > > Jonathan Coveney, and friends.
> >

Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Kevin Olson <ko...@marinsoftware.com>.
Second on that. Parquet looks compelling, but I'm curious to understand why
Cloudera suddenly switched from espousing future support for Trevni to
teaming with Twitter on Parquet.

On Tue, Mar 12, 2013 at 11:01 AM, Stan Rosenberg
<st...@gmail.com>wrote:

> Dmitriy,
>
> Please excuse my ignorance.  What is/was wrong with trevni
> (https://github.com/cutting/trevni) ?
>
> Thanks,
>
> stan
>
> On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <dv...@gmail.com>
> wrote:
> > Fellow Hadoopers,
> >
> > We'd like to introduce a joint project between Twitter and Cloudera
> > engineers -- a new columnar storage format for Hadoop called Parquet (
> > http://parquet.github.com).
> >
> > We created Parquet to make the advantages of compressed, efficient
> columnar
> > data representation available to any project in the Hadoop ecosystem,
> > regardless of the choice of data processing framework, data model, or
> > programming language.
> >
> > Parquet is built from the ground up with complex nested data structures
> in
> > mind. We adopted the repetition/definition level approach to encoding
> such
> > data structures, as described in Google's Dremel paper; we have found
> this
> > to be a very efficient method of encoding data in non-trivial object
> > schemas.
> >
> > Parquet is built to support very efficient compression and encoding
> > schemes. Parquet allows compression schemes to be specified on a
> per-column
> > level, and is future-proofed to allow adding more encodings as they are
> > invented and implemented. We separate the concepts of encoding and
> > compression, allowing parquet consumers to implement operators that work
> > directly on encoded data without paying decompression and decoding
> penalty
> > when possible.
> >
> > Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> > data processing frameworks, and we are not interested in playing
> favorites.
> > We believe that an efficient, well-implemented columnar storage substrate
> > should be useful to all frameworks without the cost of extensive and
> > difficult to set up dependencies.
> >
> > The initial code, available at https://github.com/Parquet, defines the
> file
> > format, provides Java building blocks for processing columnar data, and
> > implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an
> example
> > of a complex integration -- Input/Output formats that can convert
> > Parquet-stored data directly to and from Thrift objects.
> >
> > A preview version of Parquet support will be available in Cloudera's
> Impala
> > 0.7.
> >
> > Twitter is starting to convert some of its major data source to Parquet
> in
> > order to take advantage of the compression and deserialization savings.
> >
> > Parquet is currently under heavy development. Parquet's near-term roadmap
> > includes:
> > * Hive SerDes (Criteo)
> > * Cascading Taps (Criteo)
> > * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> > data (Cloudera and Twitter)
> > * Further improvements to Pig support (Twitter)
> >
> > Company names in parenthesis indicate whose engineers signed up to do the
> > work -- others can feel free to jump in too, of course.
> >
> > We've also heard requests to provide an Avro container layer, similar to
> > what we do with Thrift. Seeking volunteers!
> >
> > We welcome all feedback, patches, and ideas; to foster community
> > development, we plan to contribute Parquet to the Apache Incubator when
> the
> > development is farther along.
> >
> > Regards,
> > Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> > Jonathan Coveney, and friends.
>

Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Stan Rosenberg <st...@gmail.com>.
Dmitriy,

Please excuse my ignorance. What is/was wrong with Trevni
(https://github.com/cutting/trevni)?

Thanks,

stan

On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> Fellow Hadoopers,
>
> We'd like to introduce a joint project between Twitter and Cloudera
> engineers -- a new columnar storage format for Hadoop called Parquet (
> http://parquet.github.com).
>
> We created Parquet to make the advantages of compressed, efficient columnar
> data representation available to any project in the Hadoop ecosystem,
> regardless of the choice of data processing framework, data model, or
> programming language.
>
> Parquet is built from the ground up with complex nested data structures in
> mind. We adopted the repetition/definition level approach to encoding such
> data structures, as described in Google's Dremel paper; we have found this
> to be a very efficient method of encoding data in non-trivial object
> schemas.
>
> Parquet is built to support very efficient compression and encoding
> schemes. Parquet allows compression schemes to be specified on a per-column
> level, and is future-proofed to allow adding more encodings as they are
> invented and implemented. We separate the concepts of encoding and
> compression, allowing parquet consumers to implement operators that work
> directly on encoded data without paying decompression and decoding penalty
> when possible.
>
> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> data processing frameworks, and we are not interested in playing favorites.
> We believe that an efficient, well-implemented columnar storage substrate
> should be useful to all frameworks without the cost of extensive and
> difficult to set up dependencies.
>
> The initial code, available at https://github.com/Parquet, defines the file
> format, provides Java building blocks for processing columnar data, and
> implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
> of a complex integration -- Input/Output formats that can convert
> Parquet-stored data directly to and from Thrift objects.
>
> A preview version of Parquet support will be available in Cloudera's Impala
> 0.7.
>
> Twitter is starting to convert some of its major data source to Parquet in
> order to take advantage of the compression and deserialization savings.
>
> Parquet is currently under heavy development. Parquet's near-term roadmap
> includes:
> * Hive SerDes (Criteo)
> * Cascading Taps (Criteo)
> * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> data (Cloudera and Twitter)
> * Further improvements to Pig support (Twitter)
>
> Company names in parenthesis indicate whose engineers signed up to do the
> work -- others can feel free to jump in too, of course.
>
> We've also heard requests to provide an Avro container layer, similar to
> what we do with Thrift. Seeking volunteers!
>
> We welcome all feedback, patches, and ideas; to foster community
> development, we plan to contribute Parquet to the Apache Incubator when the
> development is farther along.
>
> Regards,
> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> Jonathan Coveney, and friends.

Re: Introducing Parquet: efficient columnar storage for Hadoop.

Posted by Luke Lu <ll...@apache.org>.
IMO, it would be enlightening for Hadoop users to see a comparison of Parquet
with Trevni and ORCFile, all of which are relatively new columnar formats for
Hadoop. Do we really need three columnar formats?


