Posted to dev@drill.apache.org by Jacques Nadeau <ja...@apache.org> on 2013/04/05 01:12:50 UTC

Re: Another columnar format Parquet

Does ORC support nested data?  How does it compare to the Dremel encoding
approach that Parquet utilizes?

Thanks,
Jacques

On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <om...@apache.org> wrote:

> On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > So is it fair to say that Parquet will be open to contributions and will
> > hopefully develop an open community to drive it?
> >
> > If so, that is an excellent development.
> >
> > Is ORC file well enough developed for a comparison?
> >
>
> ORC is committed to Hive's trunk and seems more feature complete than
> Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
> encoder yet. Obviously, if you have questions about ORC, please ask over on
> Hive's dev list.
>
> -- Owen
>
>
> >
> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <to...@cloudera.com> wrote:
> >
> > > Hey Jacques,
> > >
> > > Feel free to ping us with any questions. Despite some of the _users_ of
> > > Parquet competing with each other (eg query engines), we hope the file
> > > format itself can be easily implemented by everyone and become
> > ubiquitous.
> > >
> > > There are a few changes still in flight that we're working on, so you
> may
> > > want to join the parquet dev mailing list as well to follow along.
> > >
> > > Thanks
> > > -Todd
> > >
> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > >
> > > > When you said soon, you meant very soon.  This looks like great work.
> > > >  Thanks for sharing it with the world.  Will come back after spending
> > > some
> > > > time with it.
> > > >
> > > > thanks again,
> > > > Jacques
> > > >
> > > >
> > > >
> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <ju...@twitter.com>
> > > wrote:
> > > >
> > > > > The repo is now available: http://parquet.github.com/
> > > > > Let me know if you have questions
> > > > >
> > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <
> jacques@apache.org
> > >
> > > > > wrote:
> > > > > > There definitely seem to be some new kids on the block.  I really
> > > hope
> > > > > that
> > > > > > Drill can adopt either ORC or Parquet as a closely related
> "native"
> > > > > format.
> > > > > >   At the moment, I'm actually more focused on the in-memory
> > execution
> > > > > > format and the right abstraction to support compressed columnar
> > > > execution
> > > > > > and vectorization.  Historically, the biggest gaps I'd worry
> about
> > > are
> > > > > > java-centricity and expectation of early materialization &
> > > > decompression.
> > > > > >  Once we get some execution stuff working, lets see how each fits
> > in.
> > > > > >  Rather than start a third competing format (or fourth if you
> count
> > > > > > Trevni), let's either use or extend/contribute back on one of the
> > > > > existing
> > > > > > new kids.
> > > > > >
> > > > > > Julien, do you think more will be shared about Parquet before the
> > > > Hadoop
> > > > > > Summit so we can start toying with using it inside of Drill?
> > > > > >
> > > > > > J
> > > > > >
> > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler
> > > > > > <kk...@transpac.com>wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> I've been trying to track down status/comparisons of various
> > > columnar
> > > > > >> formats, and just heard about Parquet.
> > > > > >>
> > > > > >> I don't have any direct experience with Parquet, but Really
> Smart
> > > Guy
> > > > > said:
> > > > > >>
> > > > > >> > From what I hear there are two key features that
> > > > > >> > differentiate it from ORC and Trevni: 1) columns can be
> > optionally
> > > > > split
> > > > > >> into
> > > > > >> > separate files, and 2) the mechanism for shredding nested
> fields
> > > > into
> > > > > >> > columns is taken almost verbatim from Dremel. Feature (1)
> won't
> > be
> > > > > >> practical
> > > > > >> > to use until Hadoop introduces support for a file group
> locality
> > > > > >> feature, but once it
> > > > > >> > does this feature should enable more efficient use of the
> buffer
> > > > cache
> > > > > >> for predicate
> > > > > >> > pushdown operations.
> > > > > >>
> > > > > >> -- Ken
> > > > > >>
> > > > > >>
> > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
> > > > > >>
> > > > > >> > Parquet is actually implementing the algorithm described in
> the
> > > > > >> > "Nested Columnar Storage" section of the Dremel paper[1].
> > > > > >> >
> > > > > >> > [1] http://research.google.com/pubs/pub36632.html
> > > > > >> >
> > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <
> > tnachen@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >> >> Just saw this:
> > > > > >> >>
> > > > > >> >> http://t.co/ES1dGDZlKA
> > > > > >> >>
> > > > > >> >> I know Trevni is another Dremel inspired Columnar format as
> > well,
> > > > > anyone
> > > > > >> >> saw much info Parquet and how it's different?
> > > > > >> >>
> > > > > >> >> Tim
> > > > > >>
> > > > > >> --------------------------
> > > > > >> Ken Krugler
> > > > > >> +1 530-210-6378
> > > > > >> http://www.scaleunlimited.com
> > > > > >> custom big data solutions & training
> > > > > >> Hadoop, Cascading, Cassandra & Solr
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
>

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

Looking at this more carefully the other day with Jacques makes it seem that

a) as Owen says ORC has a more elaborate type structure.  The data stored
is equivalent (ref the Protobuf versus Avro versus Thrift discussions)
subject to the possibility of the null, null difference that Owen mentions

b) as the Dremel paper points out, access to a very rare repeated structure
inside a common repeated structure will require only traversal of the rare
element using Dremel type structures, but with repeat counts will require
an additional traversal of a much more dense column.  How much difference
this will make in practice is unknown, but there are clearly cases that you
can imagine that this will cause orders of magnitude difference in favor of
Parquet.  Those cases may, however, be vanishingly rare.

On Mon, Apr 15, 2013 at 4:06 PM, Owen O'Malley <om...@apache.org> wrote:

> Just a bit saying whether the record was present or null. Note that this is
> strictly more expressive than Parquet's format in that it can encode
> structures with all null values. I believe the Parquet encoder would
> discard a row of the form (null, null) since it wouldn't have any leaves to
> make it materialize.
>
> -- Owen
>
>
> On Wed, Apr 10, 2013 at 3:48 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Wed, Apr 10, 2013 at 10:17 AM, Owen O'Malley <om...@apache.org>
> > wrote:
> >
> > > Ted,
> > >    ORC does support nested structures and splits them into primitive
> > > columns.
> >
> >
> > Good to hear.
> >
> >
> > > ...
> > > create table Foo (
> > >   complex: struct<field1: int, field2: map<string, int>>
> > >   simple: timestamp
> > > );
> > >
> > > will end up with a prefix-order flattening of the columns:
> > >
> > > columns:
> > > 0 - top level record (struct, children: 1, 6)
> > > 1 - complex (struct, children: 2, 3)
> > >
> >
> > What is stored in column 1?
> >
>

Re: Another columnar format Parquet

Posted by Owen O'Malley <om...@apache.org>.

Just a bit saying whether the record was present or null. Note that this is
strictly more expressive than Parquet's format in that it can encode
structures with all null values. I believe the Parquet encoder would
discard a row of the form (null, null) since it wouldn't have any leaves to
make it materialize.
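
A minimal Python sketch of the presence-bit idea, for illustration only. It
is not ORC's actual writer: the names encode and decode are hypothetical, the
struct is fixed at two int fields, and the sketch assumes child columns store
nothing for rows whose parent struct is null. It shows that one bit per row
on the struct column is enough to tell a null struct apart from a struct
whose fields are all null.

```python
from typing import List, Optional, Tuple

Row = Optional[Tuple[Optional[int], Optional[int]]]


def encode(rows: List[Row]):
    """Split rows into a struct presence column plus two child columns."""
    struct_present: List[bool] = []            # one bit per row
    field_present: Tuple[List[bool], List[bool]] = ([], [])
    field_values: Tuple[List[int], List[int]] = ([], [])
    for row in rows:
        struct_present.append(row is not None)
        if row is None:
            continue  # sketch assumption: children skip rows with a null parent
        for i, value in enumerate(row):
            field_present[i].append(value is not None)
            if value is not None:
                field_values[i].append(value)
    return struct_present, field_present, field_values


def decode(struct_present, field_present, field_values) -> List[Row]:
    """Rebuild the rows from the three groups of columns."""
    rows: List[Row] = []
    child_pos = 0          # index into the children's presence bits
    value_pos = [0, 0]     # index into each child's non-null values
    for present in struct_present:
        if not present:
            rows.append(None)
            continue
        fields = []
        for i in range(2):
            if field_present[i][child_pos]:
                fields.append(field_values[i][value_pos[i]])
                value_pos[i] += 1
            else:
                fields.append(None)
        child_pos += 1
        rows.append(tuple(fields))
    return rows


rows = [(1, 2), None, (None, None), (3, None)]
encoded = encode(rows)
decoded = decode(*encoded)
```

Here rows[1] (a null struct) and rows[2] (a struct of two nulls) differ only
in the struct column's presence bit, and both survive the round trip.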

-- Owen

On Wed, Apr 10, 2013 at 3:48 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Apr 10, 2013 at 10:17 AM, Owen O'Malley <om...@apache.org>
> wrote:
>
> > Ted,
> >    ORC does support nested structures and splits them into primitive
> > columns.
>
>
> Good to hear.
>
>
> > ...
> > create table Foo (
> >   complex: struct<field1: int, field2: map<string, int>>
> >   simple: timestamp
> > );
> >
> > will end up with a prefix-order flattening of the columns:
> >
> > columns:
> > 0 - top level record (struct, children: 1, 6)
> > 1 - complex (struct, children: 2, 3)
> >
>
> What is stored in column 1?
>

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Apr 10, 2013 at 10:17 AM, Owen O'Malley <om...@apache.org> wrote:

> Ted,
>    ORC does support nested structures and splits them into primitive
> columns.


Good to hear.


> ...
> create table Foo (
>   complex: struct<field1: int, field2: map<string, int>>
>   simple: timestamp
> );
>
> will end up with a prefix-order flattening of the columns:
>
> columns:
> 0 - top level record (struct, children: 1, 6)
> 1 - complex (struct, children: 2, 3)
>

What is stored in column 1?

Re: Another columnar format Parquet

Posted by Owen O'Malley <om...@apache.org>.

Ted,
   ORC does support nested structures and splits them into primitive
columns. That is required to get the benefits of type-specific encodings.
ORC doesn't do the complex repetition and definition levels that require a
DFA to reassemble the rows, but takes the more straightforward approach of
recording the information for the intermediate columns.

create table Foo (
  complex: struct<field1: int, field2: map<string, int>>
  simple: timestamp
);

will end up with a prefix-order flattening of the columns:

columns:
0 - top level record (struct, children: 1, 6)
1 - complex (struct, children: 2, 3)
2 - field1 (int)
3 - field2 (map, children: 4, 5)
4 - map key (string)
5 - map value (int)
6 - simple (timestamp)
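
The numbering above is a pre-order walk of the type tree. A minimal Python
sketch, for illustration only: TypeNode and flatten are hypothetical names,
not part of ORC or Hive, and the schema is hand-built to match table Foo.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    """Hypothetical schema node: a name, a type kind, and child nodes."""
    name: str
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def flatten(root: TypeNode) -> List[Tuple[int, str, str, Tuple[int, ...]]]:
    """Number every node in pre-order; return (id, name, kind, child ids)."""
    names: List[Tuple[str, str]] = []
    child_ids: List[List[int]] = []

    def visit(node: TypeNode) -> int:
        col_id = len(names)  # a parent takes its id before any of its children
        names.append((node.name, node.kind))
        child_ids.append([])
        for child in node.children:
            child_ids[col_id].append(visit(child))
        return col_id

    visit(root)
    return [(i, n, k, tuple(child_ids[i])) for i, (n, k) in enumerate(names)]


schema = TypeNode("top level record", "struct", [
    TypeNode("complex", "struct", [
        TypeNode("field1", "int"),
        TypeNode("field2", "map", [
            TypeNode("map key", "string"),
            TypeNode("map value", "int"),
        ]),
    ]),
    TypeNode("simple", "timestamp"),
])

columns = flatten(schema)
for col_id, name, kind, kids in columns:
    suffix = f", children: {', '.join(map(str, kids))}" if kids else ""
    print(f"{col_id} - {name} ({kind}{suffix})")
```

Because each parent is numbered before its children, the subtree of a column
always occupies a contiguous range of ids (here 1 through 5 for complex).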

Instead of encoding the definition and repetition levels in columns 4 and
5, ORC encodes the number of entries in the map in the data for column 3.
It would be very interesting to take the githubarchive.org logs and put
them into ORC and Parquet and measure the resulting file sizes. (Other
thoughts about such a comparison: try compressed versus uncompressed and
turning off ORC's indexes since Parquet doesn't have indexes.)
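
A minimal Python sketch of the length-based map encoding described above,
for illustration only. The function names encode_maps and decode_maps are
hypothetical and this is not ORC's on-disk layout: it only shows the idea
that the map column (3) carries one entry count per row while the key (4)
and value (5) columns hold the entries of all rows back to back.

```python
from typing import Dict, List, Tuple


def encode_maps(rows: List[Dict[str, int]]) -> Tuple[List[int], List[str], List[int]]:
    """Return (entry counts per row, all keys, all values)."""
    lengths: List[int] = []   # stands in for the data of column 3
    keys: List[str] = []      # stands in for column 4
    values: List[int] = []    # stands in for column 5
    for row in rows:
        lengths.append(len(row))
        for key, value in row.items():
            keys.append(key)
            values.append(value)
    return lengths, keys, values


def decode_maps(lengths: List[int], keys: List[str], values: List[int]) -> List[Dict[str, int]]:
    """Rebuild each row's map by consuming `length` entries at a time."""
    rows: List[Dict[str, int]] = []
    pos = 0
    for n in lengths:
        rows.append(dict(zip(keys[pos:pos + n], values[pos:pos + n])))
        pos += n
    return rows


rows = [{"a": 1, "b": 2}, {}, {"c": 3}]
lengths, keys, values = encode_maps(rows)
restored = decode_maps(lengths, keys, values)
```

Reassembly is a single forward pass over the counts, with no repetition or
definition levels attached to the key and value columns.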

-- Owen


On Tue, Apr 9, 2013 at 8:43 AM, Timothy Chen <tn...@gmail.com> wrote:

> Hi Ted,
>
> Can you explain more about the question you have about encoding in ORC?
>
> Tim
>
> Sent from my iPhone
>
> On Apr 4, 2013, at 11:01 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > Yes it does.
> >
> > I have seen conflicting docs on the format it uses.  One seemed to say that
> > complex cells were stored within a single cell.  The other seemed to say
> > that nested structures were shredded in the style of Parquet or Dremel.
> >
> > One thing that I worry about with ORC is that it exactly replicates the
> > schema model of Hive which isn't as congenial (to me) as the protobuf
> style
> > of Parquet.  As Julien mentioned in the Drill meetup, there is also the
> > question of the correctness of the encoding.  The Dremel column shredding
> > is pretty subtle.  Hopefully ORC authors started from first principles in
> > designing the encoding.
> >
> >
> > On Fri, Apr 5, 2013 at 1:12 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >
> >> Does ORC support nested data?  How does it compare to the Dremel
> encoding
> >> approach that Parquet utilizes?
> >>
> >> Thanks,
> >> Jacques

Re: Another columnar format Parquet

Posted by Timothy Chen <tn...@gmail.com>.

Hi Ted,

Can you explain more about the question you have about encoding in ORC? 

Tim

Sent from my iPhone

On Apr 4, 2013, at 11:01 PM, Ted Dunning <te...@gmail.com> wrote:

> Yes it does.
> 
> I have seen conflicting docs on the format it uses.  One seemed to say that
> complex cells were stored within a single cell.  The other seemed to say
> that nested structures were shredded in the style of Parquet or Dremel.
> 
> One thing that I worry about with ORC is that it exactly replicates the
> schema model of Hive which isn't as congenial (to me) as the protobuf style
> of Parquet.  As Julien mentioned in the Drill meetup, there is also the
> question of the correctness of the encoding.  The Dremel column shredding
> is pretty subtle.  Hopefully ORC authors started from first principles in
> designing the encoding.
> 
> 
> On Fri, Apr 5, 2013 at 1:12 AM, Jacques Nadeau <ja...@apache.org> wrote:
> 
>> Does ORC support nested data?  How does it compare to the Dremel encoding
>> approach that Parquet utilizes?
>> 
>> Thanks,
>> Jacques

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

Yes it does.

I have seen conflicting docs on the format it uses.  One seemed to say that
complex cells were stored within a single cell.  The other seemed to say
that nested structures were shredded in the style of Parquet or Dremel.

One thing that I worry about with ORC is that it exactly replicates the
schema model of Hive which isn't as congenial (to me) as the protobuf style
of Parquet.  As Julien mentioned in the Drill meetup, there is also the
question of the correctness of the encoding.  The Dremel column shredding
is pretty subtle.  Hopefully ORC authors started from first principles in
designing the encoding.


On Fri, Apr 5, 2013 at 1:12 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Does ORC support nested data?  How does it compare to the Dremel encoding
> approach that Parquet utilizes?
>
> Thanks,
> Jacques