Posted to dev@drill.apache.org by Timothy Chen <tn...@gmail.com> on 2013/03/11 18:41:10 UTC

Another columnar format Parquet

Just saw this:

http://t.co/ES1dGDZlKA

I know Trevni is another Dremel-inspired columnar format as well. Has anyone
seen much info on Parquet and how it differs?

Tim

Re: Another columnar format Parquet

Posted by Julien Le Dem <ju...@twitter.com>.

Parquet is not Java-centric.
We will share more about it soon.
Julien

On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <ja...@apache.org> wrote:
> There definitely seem to be some new kids on the block.  I really hope that
> Drill can adopt either ORC or Parquet as a closely related "native" format.
>   At the moment, I'm actually more focused on the in-memory execution
> format and the right abstraction to support compressed columnar execution
> and vectorization.  Historically, the biggest gaps I'd worry about are
> java-centricity and expectation of early materialization & decompression.
>  Once we get some execution stuff working, let's see how each fits in.
>  Rather than start a third competing format (or fourth if you count
> Trevni), let's either use or extend/contribute back on one of the existing
> new kids.
>
> Julien, do you think more will be shared about Parquet before the Hadoop
> Summit so we can start toying with using it inside of Drill?
>
> J
>
> On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler
> <kk...@transpac.com>wrote:
>
>> Hi all,
>>
>> I've been trying to track down status/comparisons of various columnar
>> formats, and just heard about Parquet.
>>
>> I don't have any direct experience with Parquet, but Really Smart Guy said:
>>
>> > From what I hear there are two key features that
>> > differentiate it from ORC and Trevni: 1) columns can be optionally
>> > split into separate files, and 2) the mechanism for shredding nested
>> > fields into columns is taken almost verbatim from Dremel. Feature (1)
>> > won't be practical to use until Hadoop introduces support for a file
>> > group locality feature, but once it does this feature should enable
>> > more efficient use of the buffer cache for predicate pushdown
>> > operations.
>>
>> -- Ken
>>
>>
>> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>>
>> > Parquet is actually implementing the algorithm described in the
>> > "Nested Columnar Storage" section of the Dremel paper[1].
>> >
>> > [1] http://research.google.com/pubs/pub36632.html
>> >
>> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <tn...@gmail.com>
>> wrote:
>> >> Just saw this:
>> >>
>> >> http://t.co/ES1dGDZlKA
>> >>
>> >> I know Trevni is another Dremel inspired Columnar format as well, anyone
>> >> saw much info Parquet and how it's different?
>> >>
>> >> Tim
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>>
>>
>>
>>
>>
>>
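
For readers who have not seen the Dremel encoding that Julien and Ken refer
to above, here is a minimal sketch of how one nested column is shredded into
(value, repetition level, definition level) triples. It is an illustration
only, not Parquet's code: the function name `shred_name_language_code` is
hypothetical, the logic is hard-coded to a single path with two repeated
ancestors, and the field names and sample records are borrowed from the
Name.Language.Code example in the Dremel paper rather than from this thread.

```python
from typing import List, Optional, Tuple

# One stored entry of the column: (value, repetition level, definition level).
Triple = Tuple[Optional[str], int, int]


def shred_name_language_code(records: List[List[List[str]]]) -> List[Triple]:
    """Shred the path Name.Language.Code into one flat column.

    Assumed schema (taken from the Dremel paper's example):
    Name is repeated, Language is repeated, Code is required.
    Each record is modelled as a list of Names, and each Name as a
    list of language codes.
    """
    column: List[Triple] = []
    for names in records:
        if not names:
            # No Name at all: nothing on the path is defined.
            column.append((None, 0, 0))
            continue
        for i, codes in enumerate(names):
            # The first Name starts a new record (r=0); later Names repeat at level 1.
            r_name = 0 if i == 0 else 1
            if not codes:
                # Name is present but has no Language: only one level is defined.
                column.append((None, r_name, 1))
                continue
            for j, code in enumerate(codes):
                # The first Language inherits the Name's level; later ones repeat at level 2.
                r = r_name if j == 0 else 2
                column.append((code, r, 2))
    return column


# The two sample records, reduced to this one path.
r1 = [["en-us", "en"], [], ["en-gb"]]
r2 = [[]]
for triple in shred_name_language_code([r1, r2]):
    print(triple)
```

The repetition level says at which repeated ancestor a value starts a new
entry (0 means a new record). The definition level says how many of the
repeated or optional ancestors are actually present, which is how empty
lists survive the round trip.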

Re: Another columnar format Parquet

Posted by Julien Le Dem <ju...@twitter.com>.

The mailing list: parquet-dev@googlegroups.com

On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <to...@cloudera.com> wrote:
> Hey Jacques,
>
> Feel free to ping us with any questions. Despite some of the _users_ of
> Parquet competing with each other (eg query engines), we hope the file
> format itself can be easily implemented by everyone and become ubiquitous.
>
> There are a few changes still in flight that we're working on, so you may
> want to join the parquet dev mailing list as well to follow along.
>
> Thanks
> -Todd
>
> On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <ja...@apache.org> wrote:
>
>> When you said soon, you meant very soon.  This looks like great work.
>>  Thanks for sharing it with the world.  Will come back after spending some
>> time with it.
>>
>> thanks again,
>> Jacques
>>
>>
>>
>> On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <ju...@twitter.com> wrote:
>>
>> > The repo is now available: http://parquet.github.com/
>> > Let me know if you have questions
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

Cool.  That was the expected answer, but it is still great to hear.

On Tue, Mar 12, 2013 at 11:53 AM, Julien Le Dem <ju...@twitter.com> wrote:

> Pull requests are more than welcome.
> You can open an issue on github or email the list to start a discussion
> Julien
>
> On Tue, Mar 12, 2013 at 11:29 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
> > Bummer, that's what I figured.   That just means there is an opportunity
> > for extension, right? :)
> >
> > J
> >
> >
> > On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <to...@cloudera.com> wrote:
> >
> >> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>
> >> > Joined, thanks.  I'm glad that the approach was open for this.  I think
> >> > that helps its chances to be ubiquitous.  As much as this might be
> >> > blasphemous to some, I really hope that the final solution to the query
> >> > wars is a collaborative solution as opposed to a competitive one.
> >> >
> >> > Having not looked at the code yet, do the existing read interfaces
> >> > support working with "late materialization" execution strategies similar
> >> > to some of the ideas at [1]?  Definitely seems harder to implement in a
> >> > nested/repeated environment but wanted to get a sense of the thinking
> >> > behind the initial efforts.
> >> >
> >>
> >> The existing read interface in Java is tuple-at-a-time, but there's no
> >> reason one couldn't build a column-at-a-time late materialization approach.
> >> It would just be a lot more "custom", and not directly user-usable, so
> >> there's none in the initial implementation.
> >>
> >> Like you said, it's a little tougher with arbitrary nesting, but I think
> >> still doable.
> >>
> >> -Todd
> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
>
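
To make the tuple-at-a-time versus column-at-a-time distinction in Todd's
reply concrete, here is a small sketch over flat in-memory columns.
Everything in it is assumed for illustration: the column names, the
`scan_early` and `scan_late` helpers, and the list-based layout are
hypothetical and are not the Parquet Java read interface. Nested and
repeated fields, which Todd notes are harder, are deliberately left out.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Columns = Dict[str, List[Any]]


def scan_early(columns: Columns, pred_col: str,
               pred: Callable[[Any], bool],
               out_cols: Sequence[str]) -> List[Tuple[Any, ...]]:
    """Tuple-at-a-time: assemble every full record, then filter it."""
    n_rows = len(columns[pred_col])
    result = []
    for i in range(n_rows):
        # Every column is touched for every row, even rows that get discarded.
        record = {name: values[i] for name, values in columns.items()}
        if pred(record[pred_col]):
            result.append(tuple(record[c] for c in out_cols))
    return result


def scan_late(columns: Columns, pred_col: str,
              pred: Callable[[Any], bool],
              out_cols: Sequence[str]) -> List[Tuple[Any, ...]]:
    """Column-at-a-time: filter one column, then fetch only what survives."""
    # Step 1: scan only the predicate column and remember matching positions.
    positions = [i for i, v in enumerate(columns[pred_col]) if pred(v)]
    # Step 2: read the requested columns at those positions only.
    return [tuple(columns[c][i] for c in out_cols) for i in positions]


# Hypothetical sample data.
table: Columns = {
    "user_id": [1, 2, 3, 4],
    "country": ["US", "FR", "US", "JP"],
    "bytes": [10, 20, 30, 40],
}


def is_us(value: Any) -> bool:
    return value == "US"


print(scan_early(table, "country", is_us, ["user_id", "bytes"]))
print(scan_late(table, "country", is_us, ["user_id", "bytes"]))
```

Both functions return the same rows. The difference is that `scan_late`
never reads `user_id` or `bytes` for rows the predicate rejects, which is
where a columnar engine would save decompression and assembly work.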

Re: Another columnar format Parquet

Posted by Julien Le Dem <ju...@twitter.com>.

Pull requests are more than welcome.
You can open an issue on GitHub or email the list to start a discussion.
Julien

On Tue, Mar 12, 2013 at 11:29 AM, Jacques Nadeau <ja...@apache.org> wrote:
> Bummer, that's what I figured.   That just means there is an opportunity
> for extension, right? :)
>
> J
>
>
> On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>> > Joined, thanks.  I'm glad that the approach was open for this.  I think
>> > that helps it chances to be ubiquitous.  As much as this might be
>> > blasphemous to some, I really hope that the final solution to the query
>> > wars is a collaborative solution as opposed to a competitive one.
>> >
>> > Having not looked at the code yet, do the existing read interfaces
>> support
>> > working with "late materialization" execution strategies similar to some
>> of
>> > the ideas at [1]?  Definitely seems harder to implement in a
>> > nested/repeated environment but wanted to get a sense of the thinking
>> > behind the initial efforts.
>> >
>>
>> The existing read interface in Java is tuple-at-a-time, but there's no
>> reason one couldn't build a column-at-a-time late materialization approach.
>> It would just be a lot more "custom", and not directly user-usable, so
>> there's none in the initial implementation.
>>
>> Like you said, it's a little tougher with arbitrary nesting, but I think
>> still doable.
>>
>> -Todd
>>
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>

Re: Another columnar format Parquet

Posted by Jacques Nadeau <ja...@apache.org>.

Bummer, that's what I figured.   That just means there is an opportunity
for extension, right? :)

J


On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <to...@cloudera.com> wrote:

> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > Joined, thanks.  I'm glad that the approach was open for this.  I think
> > that helps it chances to be ubiquitous.  As much as this might be
> > blasphemous to some, I really hope that the final solution to the query
> > wars is a collaborative solution as opposed to a competitive one.
> >
> > Having not looked at the code yet, do the existing read interfaces
> support
> > working with "late materialization" execution strategies similar to some
> of
> > the ideas at [1]?  Definitely seems harder to implement in a
> > nested/repeated environment but wanted to get a sense of the thinking
> > behind the initial efforts.
> >
>
> The existing read interface in Java is tuple-at-a-time, but there's no
> reason one couldn't build a column-at-a-time late materialization approach.
> It would just be a lot more "custom", and not directly user-usable, so
> there's none in the initial implementation.
>
> Like you said, it's a little tougher with arbitrary nesting, but I think
> still doable.
>
> -Todd
>
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

GPL dependencies are always a problem.  This support would be a great
candidate for an external project.

On Wed, Mar 13, 2013 at 2:08 PM, Tsuyoshi OZAWA <oz...@gmail.com> wrote:

> One alternative columnar store is WiredTiger, which is used by amazon.com.
> It provides columnar and record-style storage through a library
> API, like Berkeley DB.
>
> One concern is that WiredTiger is licensed under the GPL and BSD licenses.
> However, supporting it could strengthen the Drill project.
>
> http://wiredtiger.com/
>
> On Wed, Mar 13, 2013 at 4:22 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Can you bring 5 slides on parquet?  (ppt or pptx?)
> >
> > On Tue, Mar 12, 2013 at 8:59 PM, Julien Le Dem <ju...@twitter.com>
> wrote:
> >
> >> I should be able to come to the Drill meetup tomorrow.
> >> We can chat about it then.
> >> Julien
> >>
> >> On Tue, Mar 12, 2013 at 1:43 PM, Dmitriy Ryaboy <dv...@gmail.com>
> >> wrote:
> >> > ColumnIO implementations return values from a column independently of
> >> > other columns; RecordReaderImplementation does materialize the whole
> >> > record (by using a bunch of column readers at the same time). You could
> >> > construct a column-at-a-time, late materialization API by dropping
> >> > directly into using column readers; so it just depends on which level
> >> > of abstraction you want to hook up with.
> >> >
> >> > We were initially concerned with "record-oriented" frameworks so we
> >> > built the record materialization machinery for them first; a more truly
> >> > columnar engine should work with ColumnIO instead of RecordReaders.
> >> >
> >> > Also, since the API is still young, it's certainly open to discussion
> >> > and improvement.
> >> >
> >> > D
> >> >
> >> >
> >> > On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <to...@cloudera.com>
> wrote:
> >> >
> >> >> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <jacques@apache.org
> >
> >> >> wrote:
> >> >>
> >> >> > Joined, thanks.  I'm glad that the approach was open for this.  I
> >> think
> >> >> > that helps it chances to be ubiquitous.  As much as this might be
> >> >> > blasphemous to some, I really hope that the final solution to the
> >> query
> >> >> > wars is a collaborative solution as opposed to a competitive one.
> >> >> >
> >> >> > Having not looked at the code yet, do the existing read interfaces
> >> >> support
> >> >> > working with "late materialization" execution strategies similar to
> >> some
> >> >> of
> >> >> > the ideas at [1]?  Definitely seems harder to implement in a
> >> >> > nested/repeated environment but wanted to get a sense of the
> thinking
> >> >> > behind the initial efforts.
> >> >> >
> >> >>
> >> >> The existing read interface in Java is tuple-at-a-time, but there's
> no
> >> >> reason one couldn't build a column-at-a-time late materialization
> >> approach.
> >> >> It would just be a lot more "custom", and not directly user-usable,
> so
> >> >> there's none in the initial implementation.
> >> >>
> >> >> Like you said, it's a little tougher with arbitrary nesting, but I
> think
> >> >> still doable.
> >> >>
> >> >> -Todd
> >> >>
> >> >> >
> >> >> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <to...@cloudera.com>
> >> wrote:
> >> >> >
> >> >> > > Hey Jacques,
> >> >> > >
> >> >> > > Feel free to ping us with any questions. Despite some of the
> >> _users_ of
> >> >> > > Parquet competing with each other (eg query engines), we hope the
> >> file
> >> >> > > format itself can be easily implemented by everyone and become
> >> >> > ubiquitous.
> >> >> > >
> >> >> > > There are a few changes still in flight that we're working on, so
> >> you
> >> >> may
> >> >> > > want to join the parquet dev mailing list as well to follow
> along.
> >> >> > >
> >> >> > > Thanks
> >> >> > > -Todd
>
> --
> - Tsuyoshi
>

Re: Another columnar format Parquet

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.

Another columnar storage option is WiredTiger, which is used by amazon.com.
It provides both columnar and record-style storage through a library
API similar to Berkeley DB.

One concern is that WiredTiger is licensed under GPL and BSD.
However, supporting it could strengthen the Drill project.

http://wiredtiger.com/

On Wed, Mar 13, 2013 at 4:22 PM, Ted Dunning <te...@gmail.com> wrote:
> Can you bring 5 slides on parquet?  (ppt or pptx?)



--
- Tsuyoshi

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

Can you bring 5 slides on parquet?  (ppt or pptx?)

On Tue, Mar 12, 2013 at 8:59 PM, Julien Le Dem <ju...@twitter.com> wrote:

> I should be able to come to the Drill meetup tomorrow.
> We can chat about it then.
> Julien

Re: Another columnar format Parquet

Posted by Julien Le Dem <ju...@twitter.com>.

I should be able to come to the Drill meetup tomorrow.
We can chat about it then.
Julien

On Tue, Mar 12, 2013 at 1:43 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> ColumnIO implementations return values from a column independently of other
> columns; RecordReaderImplementation does materialize the whole record (by
> using a bunch of column readers at the same time). You could construct a
> column-at-a-time, late materialization api by dropping directly into using
> column readers; so it just depends on which level of abstraction you want
> to hook up with.
>
> We were initially concerned with "record-oriented" frameworks so we built
> the record materialization machinery for them first; a  more truly columnar
> engine should work with ColumnIO instead of RecordReaders.
>
> Also, since the API is still young, it's certainly open to discussion and
> improvement.
>
> D

Re: Another columnar format Parquet

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

ColumnIO implementations return values from a column independently of other
columns; RecordReaderImplementation does materialize the whole record (by
using a bunch of column readers at the same time). You could construct a
column-at-a-time, late-materialization API by dropping down to the column
readers directly, so it just depends on which level of abstraction you want
to hook into.

We were initially concerned with "record-oriented" frameworks, so we built
the record materialization machinery for them first; a more truly columnar
engine should work with ColumnIO instead of RecordReaders.

Also, since the API is still young, it's certainly open to discussion and
improvement.
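
As a rough illustration of the two levels of abstraction described above, a
record reader can be thought of as a thin layer that advances several
independent column readers in lockstep. This is only a sketch, not the actual
Parquet API: the ColumnReader and RecordReader classes below are hypothetical
names, and the schema is assumed to be flat (no nested or repeated fields).

```python
from typing import Any, Dict, Iterator, List


class ColumnReader:
    """Hypothetical reader that yields the values of a single column."""

    def __init__(self, name: str, values: List[Any]) -> None:
        self.name = name
        self._values = values

    def __iter__(self) -> Iterator[Any]:
        # Values come back independently of every other column.
        return iter(self._values)


class RecordReader:
    """Hypothetical reader that materializes whole records."""

    def __init__(self, columns: List[ColumnReader]) -> None:
        self._columns = columns

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        names = [c.name for c in self._columns]
        # Advance all column readers in lockstep, one record per step.
        for row in zip(*self._columns):
            yield dict(zip(names, row))


ids = ColumnReader("id", [1, 2, 3])
langs = ColumnReader("lang", ["en", "fr", "en"])

# Column-level access: touch only the column you need.
id_values = list(ids)

# Record-level access: every column is read and assembled.
records = list(RecordReader([ids, langs]))
```

An engine that hooks in at the column level would work with objects like
ColumnReader directly and skip the assembly step in RecordReader entirely.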

D


On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <to...@cloudera.com> wrote:

> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > Joined, thanks.  I'm glad that the approach was open for this.  I think
> > that helps it chances to be ubiquitous.  As much as this might be
> > blasphemous to some, I really hope that the final solution to the query
> > wars is a collaborative solution as opposed to a competitive one.
> >
> > Having not looked at the code yet, do the existing read interfaces support
> > working with "late materialization" execution strategies similar to some of
> > the ideas at [1]?  Definitely seems harder to implement in a
> > nested/repeated environment but wanted to get a sense of the thinking
> > behind the initial efforts.
> >
>
> The existing read interface in Java is tuple-at-a-time, but there's no
> reason one couldn't build a column-at-a-time late materialization approach.
> It would just be a lot more "custom", and not directly user-usable, so
> there's none in the initial implementation.
>
> Like you said, it's a little tougher with arbitrary nesting, but I think
> still doable.
>
> -Todd

Re: Another columnar format Parquet

Posted by Todd Lipcon <to...@cloudera.com>.

On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Joined, thanks.  I'm glad that the approach was open for this.  I think
> that helps its chances of becoming ubiquitous.  As much as this might be
> blasphemous to some, I really hope that the final solution to the query
> wars is a collaborative solution as opposed to a competitive one.
>
> Having not looked at the code yet, do the existing read interfaces support
> working with "late materialization" execution strategies similar to some of
> the ideas at [1]?  Definitely seems harder to implement in a
> nested/repeated environment but wanted to get a sense of the thinking
> behind the initial efforts.
>

The existing read interface in Java is tuple-at-a-time, but there's no
reason one couldn't build a column-at-a-time late materialization approach.
It would just be a lot more "custom", and not directly user-usable, so
there's none in the initial implementation.

Like you said, it's a little tougher with arbitrary nesting, but I think
still doable.
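
To make the distinction concrete, here is a minimal sketch on flat (non-nested)
data. Everything in it is hypothetical: the `columns` dict, the function names
and the predicate are illustrations, not the Parquet read interface. It only
contrasts building every tuple before filtering with filtering one column first
and fetching the other column at the surviving positions.

```python
# Toy column store: each column is a plain Python list (assumption, flat data only).
columns = {
    "user": ["ann", "bob", "cy", "dee"],
    "bytes": [10, 900, 40, 700],
}

def tuple_at_a_time(columns, min_bytes):
    # Early materialization: assemble every row, then apply the predicate.
    rows = zip(columns["user"], columns["bytes"])
    return [user for user, nbytes in rows if nbytes >= min_bytes]

def late_materialization(columns, min_bytes):
    # Scan only the predicate column and remember the matching positions...
    positions = [i for i, nbytes in enumerate(columns["bytes"]) if nbytes >= min_bytes]
    # ...then touch the projected column only at those positions.
    return [columns["user"][i] for i in positions]

print(tuple_at_a_time(columns, 500))
print(late_materialization(columns, 500))
```

Both functions return the same users; the second never builds a tuple for a
row that fails the predicate. With nested or repeated fields the position list
would also have to track record boundaries, which is the harder part.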

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Another columnar format Parquet

Posted by Jacques Nadeau <ja...@apache.org>.

Joined, thanks.  I'm glad that the approach was open for this.  I think
that helps its chances of becoming ubiquitous.  As much as this might be
blasphemous to some, I really hope that the final solution to the query
wars is a collaborative solution as opposed to a competitive one.

Having not looked at the code yet, do the existing read interfaces support
working with "late materialization" execution strategies similar to some of
the ideas at [1]?  Definitely seems harder to implement in a
nested/repeated environment but wanted to get a sense of the thinking
behind the initial efforts.

thanks again,
Jacques

[1] http://cs-www.cs.yale.edu/homes/dna/papers/abadisigmod06.pdf

On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Jacques,
>
> Feel free to ping us with any questions. Despite some of the _users_ of
> Parquet competing with each other (eg query engines), we hope the file
> format itself can be easily implemented by everyone and become ubiquitous.
>
> There are a few changes still in flight that we're working on, so you may
> want to join the parquet dev mailing list as well to follow along.
>
> Thanks
> -Todd

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

Looking at this more carefully the other day with Jacques makes it seem that

a) as Owen says, ORC has a more elaborate type structure.  The data stored
is equivalent (see the Protobuf versus Avro versus Thrift discussions),
subject to the possible (null, null) difference that Owen mentions.

b) as the Dremel paper points out, access to a very rare repeated structure
inside a common repeated structure requires traversing only the rare
element when using Dremel-style structures, whereas with repeat counts it
requires an additional traversal of a much denser column.  How much
difference this makes in practice is unknown, but one can clearly imagine
cases where it causes an orders-of-magnitude difference in favor of
Parquet.  Those cases may, however, be vanishingly rare.

On Mon, Apr 15, 2013 at 4:06 PM, Owen O'Malley <om...@apache.org> wrote:

> Just a bit saying whether the record was present or null. Note that this is
> strictly more expressive than Parquet's format in that it can encode
> structures with all null values. I believe the Parquet encoder would
> discard a row of the form (null, null) since it wouldn't have any leaves to
> make it materialize.
>
> -- Owen
>
>
> On Wed, Apr 10, 2013 at 3:48 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Wed, Apr 10, 2013 at 10:17 AM, Owen O'Malley <om...@apache.org>
> > wrote:
> >
> > > Ted,
> > >    ORC does support nested structures and splits them into primitive
> > > columns.
> >
> >
> > Good to hear.
> >
> >
> > > ...
> > > create table Foo (
> > >   complex: struct<field1: int, field2: map<string, int>>
> > >   simple: timestamp
> > > );
> > >
> > > will end up with a prefix-order flattening of the columns:
> > >
> > > columns:
> > > 0 - top level record (struct, children: 1, 6)
> > > 1 - complex (struct, children: 2, 3)
> > >
> >
> > What is stored in column 1?
> >
>

Re: Another columnar format Parquet

Posted by Owen O'Malley <om...@apache.org>.

Just a bit saying whether the record was present or null. Note that this is
strictly more expressive than Parquet's format in that it can encode
structures with all null values. I believe the Parquet encoder would
discard a row of the form (null, null) since it wouldn't have any leaves to
make it materialize.
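
A minimal sketch of the presence-bit idea, under stated assumptions: the
`encode`/`decode` helpers and the list-based "streams" are hypothetical, not
ORC's on-disk layout, and the child columns are assumed to hold entries only
for rows where the parent struct is present. It shows that a null struct and
a struct whose fields are all null stay distinguishable.

```python
def encode(rows):
    # One presence bit per row for the struct column itself.
    struct_present = [row is not None for row in rows]
    # Child columns only see rows where the struct exists (assumption).
    live = [row for row in rows if row is not None]
    field_a = [row[0] for row in live]
    field_b = [row[1] for row in live]
    return struct_present, field_a, field_b

def decode(struct_present, field_a, field_b):
    fields = iter(zip(field_a, field_b))
    # Consume one child entry per present struct, emit None otherwise.
    return [next(fields) if present else None for present in struct_present]

rows = [(1, 2), (None, None), None]
encoded = encode(rows)
print(encoded)
print(decode(*encoded))
```

The row `(None, None)` survives the round trip as a present struct with two
null fields, while the third row comes back as a null struct.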

-- Owen

On Wed, Apr 10, 2013 at 3:48 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Apr 10, 2013 at 10:17 AM, Owen O'Malley <om...@apache.org>
> wrote:
>
> > Ted,
> >    ORC does support nested structures and splits them into primitive
> > columns.
>
>
> Good to hear.
>
>
> > ...
> > create table Foo (
> >   complex: struct<field1: int, field2: map<string, int>>
> >   simple: timestamp
> > );
> >
> > will end up with a prefix-order flattening of the columns:
> >
> > columns:
> > 0 - top level record (struct, children: 1, 6)
> > 1 - complex (struct, children: 2, 3)
> >
>
> What is stored in column 1?
>

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Apr 10, 2013 at 10:17 AM, Owen O'Malley <om...@apache.org> wrote:

> Ted,
>    ORC does support nested structures and splits them into primitive
> columns.


Good to hear.


> ...
> create table Foo (
>   complex: struct<field1: int, field2: map<string, int>>
>   simple: timestamp
> );
>
> will end up with a prefix-order flattening of the columns:
>
> columns:
> 0 - top level record (struct, children: 1, 6)
> 1 - complex (struct, children: 2, 3)
>

What is stored in column 1?

Re: Another columnar format Parquet

Posted by Owen O'Malley <om...@apache.org>.

Ted,
   ORC does support nested structures and splits them into primitive
columns. That is required to get the benefits of type-specific encodings.
ORC doesn't do the complex repetition and definition levels that require a
DFA to reassemble the rows, but takes the more straightforward approach of
recording the information for the intermediate columns.

create table Foo (
  complex: struct<field1: int, field2: map<string, int>>,
  simple: timestamp
);

will end up with a prefix-order flattening of the columns:

columns:
0 - top level record (struct, children: 1, 6)
1 - complex (struct, children: 2, 3)
2 - field1 (int)
3 - field2 (map, children: 4, 5)
4 - map key (string)
5 - map value (int)
6 - simple (timestamp)
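
The numbering above can be reproduced with a short pre-order walk. This is a
sketch only: the `(name, kind, children)` tuples and the `flatten` helper are
hypothetical and are not ORC's type API.

```python
def flatten(node, columns=None):
    """Assign column ids in prefix (pre-order) traversal order."""
    if columns is None:
        columns = []
    name, kind, children = node
    entry = {"id": len(columns), "name": name, "kind": kind, "children": []}
    columns.append(entry)
    for child in children:
        # The child's id is the next free slot at the moment we descend into it.
        entry["children"].append(len(columns))
        flatten(child, columns)
    return columns

# Toy version of table Foo from the message above.
foo = ("top level record", "struct", [
    ("complex", "struct", [
        ("field1", "int", []),
        ("field2", "map", [
            ("map key", "string", []),
            ("map value", "int", []),
        ]),
    ]),
    ("simple", "timestamp", []),
])

columns = flatten(foo)
for col in columns:
    print(col["id"], "-", col["name"], "(" + col["kind"] + ")", col["children"])
```

Because a parent is numbered before its subtree, the ids of a struct's
children are not necessarily consecutive: the top-level record gets children
1 and 6.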

Instead of encoding the definition and repetition levels in columns 4 and
5, ORC encodes the number of entries in the map in the data for column 3.
It would be very interesting to take the githubarchive.org logs and put
them into ORC and Parquet and measure the resulting file sizes. (Other
thoughts about such a comparison: try compressed versus uncompressed and
turning off ORC's indexes since Parquet doesn't have indexes.)
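
As a rough illustration of storing entry counts on the map column rather than
levels on the leaves, here is a sketch with hypothetical names (`lengths`,
`keys`, `values`, `decode`). It ignores null maps and real stream encodings.

```python
rows = [{"a": 1, "b": 2}, {}, {"c": 3}]

# "Column 3": one entry count per row.
lengths = [len(m) for m in rows]
# "Columns 4 and 5": keys and values concatenated across all rows.
keys = [k for m in rows for k in m]
values = [v for m in rows for v in m.values()]

def decode(lengths, keys, values):
    out, pos = [], 0
    for n in lengths:
        # Slice the next n keys/values off the flat columns to rebuild one map.
        out.append(dict(zip(keys[pos:pos + n], values[pos:pos + n])))
        pos += n
    return out

print(lengths, keys, values)
print(decode(lengths, keys, values))
```

Reassembly is a running offset over the counts, so no state machine is
needed; the trade-off is that finding the n-th row's entries means summing
the counts before it.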

-- Owen


On Tue, Apr 9, 2013 at 8:43 AM, Timothy Chen <tn...@gmail.com> wrote:

> Hi Ted,
>
> Can you explain more about the question you have about encoding in ORC?
>
> Tim
>
> Sent from my iPhone
>
> On Apr 4, 2013, at 11:01 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > Yes it does.
> >
> > I have seen conflicting docs on format it uses.  One seemed to say that
> > complex cells were stored within a single cell.  The other seemed to say
> > that nested structures were shredded in the style of Parquet or Dremel.
> >
> > One thing that I worry about with ORC is that it exactly replicates the
> > schema model of Hive which isn't as congenial (to me) as the protobuf
> style
> > of Parquet.  As Julien mentioned in the Drill meetup, there is also the
> > question of the correctness of the encoding.  The Dremel column shredding
> > is pretty subtle.  Hopefully ORC authors started from first principles in
> > designing the encoding.

Re: Another columnar format Parquet

Posted by Timothy Chen <tn...@gmail.com>.

Hi Ted,

Can you explain more about the question you have about encoding in ORC? 

Tim


On Apr 4, 2013, at 11:01 PM, Ted Dunning <te...@gmail.com> wrote:

> Yes it does.
> 
> I have seen conflicting docs on format it uses.  One seemed to say that
> complex cells were stored within a single cell.  The other seemed to say
> that nested structures were shredded in the style of Parquet or Dremel.
> 
> One thing that I worry about with ORC is that it exactly replicates the
> schema model of Hive which isn't as congenial (to me) as the protobuf style
> of Parquet.  As Julien mentioned in the Drill meetup, there is also the
> question of the correctness of the encoding.  The Dremel column shredding
> is pretty subtle.  Hopefully ORC authors started from first principles in
> designing the encoding.

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

Yes it does.

I have seen conflicting docs on the format it uses.  One seemed to say that
complex cells were stored within a single cell.  The other seemed to say
that nested structures were shredded in the style of Parquet or Dremel.

One thing that I worry about with ORC is that it exactly replicates the
schema model of Hive, which isn't as congenial (to me) as the protobuf style
of Parquet.  As Julien mentioned in the Drill meetup, there is also the
question of the correctness of the encoding.  The Dremel column shredding
is pretty subtle.  Hopefully ORC authors started from first principles in
designing the encoding.


On Fri, Apr 5, 2013 at 1:12 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Does ORC support nested data?  How does it compare to the Dremel encoding
> approach that Parquet utilizes?
>
> Thanks,
> Jacques
>
> On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <om...@apache.org>
> wrote:
>
> > On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > So is it fair to say that Parquet will be open to contributions and
> will
> > > hopefully develop an open community to drive it?
> > >
> > > If so, that is an excellent development.
> > >
> > > Is ORC file well enough developed for a comparison?
> > >
> >
> > ORC is committed to Hive's trunk and seems more feature complete than
> > Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
> > encoder yet. Obviously, if you have questions about ORC, please ask over
> on
> > Hive's dev list.
> >
> > -- Owen
> >
> >
> >
>

Re: Another columnar format Parquet

Posted by Jacques Nadeau <ja...@apache.org>.

Does ORC support nested data?  How does it compare to the Dremel encoding
approach that Parquet utilizes?

Thanks,
Jacques

On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <om...@apache.org> wrote:

> On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > So is it fair to say that Parquet will be open to contributions and will
> > hopefully develop an open community to drive it?
> >
> > If so, that is an excellent development.
> >
> > Is ORC file well enough developed for a comparison?
> >
>
> ORC is committed to Hive's trunk and seems more feature complete than
> Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
> encoder yet. Obviously, if you have questions about ORC, please ask over on
> Hive's dev list.
>
> -- Owen
>
>
>

Re: Another columnar format Parquet

Posted by Owen O'Malley <om...@apache.org>.

On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <te...@gmail.com> wrote:

> So is it fair to say that Parquet will be open to contributions and will
> hopefully develop an open community to drive it?
>
> If so, that is an excellent development.
>
> Is ORC file well enough developed for a comparison?
>

ORC is committed to Hive's trunk and seems more feature-complete than
Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
encoder yet. Obviously, if you have questions about ORC, please ask over on
Hive's dev list.

-- Owen



Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

So is it fair to say that Parquet will be open to contributions and will
hopefully develop an open community to drive it?

If so, that is an excellent development.

Is the ORC file format developed well enough for a comparison?

On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Jacques,
>
> Feel free to ping us with any questions. Despite some of the _users_ of
> Parquet competing with each other (eg query engines), we hope the file
> format itself can be easily implemented by everyone and become ubiquitous.
>
> There are a few changes still in flight that we're working on, so you may
> want to join the parquet dev mailing list as well to follow along.
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Another columnar format Parquet

Posted by Todd Lipcon <to...@cloudera.com>.

Hey Jacques,

Feel free to ping us with any questions. Despite some of the _users_ of
Parquet competing with each other (e.g., query engines), we hope the file
format itself can be easily implemented by everyone and become ubiquitous.

There are a few changes still in flight that we're working on, so you may
want to join the parquet dev mailing list as well to follow along.

Thanks
-Todd

On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <ja...@apache.org> wrote:

> When you said soon, you meant very soon.  This looks like great work.
>  Thanks for sharing it with the world.  Will come back after spending some
> time with it.
>
> thanks again,
> Jacques



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Another columnar format Parquet

Posted by Jacques Nadeau <ja...@apache.org>.

When you said soon, you meant very soon.  This looks like great work.
Thanks for sharing it with the world.  Will come back after spending some
time with it.

thanks again,
Jacques



On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <ju...@twitter.com> wrote:

> The repo is now available: http://parquet.github.com/
> Let me know if you have questions

Re: Another columnar format Parquet

Posted by Julien Le Dem <ju...@twitter.com>.

The repo is now available: http://parquet.github.com/
Let me know if you have questions

On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <ja...@apache.org> wrote:
> There definitely seem to be some new kids on the block.  I really hope that
> Drill can adopt either ORC or Parquet as a closely related "native" format.
>   At the moment, I'm actually more focused on the in-memory execution
> format and the right abstraction to support compressed columnar execution
> and vectorization.  Historically, the biggest gaps I'd worry about are
> java-centricity and expectation of early materialization & decompression.
>  Once we get some execution stuff working, lets see how each fits in.
>  Rather than start a third competing format (or fourth if you count
> Trevni), let's either use or extend/contribute back on one of the existing
> new kids.
>
> Julien, do you think more will be shared about Parquet before the Hadoop
> Summit so we can start toying with using it inside of Drill?
>
> J

Re: Another columnar format Parquet

Posted by Ted Dunning <te...@gmail.com>.

One of the key features that I would be looking for is support for column
families.  This provides an opportunity to avoid much of the muddle that
RCFiles had in the first place, since different column families are
presumed to be read at different times.
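
To make the column-family idea concrete, here is a rough Python sketch.
It is only an illustration under my own assumptions: the function names,
the family names, and the in-memory dict layout are all hypothetical and
are not taken from RCFile, ORC, or Parquet.

```python
def split_into_families(rows, families):
    """Store each column family as its own list of narrow rows.

    `families` maps a (hypothetical) family name to the columns it holds.
    """
    return {
        name: [{col: row[col] for col in cols} for row in rows]
        for name, cols in families.items()
    }


def scan(stored, family, column):
    """Read one column while touching only the family that holds it."""
    return [narrow_row[column] for narrow_row in stored[family]]


rows = [
    {"id": 1, "name": "a", "clicks": 10},
    {"id": 2, "name": "b", "clicks": 20},
]
# Assumption: "profile" columns and "activity" columns are read at different times.
families = {"profile": ["id", "name"], "activity": ["clicks"]}
stored = split_into_families(rows, families)
clicks = scan(stored, "activity", "clicks")
```

A query over clicks never has to read the "profile" family at all, which
is the benefit of grouping columns by when they are read.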

On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <ja...@apache.org> wrote:

> There definitely seem to be some new kids on the block.  I really hope that
> Drill can adopt either ORC or Parquet as a closely related "native" format.
>   At the moment, I'm actually more focused on the in-memory execution
> format and the right abstraction to support compressed columnar execution
> and vectorization.  Historically, the biggest gaps I'd worry about are
> java-centricity and expectation of early materialization & decompression.
>  Once we get some execution stuff working, lets see how each fits in.
>  Rather than start a third competing format (or fourth if you count
> Trevni), let's either use or extend/contribute back on one of the existing
> new kids.
>
> Julien, do you think more will be shared about Parquet before the Hadoop
> Summit so we can start toying with using it inside of Drill?
>
> J

Re: Another columnar format Parquet

Posted by Jacques Nadeau <ja...@apache.org>.

There definitely seem to be some new kids on the block.  I really hope that
Drill can adopt either ORC or Parquet as a closely related "native" format.
At the moment, I'm actually more focused on the in-memory execution
format and the right abstraction to support compressed columnar execution
and vectorization.  Historically, the biggest gaps I'd worry about are
Java-centricity and expectation of early materialization & decompression.
Once we get some execution stuff working, let's see how each fits in.
Rather than start a third competing format (or fourth if you count
Trevni), let's either use or extend/contribute back to one of the existing
new kids.

Julien, do you think more will be shared about Parquet before the Hadoop
Summit so we can start toying with using it inside of Drill?

J

On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <kk...@transpac.com> wrote:

> Hi all,
>
> I've been trying to track down status/comparisons of various columnar
> formats, and just heard about Parquet.
>
> I don't have any direct experience with Parquet, but Really Smart Guy said:
>
> > From what I hear there are two key features that
> > differentiate it from ORC and Trevni: 1) columns can be optionally split into
> > separate files, and 2) the mechanism for shredding nested fields into
> > columns is taken almost verbatim from Dremel. Feature (1) won't be practical
> > to use until Hadoop introduces support for a file group locality feature, but once it
> > does this feature should enable more efficient use of the buffer cache for predicate
> > pushdown operations.
>
> -- Ken
>
>
> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>
> > Parquet is actually implementing the algorithm described in the
> > "Nested Columnar Storage" section of the Dremel paper[1].
> >
> > [1] http://research.google.com/pubs/pub36632.html
> >
> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <tn...@gmail.com> wrote:
> >> Just saw this:
> >>
> >> http://t.co/ES1dGDZlKA
> >>
> >> I know Trevni is another Dremel-inspired columnar format as well; has anyone
> >> seen much info on Parquet and how it's different?
> >>
> >> Tim
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

Re: Another columnar format Parquet

Posted by Ken Krugler <kk...@transpac.com>.

Hi all,

I've been trying to track down status/comparisons of various columnar formats, and just heard about Parquet.

I don't have any direct experience with Parquet, but Really Smart Guy said:

> From what I hear there are two key features that
> differentiate it from ORC and Trevni: 1) columns can be optionally split into
> separate files, and 2) the mechanism for shredding nested fields into
> columns is taken almost verbatim from Dremel. Feature (1) won't be practical
> to use until Hadoop introduces support for a file group locality feature, but once it
> does this feature should enable more efficient use of the buffer cache for predicate
> pushdown operations.

-- Ken


On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:

> Parquet is actually implementing the algorithm described in the
> "Nested Columnar Storage" section of the Dremel paper[1].
> 
> [1] http://research.google.com/pubs/pub36632.html
> 
> On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <tn...@gmail.com> wrote:
>> Just saw this:
>> 
>> http://t.co/ES1dGDZlKA
>> 
>> I know Trevni is another Dremel-inspired columnar format as well; has anyone
>> seen much info on Parquet and how it's different?
>> 
>> Tim

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Re: Another columnar format Parquet

Posted by Julien Le Dem <ju...@twitter.com>.

Parquet is actually implementing the algorithm described in the
"Nested Columnar Storage" section of the Dremel paper[1].

[1] http://research.google.com/pubs/pub36632.html

On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <tn...@gmail.com> wrote:
> Just saw this:
>
> http://t.co/ES1dGDZlKA
>
> I know Trevni is another Dremel-inspired columnar format as well; has anyone
> seen much info on Parquet and how it's different?
>
> Tim
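
For readers unfamiliar with the "Nested Columnar Storage" idea referenced in this thread, here is a rough sketch of how one nested column can be shredded into (value, repetition level, definition level) triples, in the style the Dremel paper describes. It is an illustration only and is not taken from Parquet's code: the function name shred_column, the (name, kind) path encoding, and the dict/list record layout are all assumptions made for this example. The sample record follows the Document/Name/Language example used in the Dremel paper.

```python
def shred_column(record, path):
    """Shred one column of a nested record into (value, r, d) triples.

    `path` is a list of (field_name, kind) pairs from the root to the leaf,
    where kind is "required", "optional", or "repeated". This encoding is a
    hypothetical convenience for the sketch, not a Parquet or Dremel API.
    """
    out = []

    def walk(node, depth, rep_level, def_level, rep_depth):
        # rep_depth counts the repeated fields already passed on the path.
        if depth == len(path):
            out.append((node, rep_level, def_level))
            return
        name, kind = path[depth]
        child = node.get(name)
        if kind == "repeated":
            items = child or []
            if not items:
                # Nothing defined below this point: emit a NULL placeholder.
                out.append((None, rep_level, def_level))
                return
            for i, item in enumerate(items):
                # The first item inherits the incoming repetition level;
                # later items repeat at this field's own level.
                r = rep_level if i == 0 else rep_depth + 1
                walk(item, depth + 1, r, def_level + 1, rep_depth + 1)
        elif kind == "optional":
            if child is None:
                out.append((None, rep_level, def_level))
                return
            walk(child, depth + 1, rep_level, def_level + 1, rep_depth)
        else:
            # Required fields do not add to the definition level.
            walk(child, depth + 1, rep_level, def_level, rep_depth)

    walk(record, 0, 0, 0, 0)
    return out


# Sample record modeled on the Dremel paper's example document.
r1 = {
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ]
}

CODE_PATH = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
COUNTRY_PATH = [("Name", "repeated"), ("Language", "repeated"), ("Country", "optional")]

for value, r, d in shred_column(r1, CODE_PATH):
    print(value, r, d)
```

The repetition level records at which repeated field a value starts a new entry, and the definition level records how many optional or repeated fields on the path are actually present, which is what lets NULL placeholders preserve the record structure when columns are read back independently.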