Posted to dev@drill.apache.org by David Gruzman <da...@bigdatacraft.com> on 2012/09/14 22:05:11 UTC

Drill native format

Hi All,
I would like to discuss what the native format for Drill should be. The
original Google Dremel paper defined a hierarchical columnar data format.
Since then Google has shifted away from the hierarchical data format... So
the question is whether it makes sense to stick with it.
If we also move to a simple flat format, we need a format of our own that we
support as "native". In the case of Drill, I would define native support as
"high performance".
I think we can go with some kind of PAX format with comprehensive metadata in
the header, so that each file is completely self-contained and can be
understood and processed without any external data.
The alternative is to have a single file per column. As far as I remember
from our OpenDremel work, the main decision point is whether we can read one
column from the file without loading unnecessary data from other columns
into node memory.
With best regards,
David
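
A minimal Python sketch of the PAX-style idea above, under assumptions that
are not from the thread: a length-prefixed JSON header, int64 columns only,
and the hypothetical names write_pax and read_column. It only illustrates how
a self-describing header lets a reader seek straight to one column's bytes in
each row group; it is not a proposed Drill format.

```python
import io
import json
import struct


def write_pax(rows, schema, rows_per_group, out):
    """Write int64 rows as: 4-byte header length, JSON header, row groups.

    Inside each row group the values of one column are stored contiguously,
    and the header records where every column chunk starts and how long it is.
    """
    body = io.BytesIO()
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        chunks = {}
        for col_idx, name in enumerate(schema):
            data = struct.pack(f"<{len(group)}q", *(r[col_idx] for r in group))
            chunks[name] = [body.tell(), len(data)]  # offset is relative to the body
            body.write(data)
        groups.append({"rows": len(group), "chunks": chunks})
    header = json.dumps({"schema": schema, "groups": groups}).encode("utf-8")
    out.write(struct.pack("<I", len(header)))
    out.write(header)
    out.write(body.getvalue())


def read_column(f, name):
    """Read a single column, seeking past the bytes of all other columns."""
    f.seek(0)
    (header_len,) = struct.unpack("<I", f.read(4))
    meta = json.loads(f.read(header_len))
    base = 4 + header_len  # file position where the body starts
    values = []
    for group in meta["groups"]:
        offset, length = group["chunks"][name]
        f.seek(base + offset)
        values.extend(struct.unpack(f"<{group['rows']}q", f.read(length)))
    return values


buf = io.BytesIO()
write_pax([(1, 10), (2, 20), (3, 30)], ["id", "score"], 2, buf)
score = read_column(buf, "score")
```

Everything needed to interpret the file (schema, row counts, chunk offsets) is
in the header, so no external catalog is consulted in this sketch.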

Re: Drill native format

Posted by "Clark Yang (杨卓荦)" <ya...@gmail.com>.

Hi

I have been working on column storage for a while.
I think the most important thing for distributed column storage is data
locality on MapReduce (see 4.1 in the paper).
That means each horizontal partition is stored on the same node, so that
computation can run locally and data transfer is reduced. To achieve this,
large data sets are usually partitioned horizontally and distributed first,
and partitioned vertically second. This needs a placement strategy; HDFS,
for example, uses a "block placement policy".


Cheers,
Zhuoluo (Clark) Yang
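
The "horizontal first, vertical second" ordering can be sketched in Python as
follows. The function name place_partitions and the round-robin node choice
are assumptions made for illustration; round-robin merely stands in for a real
placement policy and is not how HDFS chooses nodes.

```python
def place_partitions(rows, schema, group_size, nodes):
    """Split rows horizontally, then split each group into columns.

    All column chunks of one horizontal partition are assigned to the same
    node, so a record can be rebuilt without reading from another node.
    """
    placement = {node: [] for node in nodes}
    for group_id, start in enumerate(range(0, len(rows), group_size)):
        group = rows[start:start + group_size]            # horizontal split
        columns = {name: [row[i] for row in group]         # vertical split
                   for i, name in enumerate(schema)}
        node = nodes[group_id % len(nodes)]                # stand-in policy
        placement[node].append((group_id, columns))
    return placement


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
placement = place_partitions(rows, ["id", "name"], 2, ["n1", "n2"])
```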



2012/9/15 karthik tunga <ka...@gmail.com>

> Hi,
>
> This paper (http://arxiv.org/pdf/1105.4252.pdf) compares column-oriented
> storage (one file per column) with RCFile.
> The authors use skip lists and lazy record construction.
>
> Cheers,
> Karthik

Re: Drill native format

Posted by karthik tunga <ka...@gmail.com>.

Hi,

This paper (http://arxiv.org/pdf/1105.4252.pdf) compares column-oriented
storage (one file per column) with RCFile.
The authors use skip lists and lazy record construction.

Cheers,
Karthik
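
For readers unfamiliar with the term, lazy record construction can be sketched
roughly as below. This is an illustration only and not the paper's
implementation: the classes ColumnReader and LazyRecord and the function scan
are hypothetical, columns are plain in-memory lists, and skip lists are left
out.

```python
class ColumnReader:
    """Stands in for a per-column file; counts how many values were fetched."""

    def __init__(self, values):
        self._values = values
        self.reads = 0

    def get(self, row):
        self.reads += 1
        return self._values[row]


class LazyRecord:
    """A record whose fields are fetched from their column only on access."""

    def __init__(self, readers, row):
        self._readers = readers
        self._row = row
        self._cache = {}

    def __getitem__(self, name):
        if name not in self._cache:
            self._cache[name] = self._readers[name].get(self._row)
        return self._cache[name]


def scan(readers, num_rows):
    """Yield one lazy record per row without touching any column yet."""
    for row in range(num_rows):
        yield LazyRecord(readers, row)


readers = {
    "url": ColumnReader(["a.org", "b.com", "c.org"]),
    "body": ColumnReader(["A", "B", "C"]),
}
# The "body" column is read only for rows whose "url" passes the filter.
bodies = [rec["body"] for rec in scan(readers, 3) if rec["url"].endswith(".org")]
```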

On 14 September 2012 17:15, Amir Youssefi <am...@gmail.com> wrote:

> "Nested data is not yet implemented" in BigQuery (if I recall the exact
> words correctly). I am quoting the speaker at the BigQuery presentation at
> the Google Technology User Group last week at the Googleplex (intentionally
> not citing the speaker's name).
>
> -ay

Re: Drill native format

Posted by Amir Youssefi <am...@gmail.com>.

"Nested data is not yet implemented" in BigQuery (if I recall the exact words correctly). I am quoting the speaker at the BigQuery presentation at the Google Technology User Group last week at the Googleplex (intentionally not citing the speaker's name).

-ay

On Sep 14, 2012, at 1:28 PM, David Gruzman <da...@bigdatacraft.com> wrote:

> I assume that the evolution of BigQuery reflects that of Dremel... If
> somebody has information on this, it would be great.
> The storage system should understand that all the files comprising a
> horizontal partition of the table are one logical entity, and should store
> them together or in close proximity. I agree that PAX will be much more
> convenient. The question is: is there a performance penalty for PAX versus
> file per column?
> David

Re: Drill native format

Posted by David Gruzman <da...@bigdatacraft.com>.

I assume that the evolution of BigQuery reflects that of Dremel... If
somebody has information on this, it would be great.
The storage system should understand that all the files comprising a
horizontal partition of the table are one logical entity, and should store
them together or in close proximity. I agree that PAX will be much more
convenient. The question is: is there a performance penalty for PAX versus
file per column?
David

On Fri, Sep 14, 2012 at 11:21 PM, Tomer Shiran <ts...@maprtech.com> wrote:

> Is there any public information suggesting that Google moved away from
> supporting nested data? Clearly BigQuery doesn't yet allow nested data, but
> not sure that applies to Dremel.
>
> There are challenges with one file per column. How do you ensure that a
> single record is located on a single machine to avoid costly record
> reconstruction?

Re: Drill native format

Posted by Tomer Shiran <ts...@maprtech.com>.

Is there any public information suggesting that Google moved away from
supporting nested data? Clearly BigQuery doesn't yet allow nested data, but
not sure that applies to Dremel.

There are challenges with one file per column. How do you ensure that a
single record is located on a single machine to avoid costly record
reconstruction?

-- 
Tomer Shiran
Director of Product Management | MapR Technologies | 650-804-8657
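
The record reconstruction concern can be illustrated with a small Python
sketch. The names reconstruct_record and remote_columns, and the mapping from
column files to nodes, are hypothetical and exist only for this example.

```python
def reconstruct_record(row, column_files):
    """Rebuild one record by taking the same row position from every column."""
    return {name: values[row] for name, values in column_files.items()}


def remote_columns(column_nodes, local_node):
    """List the columns whose files live on a different node than the reader."""
    return sorted(name for name, node in column_nodes.items()
                  if node != local_node)


column_files = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
record = reconstruct_record(1, column_files)

# If the "name" file landed on another node, rebuilding any record on n1
# needs a network read for that column.
needs_network = remote_columns({"id": "n1", "name": "n2"}, "n1")
```

With one file per column, nothing in the file layout itself forces the two
files onto the same machine, which is why placement has to be handled by the
storage layer.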

Re: Drill native format

Posted by Azuryy Yu <az...@gmail.com>.

After reading the paper, I think PAX is really good for Drill storage.

One of the benefits is that it scans only the queried columns and ignores
the others. In fact, Dremel does not scan the full table either; it skips
the many columns that are not used in a query.

