Posted to dev@parquet.apache.org by Julien Le Dem <ju...@dremio.com> on 2017/04/12 16:51:03 UTC
Parquet sync up in 10 min
10am PT today on google hangout:
https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
--
Julien
Re: Parquet sync up in 10 min
Posted by Julien Le Dem <ju...@dremio.com>.
Thank you!
On Fri, Apr 14, 2017 at 4:19 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:
> Thanks for the reminder! I've updated the PARQUET-686 PR so it is ready for
> comments. Thanks, everyone!
> --
> Ryan Blue
> Software Engineer
> Netflix
>
--
Julien
Re: Parquet sync up in 10 min
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for the reminder! I've updated the PARQUET-686 PR so it is ready for
comments. Thanks, everyone!
On Fri, Apr 14, 2017 at 3:25 PM, Julien Le Dem <ju...@dremio.com> wrote:
> Reminder:
> give feedback in:
> - https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#
> - https://github.com/apache/parquet-format/pull/51
> <https://github.com/apache/parquet-format/pull/51/files>
> - (once updated by Ryan) https://github.com/apache/parquet-format/pull/46
> --
> Julien
>
--
Ryan Blue
Software Engineer
Netflix
Re: Parquet sync up in 10 min
Posted by Julien Le Dem <ju...@dremio.com>.
Reminder:
give feedback in:
- https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#
- https://github.com/apache/parquet-format/pull/51
<https://github.com/apache/parquet-format/pull/51/files>
- (once updated by Ryan) https://github.com/apache/parquet-format/pull/46
On Wed, Apr 12, 2017 at 11:22 AM, Julien Le Dem <ju...@dremio.com> wrote:
> Notes from the sync (Full room today!)
>
> Zoltan (Cloudera, Parquet)
> Cheng (Databricks, Parquet - Spark integration): Index discussion
> Ryan (Netflix): Order changes, Logical type - Timestamp
> Deepak (Vertica - Parquet): Timestamp, indexes
> Greg (Cloudera): Timestamp
> Lars (Cloudera, Impala): Min/Max #46, feedback on indices
> Marcel (Cloudera, Impala): Min/Max #46, Index pages
> QinHui (Criteo): Migration project from JSON to Parquet using Protobuf.
> Problem related to this.
> Srinath (Databricks): Indexing
> Julien (Dremio): Min/Max, Index discussion
>
> Min/max: https://github.com/apache/parquet-format/pull/46
> - Discussed the forward compatibility requirement that ColumnOrder act as the
> gatekeeper for interpreting the min_value and max_value fields
> - having the signed field is redundant and unnecessary
> - Action: Ryan to update the PR for final review this week (everyone).
>
> Index: https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#
> - 2 types of lookup structures.
> - SortColumnIndex: index of values on sorted columns. (just boundary
> values) (only for main sorting column)
> - (name should be changed as it applies even if the column is not
> sorted)
> - OffsetIndex: locate data pages by row number.
> SortColumnIndex is used to narrow down the pages to apply a filter on.
> OffsetIndex is used to find the selected rows in the other columns (projected
> but not filtered on)
> - Lars and Marcel to make sure the doc is linked in the JIRA and the JIRA
> referred to in the title.
> - Action for everyone: Provide feedback before April 19.
> - After that create a PR in parquet-format (labelled experimental spec
> until a reference implementation is finalized).
>
> Timestamp: https://github.com/apache/parquet-format/pull/51
> <https://github.com/apache/parquet-format/pull/51/files>
> - PR #51 replaces the current LogicalType enum with a better,
> forward-compatible, union-based definition.
> - Action for everyone: Provide Feedback before April 19
>
> Protobuf:
> - QinHui to propose JIRA/PR for saving field ids in schema for protobufs.
> - capture unknown fields for which we only know the ID
> --
> Julien
>
--
Julien
Re: Parquet sync up in 10 min
Posted by Julien Le Dem <ju...@dremio.com>.
Notes from the sync (Full room today!)
Zoltan (Cloudera, Parquet)
Cheng (Databricks, Parquet - Spark integration): Index discussion
Ryan (Netflix): Order changes, Logical type - Timestamp
Deepak (Vertica - Parquet): Timestamp, indexes
Greg (Cloudera): Timestamp
Lars (Cloudera, Impala): Min/Max #46, feedback on indices
Marcel (Cloudera, Impala): Min/Max #46, Index pages
QinHui (Criteo): Migration project from JSON to Parquet using Protobuf.
Problem related to this.
Srinath (Databricks): Indexing
Julien (Dremio): Min/Max, Index discussion
Min/max: https://github.com/apache/parquet-format/pull/46
- Discussed the forward compatibility requirement that ColumnOrder act as the
gatekeeper for interpreting the min_value and max_value fields
- having the signed field is redundant and unnecessary
- Action: Ryan to update the PR for final review this week (everyone).
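To illustrate the gatekeeper idea above, here is a minimal sketch of how a reader might treat min_value and max_value. It is not taken from the PR: the `Statistics` class, the `usable_bounds` helper, and the ordering name "TYPE_DEFINED_ORDER" are assumptions made for illustration. The point is only that a reader ignores the bounds whenever the column's ordering is missing or not one it understands.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical: the orderings this particular reader knows how to interpret.
KNOWN_COLUMN_ORDERS = {"TYPE_DEFINED_ORDER"}


@dataclass
class Statistics:
    # Illustrative stand-in for the statistics attached to a column chunk.
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None


def usable_bounds(stats: Statistics,
                  column_order: Optional[str]) -> Optional[Tuple[bytes, bytes]]:
    """Return (min_value, max_value) only when the ordering is understood."""
    if column_order not in KNOWN_COLUMN_ORDERS:
        # Missing or unrecognized ordering (for example one written by a
        # newer writer): the bounds cannot be interpreted safely, so skip them.
        return None
    if stats.min_value is None or stats.max_value is None:
        return None
    return stats.min_value, stats.max_value
```

With this shape, an older reader that meets an ordering it does not know simply falls back to reading the data without statistics instead of misinterpreting the bounds.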
Index:
https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#
- 2 types of lookup structures.
- SortColumnIndex: index of values on sorted columns. (just boundary
values) (only for main sorting column)
- (name should be changed as it applies even if the column is not
sorted)
- OffsetIndex: locate data pages by row number.
SortColumnIndex is used to narrow down the pages to apply a filter on.
OffsetIndex is used to find the selected rows in the other columns (projected
but not filtered on)
- Lars and Marcel to make sure the doc is linked in the JIRA and the JIRA
referred to in the title.
- Action for everyone: Provide feedback before April 19.
- After that create a PR in parquet-format (labelled experimental spec
until a reference implementation is finalized).
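A rough sketch of how the two lookup structures described above could work together. The layout is an assumption, not the one in the doc: the boundary values are modelled as one (min, max) pair per page, and the offset index as a sorted list holding the first row number of each page. All function names are hypothetical.

```python
from bisect import bisect_right


def pages_matching(boundaries, value):
    """Pages of the filtered column whose [min, max] range may contain value."""
    return [i for i, (lo, hi) in enumerate(boundaries) if lo <= value <= hi]


def row_range(first_rows, total_rows, page):
    """Half-open row range [start, end) covered by one page."""
    start = first_rows[page]
    end = first_rows[page + 1] if page + 1 < len(first_rows) else total_rows
    return start, end


def pages_for_rows(first_rows, start, end):
    """Pages of another column that overlap rows [start, end)."""
    first = bisect_right(first_rows, start) - 1
    last = bisect_right(first_rows, end - 1) - 1
    return list(range(first, last + 1))


# Filtered column: three pages of 100 rows each, with per-page boundaries.
boundaries = [(1, 10), (11, 20), (21, 30)]
filtered_first_rows = [0, 100, 200]

# Projected column: different page layout over the same 300 rows.
projected_first_rows = [0, 150, 250]

hits = pages_matching(boundaries, 15)                   # [1]
start, end = row_range(filtered_first_rows, 300, hits[0])  # (100, 200)
needed = pages_for_rows(projected_first_rows, start, end)  # [0, 1]
```

The boundary values narrow the filter to one page of the filtered column, and the row numbers then identify which pages of the projected column have to be read, even though the two columns are paged differently.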
Timestamp: https://github.com/apache/parquet-format/pull/51
<https://github.com/apache/parquet-format/pull/51/files>
- PR #51 replaces the current LogicalType enum with a better,
forward-compatible, union-based definition.
- Action for everyone: Provide Feedback before April 19
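A small sketch of why a union-based definition can be forward compatible. This is not the definition in PR #51: the decoded union is modelled as a plain dict from member name to parameters, on the assumption that the decoder skips members it does not recognize, and the member names and the "unit" parameter are made up for illustration.

```python
# Hypothetical: the union members this reader was built to understand.
KNOWN_MEMBERS = {"STRING", "TIMESTAMP"}


def interpret_logical_type(decoded_union):
    """Return (member_name, params) for a known member, else None.

    decoded_union maps the member that was set to its parameters. A member
    added by a newer writer is either absent after decoding or has a name
    this reader does not know; both cases yield None, so the reader can
    fall back to the physical type instead of failing.
    """
    for name, params in decoded_union.items():
        if name in KNOWN_MEMBERS:
            return name, params
    return None
```

Unlike a bare enum value, each union member can also carry its own parameters, which is what the `params` dict stands in for here.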
Protobuf:
- QinHui to propose JIRA/PR for saving field ids in schema for protobufs.
- capture unknown fields for which we only know the ID
On Wed, Apr 12, 2017 at 9:57 AM, Julien Le Dem <ju...@dremio.com> wrote:
> Marcel and Lars' doc:
> https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#heading=h.ft5dh2chrcjb
> --
> Julien
>
--
Julien
Re: Parquet sync up in 10 min
Posted by Julien Le Dem <ju...@dremio.com>.
Marcel and Lars' doc:
https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#heading=h.ft5dh2chrcjb
On Wed, Apr 12, 2017 at 9:51 AM, Julien Le Dem <ju...@dremio.com> wrote:
> 10am PT today on google hangout:
> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>
--
Julien