Posted to dev@iceberg.apache.org by Weston Pace <we...@gmail.com> on 2021/05/18 03:49:19 UTC

Usage of parquet field_id

Hello Iceberg devs,

I'm Weston.  I've been working on the Arrow project lately, and I am
reviewing how we handle the Parquet field_id (and also adding support
for specifying a field_id at write time)[1][2].  This has brought up
two questions.

 1. The original PR adding field_id support[3][4] not only allowed the
field_id to pass through from Parquet to Arrow but also generated IDs
(in a depth-first fashion) for fields that did not have a field_id.
In retrospect, this auto-generation of field_id was probably not a
good idea.  Would it have any impact on Iceberg if we removed it?
Just to be clear, we will still support reading (and now writing) the
Parquet field_id; I am only talking about removing the auto-generation
of missing values.
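
For concreteness, here is a minimal pyarrow sketch of the write-time
support (the file path and IDs are made up, and I'm assuming the
`PARQUET:field_id` field-metadata key that the new write path reads):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Attach explicit field_ids via field-level metadata.
    schema = pa.schema([
        pa.field("a", pa.int32(), metadata={b"PARQUET:field_id": b"1"}),
        pa.field("b", pa.string(), metadata={b"PARQUET:field_id": b"2"}),
    ])
    pq.write_table(pa.table({"a": [1], "b": ["x"]}, schema=schema),
                   "/tmp/example.parquet")

    # On read, the ids should come back as field metadata; a field that
    # never had an id would simply have none (no auto-generation).
    print(pq.read_table("/tmp/example.parquet").schema.field("a").metadata)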

 2. For the second question, I'm looking for the Iceberg community's
opinion as users of Arrow.  Arrow is adding more support for
computation on data (e.g. relational operators), and I've been
wondering how those transformations should affect metadata (like the
field_id).  For some examples (a small probe sketch follows the list):

 * Filtering a table by column (it seems the field_id/metadata should
remain unchanged)
 * Filtering a table by rows (it seems the field_id/metadata should
remain unchanged)
 * Filling in null values with a placeholder value (the data is changed so ???)
 * Casting a field to a different data type (the meaning of the data
has changed so ???)
 * Combining two fields into a third field (it seems the
field_id/metadata should be erased in the third field but presumably
it could also be the joined metadata from the two origin fields)
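
As a probe for the first two cases, I'd expect something like this
(hypothetical data; again assuming ids ride along as Arrow field
metadata) to be the question in code form:

    import pyarrow as pa
    import pyarrow.compute as pc

    schema = pa.schema([pa.field("a", pa.int64(),
                                 metadata={b"PARQUET:field_id": b"1"})])
    table = pa.table({"a": [1, 2, 3]}, schema=schema)

    # Filtering by rows: should the field-level metadata survive?
    filtered = table.filter(pc.greater(table["a"], 1))
    print(filtered.schema.field("a").metadata)

    # Casting to a new type changes the data's meaning: same question.
    cast = table.cast(pa.schema([pa.field("a", pa.float64())]))
    print(cast.schema.field("a").metadata)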

Thanks for your time,

-Weston Pace

[1] https://issues.apache.org/jira/browse/PARQUET-1798
[2] https://github.com/apache/arrow/pull/10289
[3] https://issues.apache.org/jira/browse/ARROW-7080
[4] https://github.com/apache/arrow/pull/6408

Re: Usage of parquet field_id

Posted by Ted Gooch <tg...@netflix.com.INVALID>.
Hi Weston,

I can also give you more background on how we use this for reads (and
plan to use it for writes), specifically in relation to Arrow in the
Iceberg Python client lib.

On Thu, May 20, 2021 at 7:23 AM Weston Pace <we...@gmail.com> wrote:


Re: Usage of parquet field_id

Posted by Weston Pace <we...@gmail.com>.
> #1 is a problem and we should remove the auto-generation.

Sounds like we are aligned.

> I hope that helps.

Thanks for the extra details.  I've learned a lot and it helps to know
how this is used.

On Tue, May 18, 2021 at 2:20 PM Ryan Blue <bl...@apache.org> wrote:

Re: Usage of parquet field_id

Posted by Ryan Blue <bl...@apache.org>.
Hi Weston,

#1 is a problem and we should remove the auto-generation. The issue is that
auto-generating an ID can result in a collision between Iceberg's field IDs
and the generated IDs. Since Iceberg uses the ID to identify a field, that
would result in unrelated data being mistaken for a column's data.

Your description above for #2 is a bit confusing for me. Field IDs are used
to track fields across renames and other schema changes. Those schema
changes don't happen in a single file. A file is written with some schema
(which includes IDs) and later field resolution happens based on ID. I
might have a table with fields `1: a int, 2: b string` that is later
evolved to `1: x long, 3: b string`. Any given data file is written with
only one version of the schema. From the IDs, you can see that field 1 was
renamed and promoted to long, field 2 was deleted, and field 3 was added
with field 2's original name.

This ID-based approach is an alternative to name-based resolution (like
Avro uses) or position-based resolution (like CSV uses). Both of those
resolution methods are flawed and result in correctness issues:
1. Name-based resolution can't drop a column and add a new one with the
same name
2. Position-based resolution can't drop a column in the middle of the schema

Only ID-based resolution gives you the expected SQL behavior for table
evolution (ADD/DROP/RENAME COLUMN).
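
To make that concrete, here is a rough sketch of ID-based resolution
(the helpers are made up for illustration, not Iceberg's actual code,
and it assumes ids are carried as Arrow field metadata under the
`PARQUET:field_id` key):

    import pyarrow as pa

    def field_id(field):
        # Assumes the id rode along as Arrow field metadata.
        return int(field.metadata[b"PARQUET:field_id"])

    def resolve_by_id(file_schema, table_schema):
        # Map each current table column to the file column holding its
        # data, regardless of the name it had when the file was written.
        names_by_id = {field_id(f): f.name for f in file_schema}
        return {f.name: names_by_id.get(field_id(f)) for f in table_schema}

    # The example above: a file written as `1: a int, 2: b string`,
    # read with the evolved schema `1: x long, 3: b string`.
    file_schema = pa.schema([
        pa.field("a", pa.int32(), metadata={b"PARQUET:field_id": b"1"}),
        pa.field("b", pa.string(), metadata={b"PARQUET:field_id": b"2"}),
    ])
    table_schema = pa.schema([
        pa.field("x", pa.int64(), metadata={b"PARQUET:field_id": b"1"}),
        pa.field("b", pa.string(), metadata={b"PARQUET:field_id": b"3"}),
    ])
    print(resolve_by_id(file_schema, table_schema))
    # -> {'x': 'a', 'b': None}: field 3 is new, so 'b' gets nulls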

For your original questions:

* Filtering a table is a matter of selecting columns by ID and running
filters by ID. In Iceberg, we bind the current names in a SQL table to
the field IDs to do this.
* Filling in null values is done by identifying that a column ID is missing
in a data file. Null values are used in place (see the sketch after this
list).
* Casting or promoting data is done by strict rules in Iceberg. This is
affected by ID because we know that a field is the same across files, like
in my example above.
* For combining fields, it sounds like you're thinking about operations on
the data and when to carry IDs through an operation. I wouldn't recommend
ever carrying IDs through. In Spark, we use the current schema's names to
produce rows. SQL always uses the current names. And when we write back out
to a table, we use SQL semantics, which are to align by position.
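
The null-filling step sketched (again assuming ids in Arrow field
metadata; `column_or_nulls` is a made-up helper, not Iceberg's API):

    import pyarrow as pa

    def column_or_nulls(file_table, file_column_name, table_field):
        # A table column whose ID is missing from the data file is
        # filled with nulls; otherwise the file column is cast to the
        # table's current type (e.g. int -> long promotion).
        if file_column_name is None:
            return pa.nulls(file_table.num_rows, type=table_field.type)
        return file_table[file_column_name].cast(table_field.type)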

I hope that helps. If it's not clear, I'm happy to jump on a call to talk
through it with you.

Ryan

On Tue, May 18, 2021 at 1:48 PM Weston Pace <we...@gmail.com> wrote:

-- 
Ryan Blue

Re: Usage of parquet field_id

Posted by Weston Pace <we...@gmail.com>.
Ok, this matches my understanding of how field_id is used as well.
I believe #1 will not be an issue, because I think Iceberg always sets
the field_id property when writing data?  If that is the case, then
Iceberg would never have noticed the old behavior.  In other words,
Iceberg never relied on Arrow to set the field_id.

For #2 I think your example is helpful.  The `field_id` is sort of a
file-specific concept.  Once you are at the dataset layer the Iceberg
schema takes precedence and the field_id is no longer necessary.

Also, thinking about it more generally, metadata is really part of the
schema / control channel.  The compute operations in Arrow are more
involved with the data channel.  "Combining metadata" might be a
concern of tools that "combine schema" (e.g. dataset evolution) but
isn't a concern of tools that combine data (e.g. Arrow compute).  So
in that sense the compute operations probably don't need to worry much
about preserving schema.

It has been helpful to hear how this is used.  I needed a concrete
example to bounce the idea around in my head.

Thanks,

-Weston

On Tue, May 18, 2021 at 5:48 AM Daniel Weeks <dw...@apache.org> wrote:

Re: Usage of parquet field_id

Posted by Daniel Weeks <dw...@apache.org>.
Hey Weston,

From Iceberg's perspective, the field_id is necessary to track the
evolution of the schema over time.  It's best to think of the problem
from a dataset perspective as opposed to a file perspective.

Iceberg maintains the mapping of the schema with respect to the field
ids because, as the files in the dataset change, the field names may
change; the field id is intended to be persistent and referenceable
regardless of name or position within the file.

For #1 above, I'm not sure I understand the issue of having the field ids
auto-generated.  If you're not using the field ids to reference the
columns, does it matter if they are present or not?

For #2, I would speculate that the field id is less relevant after the
initial projection and filtering (it really depends on how the engine wants
to track fields at that point, so I would suspect that maybe field id
wouldn't be ideal especially after various transforms or aggregations are
applied).  However, it does matter when persisting the data as the field
ids need to be resolved to the target dataset.  If it's a new dataset, new
field ids can be generated using the original approach.  However, if the
data is being appended to an existing dataset, the field ids need to be
resolved against that target dataset and rewritten before persisting to
parquet so they align with the Iceberg schema (in SQL this is done
positionally).
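
A hypothetical sketch of that rewrite step (assuming ids travel as
Arrow field metadata under `PARQUET:field_id`; `with_target_ids` is a
made-up helper, not an Iceberg API):

    import pyarrow as pa

    def with_target_ids(schema, ids_by_name):
        # Re-stamp each field with the id the target dataset assigns to
        # that column, so the written Parquet file lines up with the
        # Iceberg schema on read.
        fields = [
            f.with_metadata(
                {b"PARQUET:field_id": str(ids_by_name[f.name]).encode()})
            for f in schema
        ]
        return pa.schema(fields)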

Let me know if any of that doesn't make sense.  I'm still a little unclear
on the issue in #1, so it would be helpful if you could clarify that for me.

Thanks,
Dan

On Mon, May 17, 2021 at 8:50 PM Weston Pace <we...@gmail.com> wrote:
