Posted to dev@iceberg.apache.org by "Kruger, Scott" <sc...@paypal.com.INVALID> on 2020/11/03 19:51:25 UTC

Re: Migrating plain parquet tables to iceberg

Awesome, this is working for us, although we had to modify our code to also use the NameMapping when grabbing parquet file metrics. Thanks!
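
For anyone who hits the same thing, the change was roughly this; a minimal sketch (the input file, path, and variable names are illustrative, not our exact code):

// Load the table's name mapping and pass it to ParquetUtil.fileMetrics
// so column stats are keyed by field ID even though the files lack IDs.
NameMapping mapping = NameMappingParser.fromJson(
    table.properties().get("schema.name-mapping.default"));
InputFile inputFile = HadoopInputFile.fromPath(path, conf);
Metrics metrics = ParquetUtil.fileMetrics(
    inputFile, MetricsConfig.forTable(table), mapping);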

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <de...@iceberg.apache.org>, "rblue@netflix.com" <rb...@netflix.com>
Date: Friday, October 30, 2020 at 5:55 PM
To: "sckruger@paypal.com.invalid" <sc...@paypal.com.invalid>
Cc: "dev@iceberg.apache.org" <de...@iceberg.apache.org>
Subject: Re: Migrating plain parquet tables to iceberg

For existing tables that use name-based column resolution, you can add a name-to-id mapping that is applied when reading files with no field IDs. There is a utility to generate the name mapping from an existing schema (using the current names) and then you just need to store that in a table property.

import org.apache.iceberg.mapping.MappingUtil;
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;

NameMapping mapping = MappingUtil.create(table.schema());

table.updateProperties()
    .set("schema.name-mapping.default", NameMappingParser.toJson(mapping))
    .commit();

I think there is also an issue to add a name mapping by default when importing data.
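
If you're importing with the Spark utility, that path is roughly the following; a minimal sketch, with the source identifier and staging directory illustrative:

// Import existing Parquet partitions into the Iceberg table in place,
// without rewriting the data files.
SparkTableUtil.importSparkTable(
    spark,                       // an existing SparkSession
    org.apache.spark.sql.catalyst.TableIdentifier.apply("source_table"),
    table,                       // the target Iceberg table
    "/tmp/iceberg-staging");     // illustrative staging location for metadata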

On Fri, Oct 30, 2020 at 3:46 PM Kruger, Scott <sc...@paypal.com.invalid> wrote:
I’m looking to migrate a partitioned parquet table to use iceberg. The issue I’ve run into is that the column order for the data varies wildly, which isn’t a problem for us normally (we just set mergeSchemas=true when reading), but presents a problem with iceberg because the iceberg.schema field isn’t set in the parquet footer. Is there any way to migrate this data over without rewriting the entire dataset?


--
Ryan Blue
Software Engineer
Netflix

Re: Migrating plain parquet tables to iceberg

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
We should probably fix that so that the mapping from the table is used by
default when importing files to a table.

I agree that the regular write path in Spark should handle stats as
expected. That's what we use all the time. I'd recommend trying to move to
it when you can. We're planning on releasing new commands to make working
with data a lot easier!
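
For comparison, the regular write path is just the Iceberg Spark source; a minimal sketch, with the table identifier illustrative:

// Writing through the Iceberg source produces Parquet files with field IDs
// and collects column stats automatically on commit.
df.write()
    .format("iceberg")
    .mode("append")
    .save("db.table");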

On Wed, Nov 4, 2020 at 6:57 AM Kruger, Scott <sc...@paypal.com> wrote:

> Whoops, forgot to CC mailing list.
>
>
>
> Ah, the metrics code *does* allow you to use a name mapping if you
> specify one in the call to ParquetUtil.fileMetrics, which is what we did.
> If you don’t, though, the mapping property from the table (if present)
> doesn’t appear to be used automatically.
>
>
>
> To be clear, for semi-complicated reasons that probably don’t bear going
> into (unless you really want to know), we aren’t writing iceberg data
> directly (i.e. using DataFrameWriter.format("iceberg")), but rather writing
> plain parquet data and then adding it to the iceberg table post hoc. So I
> don’t think there’s a problem with the regular iceberg write path via spark.
>
>
>
>
>
> From: Ryan Blue <rb...@netflix.com>
> Reply-To: "rblue@netflix.com" <rb...@netflix.com>
> Date: Tuesday, November 3, 2020 at 3:00 PM
> To: "Kruger, Scott" <sc...@paypal.com>
> Cc: "dev@iceberg.apache.org" <de...@iceberg.apache.org>
> Subject: Re: Migrating plain parquet tables to iceberg
>
>
>
> I thought that we had already updated the metrics code to use a name
> mapping. Sorry I was mistaken. Could you post a PR with your fix?
>
>
>
> Glad it's working!
>
>
>
> On Tue, Nov 3, 2020 at 11:51 AM Kruger, Scott <sc...@paypal.com> wrote:
>
> Awesome, this is working for us, although we had to modify our code to
> also use the NameMapping when grabbing parquet file metrics. Thanks!
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> Reply-To: "dev@iceberg.apache.org" <de...@iceberg.apache.org>, "rblue@netflix.com" <rb...@netflix.com>
> Date: Friday, October 30, 2020 at 5:55 PM
> To: "sckruger@paypal.com.invalid" <sc...@paypal.com.invalid>
> Cc: "dev@iceberg.apache.org" <de...@iceberg.apache.org>
> Subject: Re: Migrating plain parquet tables to iceberg
>
>
>
> For existing tables that use name-based column resolution, you can add a
> name-to-id mapping that is applied when reading files with no field IDs.
> There is a utility to generate the name mapping from an existing schema
> (using the current names) and then you just need to store that in a table
> property.
>
> NameMapping mapping = MappingUtil.create(table.schema());
>
> table.updateProperties()
>     .set("schema.name-mapping.default", NameMappingParser.toJson(mapping))
>     .commit();
>
> I think there is also an issue to add a name mapping by default when
> importing data.
>
>
>
> On Fri, Oct 30, 2020 at 3:46 PM Kruger, Scott <sc...@paypal.com.invalid>
> wrote:
>
> I’m looking to migrate a partitioned parquet table to use iceberg. The
> issue I’ve run into is that the column order for the data varies wildly,
> which isn’t a problem for us normally (we just set mergeSchemas=true when
> reading), but presents a problem with iceberg because the iceberg.schema
> field isn’t set in the parquet footer. Is there any way to migrate this
> data over without rewriting the entire dataset?
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Migrating plain parquet tables to iceberg

Posted by "Kruger, Scott" <sc...@paypal.com.INVALID>.
Whoops, forgot to CC mailing list.

Ah, the metrics code _does_ allow you to use a name mapping if you specify one in the call to ParquetUtil.fileMetrics, which is what we did. If you don’t, though, the mapping property from the table (if present) doesn’t appear to be used automatically.

To be clear, for semi-complicated reasons that probably don’t bear going into (unless you really want to know), we aren’t writing iceberg data directly (i.e. using DataFrameWriter.format("iceberg")), but rather writing plain parquet data and then adding it to the iceberg table post hoc. So I don’t think there’s a problem with the regular iceberg write path via spark.
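
Concretely, the post-hoc add looks roughly like this; a sketch that assumes the file size and metrics are already in hand, with the path and partition value illustrative:

// Register an already-written Parquet file with the Iceberg table.
DataFile dataFile = DataFiles.builder(table.spec())
    .withPath("s3://bucket/warehouse/data/file.parquet")
    .withFormat(FileFormat.PARQUET)
    .withFileSizeInBytes(sizeInBytes)
    .withMetrics(metrics)                 // from ParquetUtil.fileMetrics with the name mapping
    .withPartitionPath("dt=2020-11-03")   // required for partitioned specs
    .build();

table.newAppend()
    .appendFile(dataFile)
    .commit();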


From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Tuesday, November 3, 2020 at 3:00 PM
To: "Kruger, Scott" <sc...@paypal.com>
Cc: "dev@iceberg.apache.org" <de...@iceberg.apache.org>
Subject: Re: Migrating plain parquet tables to iceberg

I thought that we had already updated the metrics code to use a name mapping. Sorry I was mistaken. Could you post a PR with your fix?

Glad it's working!

On Tue, Nov 3, 2020 at 11:51 AM Kruger, Scott <sc...@paypal.com> wrote:
Awesome, this is working for us, although we had to modify our code to also use the NameMapping when grabbing parquet file metrics. Thanks!

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org<ma...@iceberg.apache.org>" <de...@iceberg.apache.org>>, "rblue@netflix.com<ma...@netflix.com>" <rb...@netflix.com>>
Date: Friday, October 30, 2020 at 5:55 PM
To: "sckruger@paypal.com.invalid" <sc...@paypal.com.invalid>
Cc: "dev@iceberg.apache.org<ma...@iceberg.apache.org>" <de...@iceberg.apache.org>>
Subject: Re: Migrating plain parquet tables to iceberg

For existing tables that use name-based column resolution, you can add a name-to-id mapping that is applied when reading files with no field IDs. There is a utility to generate the name mapping from an existing schema (using the current names) and then you just need to store that in a table property.

NameMapping mapping = MappingUtil.create(table.schema());

table.updateProperties()
    .set("schema.name-mapping.default", NameMappingParser.toJson(mapping))
    .commit();

I think there is also an issue to add a name mapping by default when importing data.

On Fri, Oct 30, 2020 at 3:46 PM Kruger, Scott <sc...@paypal.com.invalid> wrote:
I’m looking to migrate a partitioned parquet table to use iceberg. The issue I’ve run into is that the column order for the data varies wildly, which isn’t a problem for us normally (we just set mergeSchemas=true when reading), but presents a problem with iceberg because the iceberg.schema field isn’t set in the parquet footer. Is there any way to migrate this data over without rewriting the entire dataset?


--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix

Re: Migrating plain parquet tables to iceberg

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I thought that we had already updated the metrics code to use a name
mapping. Sorry I was mistaken. Could you post a PR with your fix?

Glad it's working!

On Tue, Nov 3, 2020 at 11:51 AM Kruger, Scott <sc...@paypal.com> wrote:

> Awesome, this is working for us, although we had to modify our code to
> also use the NameMapping when grabbing parquet file metrics. Thanks!
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> Reply-To: "dev@iceberg.apache.org" <de...@iceberg.apache.org>, "rblue@netflix.com" <rb...@netflix.com>
> Date: Friday, October 30, 2020 at 5:55 PM
> To: "sckruger@paypal.com.invalid" <sc...@paypal.com.invalid>
> Cc: "dev@iceberg.apache.org" <de...@iceberg.apache.org>
> Subject: Re: Migrating plain parquet tables to iceberg
>
>
>
> For existing tables that use name-based column resolution, you can add a
> name-to-id mapping that is applied when reading files with no field IDs.
> There is a utility to generate the name mapping from an existing schema
> (using the current names) and then you just need to store that in a table
> property.
>
> NameMapping mapping = MappingUtil.create(table.schema());
>
> table.updateProperties()
>     .set("schema.name-mapping.default", NameMappingParser.toJson(mapping))
>     .commit();
>
> I think there is also an issue to add a name mapping by default when
> importing data.
>
>
>
> On Fri, Oct 30, 2020 at 3:46 PM Kruger, Scott <sc...@paypal.com.invalid>
> wrote:
>
> I’m looking to migrate a partitioned parquet table to use iceberg. The
> issue I’ve run into is that the column order for the data varies wildly,
> which isn’t a problem for us normally (we just set mergeSchemas=true when
> reading), but presents a problem with iceberg because the iceberg.schema
> field isn’t set in the parquet footer. Is there any way to migrate this
> data over without rewriting the entire dataset?
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix