Posted to dev@iceberg.apache.org by 李响 <wa...@gmail.com> on 2020/11/05 16:07:05 UTC

About importing Hive tables and name mapping

Dear community:

I am using SparkTableUtil to import an existing Hive table into an Iceberg
table.
The ORC files of the Hive table were written with an old version of ORC, so I
set a name mapping (for example, id 1 mapped to _col0, id 2 mapped to
_col1, ...) on the Iceberg table using "schema.name-mapping.default" so that
the metrics of the ORC files can be built correctly during the import.
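
For illustration, here is a minimal sketch of how such a mapping could be
serialized. It assumes the name-mapping JSON layout from the Iceberg spec (a
list of objects with "field-id" and "names"); the helper name
build_name_mapping and the three-column table are hypothetical, and the
resulting dictionary only stands in for however the table property is
actually set.

```python
import json


def build_name_mapping(column_names):
    """Return name-mapping JSON that assigns field ids 1..n to the given names."""
    mapping = [
        {"field-id": field_id, "names": [name]}
        for field_id, name in enumerate(column_names, start=1)
    ]
    return json.dumps(mapping)


# Hypothetical three-column table whose old ORC files use positional names.
mapping_json = build_name_mapping(["_col0", "_col1", "_col2"])

# The JSON string is the value of the table property named in the message.
table_properties = {"schema.name-mapping.default": mapping_json}
print(table_properties["schema.name-mapping.default"])
```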

After that, I plan to write new data into the Iceberg table (using ORC
version 1.6.5, as bundled in the Iceberg package). What should I do with the
name mapping that was used for the import? Should I remove it? Does the name
mapping do any harm when reading from or writing to the new ORC files?

I am not sure whether we need a per-data-file name mapping setting here, in
addition to the default name mapping at the table level.



-- 

                                               李响 Xiang Li


E-mail: waterlx@gmail.com

Re: About importing Hive tables and name mapping

Posted by 李响 <wa...@gmail.com>.

Edgar, Ryan,

Got that. Thanks very much for your reply!

On Fri, Nov 6, 2020 at 12:36 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Edgar is correct. Name mapping is used if a data file has no field ids.
> When you import data with a name mapping, you should leave it configured on
> the table so that you can read the data files that you imported.
>
> There's no need for a different mapping because we assume that the files
> you add to the table all use a consistent naming scheme. You can add more
> than one alias to a mapping if you need to handle a rename, but most of the
> time names don't change and are consistent across files if you have been
> reading the files as a table already using name-based column resolution.
>
> On Thu, Nov 5, 2020 at 8:21 AM Edgar Rodriguez
> <ed...@airbnb.com.invalid> wrote:
>
>> Hi Xiang,
>>
>> On Thu, Nov 5, 2020 at 11:07 AM 李响 <wa...@gmail.com> wrote:
>>
>>> Dear community:
>>>
>>> I am using SparkTableUtil to import an existing Hive table to an Iceberg
>>> table.
>>> The ORC files of Hive table is an old version of ORC, so I set a name
>>> mapping (like: id 1 mapped to _col0 and id 2 mapped to _col1...) to the
>>> Iceberg table by using "schema.name-mapping.default" so that the metrics of
>>> ORC files could be built correctly during the import process.
>>>
>>> After that, I plan to write new data into the Iceberg table (using the
>>> ORC version 1.6.5 in the iceberg package), how could I deal with that name
>>> mapping used for importing ? Should I remove that? Does that name mapping
>>> do any harm when reading/writing from/to the new ORC file?
>>>
>>
>> If I understand correctly the name-mapping would only apply if there were
>> no Iceberg IDs found in the ORC file as type attributes, which is the case
>> for the imported data. All new data you write with Iceberg/ORC will have
>> the Iceberg field-id stored as a type attribute, so when reading those new
>> files the name-mapping should have no effect since the read path will
>> detect the Iceberg field-ids.
>>
>> Cheers,
>> --
>> Edgar R
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 

                                               李响 Xiang Li

Cellphone: +86-136-8113-8972
E-mail: waterlx@gmail.com

Re: About importing Hive tables and name mapping

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Edgar is correct. Name mapping is used if a data file has no field ids.
When you import data with a name mapping, you should leave it configured on
the table so that you can read the data files that you imported.

There's no need for a different mapping because we assume that the files
you add to the table all use a consistent naming scheme. You can add more
than one alias to a mapping if you need to handle a rename, but most of the
time names don't change and are consistent across files if you have been
reading the files as a table already using name-based column resolution.
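
As a rough sketch of the alias idea, a single field id can carry several
names, so a file written before a rename and a file written after it both
resolve to the same field. The JSON layout ("field-id", "names") follows the
Iceberg name-mapping format; the helper build_name_index and the column
names are hypothetical, and this is not Iceberg's own implementation.

```python
def build_name_index(mapping):
    """Map every name or alias in the mapping to its field id."""
    index = {}
    for entry in mapping:
        for name in entry["names"]:
            index[name] = entry["field-id"]
    return index


# Hypothetical mapping: field 1 was called "_col0" in old files and "id" later.
mapping = [
    {"field-id": 1, "names": ["_col0", "id"]},
    {"field-id": 2, "names": ["_col1"]},
]

index = build_name_index(mapping)

# Both the old name and the alias resolve to the same field id.
print(index["_col0"], index["id"])
```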

On Thu, Nov 5, 2020 at 8:21 AM Edgar Rodriguez
<ed...@airbnb.com.invalid> wrote:

> Hi Xiang,
>
> On Thu, Nov 5, 2020 at 11:07 AM 李响 <wa...@gmail.com> wrote:
>
>> Dear community:
>>
>> I am using SparkTableUtil to import an existing Hive table to an Iceberg
>> table.
>> The ORC files of Hive table is an old version of ORC, so I set a name
>> mapping (like: id 1 mapped to _col0 and id 2 mapped to _col1...) to the
>> Iceberg table by using "schema.name-mapping.default" so that the metrics of
>> ORC files could be built correctly during the import process.
>>
>> After that, I plan to write new data into the Iceberg table (using the
>> ORC version 1.6.5 in the iceberg package), how could I deal with that name
>> mapping used for importing ? Should I remove that? Does that name mapping
>> do any harm when reading/writing from/to the new ORC file?
>>
>
> If I understand correctly the name-mapping would only apply if there were
> no Iceberg IDs found in the ORC file as type attributes, which is the case
> for the imported data. All new data you write with Iceberg/ORC will have
> the Iceberg field-id stored as a type attribute, so when reading those new
> files the name-mapping should have no effect since the read path will
> detect the Iceberg field-ids.
>
> Cheers,
> --
> Edgar R
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: About importing Hive tables and name mapping

Posted by Edgar Rodriguez <ed...@airbnb.com.INVALID>.

Hi Xiang,

On Thu, Nov 5, 2020 at 11:07 AM 李响 <wa...@gmail.com> wrote:

> Dear community:
>
> I am using SparkTableUtil to import an existing Hive table to an Iceberg
> table.
> The ORC files of Hive table is an old version of ORC, so I set a name
> mapping (like: id 1 mapped to _col0 and id 2 mapped to _col1...) to the
> Iceberg table by using "schema.name-mapping.default" so that the metrics of
> ORC files could be built correctly during the import process.
>
> After that, I plan to write new data into the Iceberg table (using the ORC
> version 1.6.5 in the iceberg package), how could I deal with that name
> mapping used for importing ? Should I remove that? Does that name mapping
> do any harm when reading/writing from/to the new ORC file?
>

If I understand correctly the name-mapping would only apply if there were
no Iceberg IDs found in the ORC file as type attributes, which is the case
for the imported data. All new data you write with Iceberg/ORC will have
the Iceberg field-id stored as a type attribute, so when reading those new
files the name-mapping should have no effect since the read path will
detect the Iceberg field-ids.
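
A simplified model of that behavior, purely as a sketch: if any column in a
file carries a field id, the ids are used as-is; only a file with no ids at
all falls back to the name mapping. The function resolve_field_ids and the
dictionary shapes are hypothetical and do not mirror the actual Iceberg read
path.

```python
def resolve_field_ids(file_columns, name_mapping):
    """Return {column name: field id} for one data file.

    file_columns: list of dicts with "name" and an optional "field_id".
    name_mapping: list of dicts with "field-id" and "names".
    """
    has_ids = any(col.get("field_id") is not None for col in file_columns)
    if has_ids:
        # File was written with field ids: the name mapping is ignored.
        return {col["name"]: col.get("field_id") for col in file_columns}

    # No ids in the file (e.g. imported data): fall back to the name mapping.
    name_to_id = {
        name: entry["field-id"]
        for entry in name_mapping
        for name in entry["names"]
    }
    return {col["name"]: name_to_id.get(col["name"]) for col in file_columns}


name_mapping = [
    {"field-id": 1, "names": ["_col0"]},
    {"field-id": 2, "names": ["_col1"]},
]

# Imported file: no field ids, positional column names.
imported = resolve_field_ids([{"name": "_col0"}, {"name": "_col1"}], name_mapping)

# Newly written file: field ids present, so the mapping has no effect.
written = resolve_field_ids(
    [{"name": "id", "field_id": 1}, {"name": "data", "field_id": 2}],
    name_mapping,
)
print(imported, written)
```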

Cheers,
-- 
Edgar R