Posted to dev@iceberg.apache.org by Huadong Liu <hu...@gmail.com> on 2021/07/01 06:42:20 UTC

Re: migrating Hadoop tables to tables with hive catalog

FYI, I was able to do the migration by casting each ManifestFile
to GenericManifestFile, resetting its sequence number and snapshot id, and
adding it to the new table's AppendFiles.

On Mon, Jun 28, 2021 at 3:49 PM Huadong Liu <hu...@gmail.com> wrote:

> Hi,
>
> I am trying to migrate an Iceberg Hadoop table to a table using the Hive
> catalog. Luckily the table is append-only, so there are no delete files.
> It is not clear which APIs were used in a previous post
> <https://lists.apache.org/thread.html/r39f2c773bc06889cb19d7de3729d868fccbafbafcfab1922332a4dc6%40%3Cdev.iceberg.apache.org%3E>.
>
> The list of ManifestFiles in the current snapshot can be obtained with the
> Snapshot allManifests
> <https://iceberg.apache.org/javadoc/0.11.1/org/apache/iceberg/Snapshot.html#allManifests-->
> API. However, they cannot be added to the new table's AppendFiles
> <https://iceberg.apache.org/javadoc/0.11.1/org/apache/iceberg/AppendFiles.html> for
> committing because the snapshot id needs to be blank
> <https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/MergeAppend.java#L55>.
>
> Alternatively, the table snapshots
> <https://iceberg.apache.org/javadoc/0.11.1/org/apache/iceberg/Table.html#snapshots--> API
> can be used to get all snapshots of the table. From there, data files for
> each snapshot can be obtained with addedFiles
> <https://iceberg.apache.org/javadoc/0.11.1/org/apache/iceberg/Snapshot.html#addedFiles-->
> API and then added to the AppendFiles of the new table in the Hive catalog.
>
> I am not sure the latter is correct for the migration. Any input is
> appreciated.
>
> --
> Huadong
>

Re: migrating Hadoop tables to tables with hive catalog

Posted by Huadong Liu <hu...@gmail.com>.
Thank you all. That saves rewriting all the manifest files, which is a lot.
I did the following and it seems to be working fine.

1. Create an Iceberg table using the Hive catalog with the same table schema,
partition spec, etc.
2. Copy the Hadoop table's latest vddddd.metadata.json over the Hive table's
metadata json.
3. Change table-uuid in the copied file back to the uuid from the original Hive
table metadata json.
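
Steps 2 and 3 above can be sketched roughly as follows. This is only an
illustration of the idea, not the exact procedure that was run: it treats the
two metadata files as plain JSON and touches nothing except the table-uuid
field. The function names adopt_hadoop_metadata and rewrite_metadata_text are
hypothetical, and reading and writing the actual files (local disk, HDFS, S3)
is left out.

```python
import json


def adopt_hadoop_metadata(hadoop_metadata: dict, hive_metadata: dict) -> dict:
    """Return a copy of the Hadoop table metadata carrying the Hive table's table-uuid.

    Hypothetical helper: every other field (snapshots, schema, location, ...)
    is taken unchanged from the Hadoop table's metadata.
    """
    merged = dict(hadoop_metadata)  # shallow copy; the input dicts are not modified
    merged["table-uuid"] = hive_metadata["table-uuid"]
    return merged


def rewrite_metadata_text(hadoop_json: str, hive_json: str) -> str:
    """Apply the same swap to the raw JSON text of the two metadata files."""
    merged = adopt_hadoop_metadata(json.loads(hadoop_json), json.loads(hive_json))
    return json.dumps(merged, indent=2)
```

The text returned by rewrite_metadata_text is what would replace the Hive
table's metadata json in step 2, with step 3 already applied.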


On Thu, Jul 1, 2021 at 7:00 AM Ryan Murray <ry...@gmail.com> wrote:

> I had a short proposal here[1] suggesting the same as Russell. I
> think this is probably a more broadly useful operation but I don't really
> know the best place for it to live. I'm happy to finish the proposal if
> there are some opinions on where in Iceberg it is appropriate to add such
> functionality.
>
> Best,
> Ryan
>
> [1] https://github.com/apache/iceberg/issues/2288
>
> On Thu, Jul 1, 2021 at 3:34 PM Russell Spitzer <ru...@gmail.com>
> wrote:
>
>> I think you could probably also do this by just creating a Hive table and
>> then changing the location to point to the most recent hadoop metadata.json
>> file.

Re: migrating Hadoop tables to tables with hive catalog

Posted by Ryan Murray <ry...@gmail.com>.
I had a short proposal here[1] suggesting the same as Russell. I think this
is probably a more broadly useful operation but I don't really know the
best place for it to live. I'm happy to finish the proposal if there are
some opinions on where in Iceberg it is appropriate to add such
functionality.

Best,
Ryan

[1] https://github.com/apache/iceberg/issues/2288

On Thu, Jul 1, 2021 at 3:34 PM Russell Spitzer <ru...@gmail.com>
wrote:

> I think you could probably also do this by just creating a Hive table and
> then changing the location to point to the most recent hadoop metadata.json
> file.

Re: migrating Hadoop tables to tables with hive catalog

Posted by Russell Spitzer <ru...@gmail.com>.
I think you could probably also do this by just creating a Hive table and then changing the location to point to the most recent hadoop metadata.json file.
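
Finding "the most recent hadoop metadata.json" could look something like the
sketch below. It assumes the Hadoop table's metadata directory holds files
named v<N>.metadata.json and that the highest N is the current version; the
helper name latest_metadata_file is hypothetical, and compressed variants of
the file name are not handled. How the Hive table is then pointed at that file
is not shown; as far as I know the Hive catalog keeps the pointer in the
metadata_location table property, but treat that as an assumption.

```python
import re

# Matches names such as v12.metadata.json and captures the version number.
_VERSIONED = re.compile(r"^v(\d+)\.metadata\.json$")


def latest_metadata_file(file_names):
    """Return the vN.metadata.json name with the highest N, or None if there is none."""
    best_version, best_name = -1, None
    for name in file_names:
        match = _VERSIONED.match(name)
        # Compare versions numerically so that v10 ranks above v9.
        if match and int(match.group(1)) > best_version:
            best_version, best_name = int(match.group(1)), name
    return best_name
```

Given the names v1.metadata.json, v9.metadata.json, v10.metadata.json and
version-hint.text, this returns v10.metadata.json.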
