Posted to dev@iceberg.apache.org by Junjie Chen <ch...@gmail.com> on 2019/12/23 08:57:02 UTC

How to add Datafiles from an existing iceberg table?

Hi community

I tried to add data files from an existing Iceberg table to a target
Iceberg table with the following code (unit test):

    Iterator<DataFile> datafiles =
        sourceTable.currentSnapshot().addedFiles().iterator();

    while (datafiles.hasNext()) {
      targetTable.newAppend().appendFile(datafiles.next()).commit();
    }

It throws the exception below (this can be reproduced in a unit test as
well; I tried it in testRewrites and it throws an NPE):

    org.apache.avro.file.DataFileWriter$AppendWriteException:
        java.lang.ClassCastException: java.util.Collections$UnmodifiableMap
        cannot be cast to java.lang.Long
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308)
    at org.apache.iceberg.avro.AvroFileAppender.add(AvroFileAppender.java:52)
    at org.apache.iceberg.ManifestWriter.addEntry(ManifestWriter.java:133)
    at org.apache.iceberg.ManifestWriter.add(ManifestWriter.java:147)
    at org.apache.iceberg.ManifestWriter.add(ManifestWriter.java:36)
    at org.apache.iceberg.io.FileAppender.addAll(FileAppender.java:32)
    at org.apache.iceberg.io.FileAppender.addAll(FileAppender.java:37)
    ...

After debugging, I found that the GenericDataFile read from the existing
table has a defined fromProjectionPos array (0->0, ...4->4, 5->9,
6->10, 7->11, 8->12...), while the GenericAvroWriter is initialized
without such a projection, so writing the object throws the
ClassCastException/NPE.

My question is: how can this be solved? Or is there another way to add
data files from an existing table?

Thanks

Re: How to add Datafiles from an existing iceberg table?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Junjie,

The problem is that your writer doesn't have the same schema as the records
you're passing to it, because addedFiles doesn't project all columns.
Writers assume that the write schema and the record schema match, and will
throw an exception like this if they don't.

The projection in addedFiles also came up on the PR that adds a cherry-pick
operation, because that operation uses addedFiles and appends them. The fix
in that PR is to always project the entire schema when returning added
files. You could make that change in a separate PR to fix this as well.
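
Until that change is in, a rough workaround sketch is to rebuild each
DataFile with DataFiles.builder against the target table's spec, so the
object you append is constructed fresh rather than carrying the projection
from addedFiles. The class and method names below are just illustrative,
and the sketch only copies the core file fields, so column-level metrics
are dropped unless you also carry them over with withMetrics:

    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.Table;

    public class CopyDataFiles {
      // Illustrative helper: append the files added in the source table's
      // current snapshot to the target table.
      static void copyAddedFiles(Table sourceTable, Table targetTable) {
        AppendFiles append = targetTable.newAppend();
        for (DataFile file : sourceTable.currentSnapshot().addedFiles()) {
          // Rebuild the DataFile so the appended object has no projection
          // state; only core fields are copied, column metrics are dropped.
          DataFiles.Builder builder = DataFiles.builder(targetTable.spec())
              .withPath(file.path().toString())
              .withFormat(file.format())
              .withRecordCount(file.recordCount())
              .withFileSizeInBytes(file.fileSizeInBytes());
          if (targetTable.spec().fields().size() > 0) {
            // copy partition data only when the target spec is partitioned
            builder.withPartition(file.partition());
          }
          append.appendFile(builder.build());
        }
        append.commit();  // one commit for all files instead of one per file
      }
    }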

-- 
Ryan Blue
Software Engineer
Netflix