Posted to dev@iceberg.apache.org by kkishore iiith <kk...@gmail.com> on 2021/02/10 01:35:54 UTC

Followup from iceberg newbie questions

Hello,

This is followup from
https://lists.apache.org/thread.html/rd15bf1db711b1a31f39d4b98776f29753b544fa3a496111d3460e11e%40%3Cdev.iceberg.apache.org%3E

*If a file system does not support atomic renames, then you should use a
metastore to track tables. You can use Hive, Nessie, or Glue. We also are
working on a JDBC catalog.*

1. What would go wrong if I write directly to GCS from Spark via Iceberg?
Would we end up having data in GCS but missing the Iceberg metadata for
these files? Or would we just lose some snapshots during multiple parallel
transactions?

*Iceberg's API can tell you what files were added or removed in any given
snapshot. You can also use time travel to query the table at a given
snapshot and use SQL to find the row-level changes. We don't currently
support reading just the changes in a snapshot because there may be deletes
as well as inserts.*
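
A sketch of that snapshot-diff approach with the Spark source (the snapshot
ids below are hypothetical):

val atOld = spark.read.format("iceberg").option("snapshot-id", olderSnapshotId).load("db.table")
val atNew = spark.read.format("iceberg").option("snapshot-id", newerSnapshotId).load("db.table")
val inserted = atNew.except(atOld)
val deleted = atOld.except(atNew)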

2. I would like to further clarify whether Iceberg supports an incremental
query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
incremental reads to query data between snapshots, but I am confused by the
above response and
http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E
where you said that incremental queries are not supported natively. If the
latter way
<http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E>
is the only way to derive incremental data, does Iceberg use predicate
pushdown to get the incremental data based on the file delta, since
Iceberg's metadata contains file info for both snapshots?

Thanks,
Kishor.

Re: Followup from iceberg newbie questions

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'm not sure what you're asking in question 1, but the generally correct
way to write to an Iceberg table is to use the integration. Writing to
pre-existing paths sounds like an unreliable idea to me.

For question 2, my answer is also to use the existing integration. If you
write your own files and then try to append them to an Iceberg table, you
will probably leave out metrics and not have the correct Iceberg schema in
the file. You can use the file-level API, but you should still write your
data files using an Iceberg writer and not some other non-Iceberg path. It
is possible to do that, but you need to know a lot about how Iceberg works
and the trade-offs. So I recommend writing with Iceberg.
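
For completeness, writing through the Spark integration can be as simple as
the following sketch (the source path and table name are hypothetical, and it
assumes the Iceberg runtime jar is on the classpath):

val df = spark.read.parquet("gs://my-bucket/staging/")
df.write.format("iceberg").mode("append").save("db.table")

Written this way, Iceberg collects column metrics, writes files with the
table's schema, and makes the commit through the table's catalog.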

On Wed, Feb 10, 2021 at 9:46 AM Professional <kk...@gmail.com>
wrote:

> Ryan,
>
> I agree with you about that part. Could you please clarify (1) and (2)?
>
> Sent from my iPhone
>
> On Feb 10, 2021, at 8:55 AM, Ryan Blue <rb...@netflix.com> wrote:
>
> 
> You should always write to Iceberg using the supplied integration. That
> way your data has metrics to make your queries faster and schemas are
> written correctly.
>
> On Tue, Feb 9, 2021 at 6:24 PM kkishore iiith <kk...@gmail.com>
> wrote:
>
>> Ryan,
>>
>> It would be nice to include that on the Iceberg website, as the feature
>> seems like a common ask.
>>
>> Our Spark job needs to return the GCS filenames, as the downstream service
>> would load these GCS files into BigQuery.
>>
>> So, we have two options here; could you please clarify both?
>> (1) We write to Iceberg via the Hive metastore since GCS doesn't support
>> atomic renames, but would we be able to write to Iceberg via Hive using
>> pre-known filenames? I see all the examples using the Hive table path.
>>
>> (2) After we write to GCS, we can use the Iceberg table API to add the
>> files that we wrote in a transaction, as follows. But does the comment
>> about atomicity in the previous response also apply via this route, i.e.,
>> losing snapshots across multiple parallel commits?
>>
>> appendFiles.appendFile(DataFiles.builder(partitionSpec)
>>         .withPath(..)
>>         .withFileSizeInBytes(..)
>>         .withRecordCount(..)
>>         .withFormat(..)
>>         .build());
>>
>> appendFiles.commit()
>>
>>
>> On Tue, Feb 9, 2021 at 6:05 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Sorry, I was mistaken about this. We have exposed the incremental read
>>> functionality using DataFrame options
>>> <https://github.com/apache/iceberg/blob/d8cc2a29364e57df95c4e50f4079bacd35e4a047/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java#L67-L68>.
>>>
>>> You should be able to use it like this:
>>>
>>> val df = spark.read.format("iceberg").option("start-snapshot-id", lastSnapshotId).load("db.table")
>>>
>>> I hope that helps!
>>>
>>> rb
>>>
>>> On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Replies inline.
>>>>
>>>> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kk...@gmail.com>
>>>> wrote:
>>>>
>>>>> *If a file system does not support atomic renames, then you should use
>>>>> a metastore to track tables. You can use Hive, Nessie, or Glue. We also are
>>>>> working on a JDBC catalog.*
>>>>>
>>>>> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
>>>>> Would we end up having data in GCS but missing the Iceberg metadata for
>>>>> these files? Or would we just lose some snapshots during multiple parallel
>>>>> transactions?
>>>>>
>>>>
>>>> If you choose not to use a metastore and use a "Hadoop" table instead,
>>>> then there isn't a guarantee that concurrent writers won't clobber each
>>>> other. You'll probably lose some commits when two writers commit at the
>>>> same time with the same base version.
>>>>
>>>>
>>>>> *Iceberg's API can tell you what files were added or removed in any
>>>>> given snapshot. You can also use time travel to query the table at a given
>>>>> snapshot and use SQL to find the row-level changes. We don't currently
>>>>> support reading just the changes in a snapshot because there may be deletes
>>>>> as well as inserts.*
>>>>>
>>>>> 2. I would like to further clarify whether Iceberg supports an incremental
>>>>> query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>>>>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>>>>> incremental reads to query data between snapshots, but I am confused by the
>>>>> above response and
>>>>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E
>>>>> where you said that incremental queries are not supported natively. If the
>>>>> latter way
>>>>> <http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E>
>>>>> is the only way to derive incremental data, does Iceberg use predicate
>>>>> pushdown to get the incremental data based on the file delta, since
>>>>> Iceberg's metadata contains file info for both snapshots?
>>>>>
>>>>
>>>> Iceberg can plan incremental scans to read data that was added since
>>>> some snapshot. This isn't exposed through Spark yet, but could be. I've
>>>> considered adding support for git-like `..` expressions: `SELECT * FROM
>>>> db.table.1234..5678`.
>>>>
>>>> One problem with this approach is that it is limited when it encounters
>>>> something other than an append. For example, Iceberg supports atomic
>>>> overwrites to rewrite data in a table. When the latest snapshot is an
>>>> overwrite, it isn't clear exactly what an incremental read should produce.
>>>> We're open to ideas here, like producing "delete" records as well as
>>>> "insert" records with an extra column for the operation. But this is
>>>> something we'd need to consider.
>>>>
>>>> I don't think Hudi has this problem because it only supports insert and
>>>> upsert, if I remember correctly.
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Followup from iceberg newbie questions

Posted by Professional <kk...@gmail.com>.
Ryan,

I agree with you about that part. Could you please clarify (1) and (2)?

Sent from my iPhone

> On Feb 10, 2021, at 8:55 AM, Ryan Blue <rb...@netflix.com> wrote:
> 
> 
> You should always write to Iceberg using the supplied integration. That way your data has metrics to make your queries faster and schemas are written correctly.
> 
>> On Tue, Feb 9, 2021 at 6:24 PM kkishore iiith <kk...@gmail.com> wrote:
>> Ryan,
>> 
>> It would be nice to include that on the Iceberg website, as the feature seems like a common ask.
>> 
>> Our Spark job needs to return the GCS filenames, as the downstream service would load these GCS files into BigQuery.
>> 
>> So, we have two options here; could you please clarify both?
>> (1) We write to Iceberg via the Hive metastore since GCS doesn't support atomic renames, but would we be able to write to Iceberg via Hive using pre-known filenames? I see all the examples using the Hive table path.
>> 
>> (2) After we write to GCS, we can use the Iceberg table API to add the files that we wrote in a transaction, as follows. But does the comment about atomicity in the previous response also apply via this route, i.e., losing snapshots across multiple parallel commits?
>> 
>> appendFiles.appendFile(DataFiles.builder(partitionSpec)
>>         .withPath(..)
>>         .withFileSizeInBytes(..)
>>         .withRecordCount(..)
>>         .withFormat(..)
>>         .build());
>> appendFiles.commit()
>> 
>>> On Tue, Feb 9, 2021 at 6:05 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>> Sorry, I was mistaken about this. We have exposed the incremental read functionality using DataFrame options.
>>> 
>>> You should be able to use it like this:
>>> 
>>> val df = spark.read.format("iceberg").option("start-snapshot-id", lastSnapshotId).load("db.table")
>>> I hope that helps!
>>> 
>>> rb
>>> 
>>> 
>>>> On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:
>>>> Replies inline.
>>>> 
>>>>> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kk...@gmail.com> wrote:
>>>>> If a file system does not support atomic renames, then you should use a metastore to track tables. You can use Hive, Nessie, or Glue. We also are working on a JDBC catalog.
>>>>> 
>>>>> 1. What would go wrong if I write directly to GCS from Spark via Iceberg? Would we end up having data in GCS but missing the Iceberg metadata for these files? Or would we just lose some snapshots during multiple parallel transactions?
>>>> 
>>>> If you choose not to use a metastore and use a "Hadoop" table instead, then there isn't a guarantee that concurrent writers won't clobber each other. You'll probably lose some commits when two writers commit at the same time with the same base version.
>>>>  
>>>>> Iceberg's API can tell you what files were added or removed in any given snapshot. You can also use time travel to query the table at a given snapshot and use SQL to find the row-level changes. We don't currently support reading just the changes in a snapshot because there may be deletes as well as inserts.
>>>>> 
>>>>> 2. I would like to further clarify whether Iceberg supports an incremental query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query. https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about incremental reads to query data between snapshots, but I am confused by the above response and http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E where you said that incremental queries are not supported natively. If the latter way is the only way to derive incremental data, does Iceberg use predicate pushdown to get the incremental data based on the file delta, since Iceberg's metadata contains file info for both snapshots?
>>>> 
>>>> 
>>>> Iceberg can plan incremental scans to read data that was added since some snapshot. This isn't exposed through Spark yet, but could be. I've considered adding support for git-like `..` expressions: `SELECT * FROM db.table.1234..5678`.
>>>> 
>>>> One problem with this approach is that it is limited when it encounters something other than an append. For example, Iceberg supports atomic overwrites to rewrite data in a table. When the latest snapshot is an overwrite, it isn't clear exactly what an incremental read should produce. We're open to ideas here, like producing "delete" records as well as "insert" records with an extra column for the operation. But this is something we'd need to consider.
>>>> 
>>>> I don't think Hudi has this problem because it only supports insert and upsert, if I remember correctly.
>>>> 
>>>> -- 
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix

Re: Followup from iceberg newbie questions

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
You should always write to Iceberg using the supplied integration. That way
your data has metrics to make your queries faster and schemas are written
correctly.

On Tue, Feb 9, 2021 at 6:24 PM kkishore iiith <kk...@gmail.com>
wrote:

> Ryan,
>
> It would be nice to include that on the Iceberg website, as the feature
> seems like a common ask.
>
> Our Spark job needs to return the GCS filenames, as the downstream service
> would load these GCS files into BigQuery.
>
> So, we have two options here; could you please clarify both?
> (1) We write to Iceberg via the Hive metastore since GCS doesn't support
> atomic renames, but would we be able to write to Iceberg via Hive using
> pre-known filenames? I see all the examples using the Hive table path.
>
> (2) After we write to GCS, we can use the Iceberg table API to add the
> files that we wrote in a transaction, as follows. But does the comment
> about atomicity in the previous response also apply via this route, i.e.,
> losing snapshots across multiple parallel commits?
>
> appendFiles.appendFile(DataFiles.builder(partitionSpec)
>         .withPath(..)
>         .withFileSizeInBytes(..)
>         .withRecordCount(..)
>         .withFormat(..)
>         .build());
>
> appendFiles.commit()
>
>
> On Tue, Feb 9, 2021 at 6:05 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Sorry, I was mistaken about this. We have exposed the incremental read
>> functionality using DataFrame options
>> <https://github.com/apache/iceberg/blob/d8cc2a29364e57df95c4e50f4079bacd35e4a047/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java#L67-L68>.
>>
>> You should be able to use it like this:
>>
>> val df = spark.read.format("iceberg").option("start-snapshot-id", lastSnapshotId).load("db.table")
>>
>> I hope that helps!
>>
>> rb
>>
>> On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Replies inline.
>>>
>>> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kk...@gmail.com>
>>> wrote:
>>>
>>>> *If a file system does not support atomic renames, then you should use
>>>> a metastore to track tables. You can use Hive, Nessie, or Glue. We also are
>>>> working on a JDBC catalog.*
>>>>
>>>> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
>>>> Would we end up having data in GCS but missing the Iceberg metadata for
>>>> these files? Or would we just lose some snapshots during multiple parallel
>>>> transactions?
>>>>
>>>
>>> If you choose not to use a metastore and use a "Hadoop" table instead,
>>> then there isn't a guarantee that concurrent writers won't clobber each
>>> other. You'll probably lose some commits when two writers commit at the
>>> same time with the same base version.
>>>
>>>
>>>> *Iceberg's API can tell you what files were added or removed in any
>>>> given snapshot. You can also use time travel to query the table at a given
>>>> snapshot and use SQL to find the row-level changes. We don't currently
>>>> support reading just the changes in a snapshot because there may be deletes
>>>> as well as inserts.*
>>>>
>>>> 2. I would like to further clarify whether Iceberg supports an incremental
>>>> query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>>>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>>>> incremental reads to query data between snapshots, but I am confused by the
>>>> above response and
>>>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E
>>>> where you said that incremental queries are not supported natively. If the
>>>> latter way
>>>> <http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E>
>>>> is the only way to derive incremental data, does Iceberg use predicate
>>>> pushdown to get the incremental data based on the file delta, since
>>>> Iceberg's metadata contains file info for both snapshots?
>>>>
>>>
>>> Iceberg can plan incremental scans to read data that was added since
>>> some snapshot. This isn't exposed through Spark yet, but could be. I've
>>> considered adding support for git-like `..` expressions: `SELECT * FROM
>>> db.table.1234..5678`.
>>>
>>> One problem with this approach is that it is limited when it encounters
>>> something other than an append. For example, Iceberg supports atomic
>>> overwrites to rewrite data in a table. When the latest snapshot is an
>>> overwrite, it isn't clear exactly what an incremental read should produce.
>>> We're open to ideas here, like producing "delete" records as well as
>>> "insert" records with an extra column for the operation. But this is
>>> something we'd need to consider.
>>>
>>> I don't think Hudi has this problem because it only supports insert and
>>> upsert, if I remember correctly.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Followup from iceberg newbie questions

Posted by kkishore iiith <kk...@gmail.com>.
Ryan,

It would be nice to include that on the Iceberg website, as the feature
seems like a common ask.

Our Spark job needs to return the GCS filenames, as the downstream service
would load these GCS files into BigQuery.

So, we have two options here; could you please clarify both?
(1) We write to Iceberg via the Hive metastore since GCS doesn't support
atomic renames, but would we be able to write to Iceberg via Hive using
pre-known filenames? I see all the examples using the Hive table path.

(2) After we write to GCS, we can use the Iceberg table API to add the
files that we wrote in a transaction, as follows. But does the comment
about atomicity in the previous response also apply via this route, i.e.,
losing snapshots across multiple parallel commits?

appendFiles.appendFile(DataFiles.builder(partitionSpec)
        .withPath(..)
        .withFileSizeInBytes(..)
        .withRecordCount(..)
        .withFormat(..)
        .build());

appendFiles.commit()
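
A fuller sketch of what option (2) could look like is below. It assumes an
unpartitioned table tracked by a Hive catalog (so the commit goes through the
metastore rather than a GCS rename); the catalog setup, table name, file
path, and counts are all hypothetical:

import org.apache.iceberg.{DataFiles, FileFormat}
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog

// Load the table through the Hive catalog so the commit is mediated by the metastore.
val catalog = new HiveCatalog(spark.sessionState.newHadoopConf())
val table = catalog.loadTable(TableIdentifier.of("db", "table"))

// Describe a data file that was already written to GCS (values are hypothetical).
val dataFile = DataFiles.builder(table.spec())
  .withPath("gs://my-bucket/db/table/data/part-00000.parquet")
  .withFileSizeInBytes(123456789L)
  .withRecordCount(1000000L)
  .withFormat(FileFormat.PARQUET)
  .build()

// Append the file and commit; a conflicting concurrent commit should be
// detected by the catalog instead of being silently lost.
table.newAppend().appendFile(dataFile).commit()

// The GCS file paths for the downstream BigQuery load can then be read back
// from the files metadata table.
val filePaths = spark.read.format("iceberg").load("db.table.files").select("file_path")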


On Tue, Feb 9, 2021 at 6:05 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Sorry, I was mistaken about this. We have exposed the incremental read
> functionality using DataFrame options
> <https://github.com/apache/iceberg/blob/d8cc2a29364e57df95c4e50f4079bacd35e4a047/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java#L67-L68>.
>
> You should be able to use it like this:
>
> val df = spark.read.format("iceberg").option("start-snapshot-id", lastSnapshotId).load("db.table")
>
> I hope that helps!
>
> rb
>
> On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Replies inline.
>>
>> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kk...@gmail.com>
>> wrote:
>>
>>> *If a file system does not support atomic renames, then you should use a
>>> metastore to track tables. You can use Hive, Nessie, or Glue. We also are
>>> working on a JDBC catalog.*
>>>
>>> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
>>> Would we end up having data in GCS but missing the Iceberg metadata for
>>> these files? Or would we just lose some snapshots during multiple parallel
>>> transactions?
>>>
>>
>> If you choose not to use a metastore and use a "Hadoop" table instead,
>> then there isn't a guarantee that concurrent writers won't clobber each
>> other. You'll probably lose some commits when two writers commit at the
>> same time with the same base version.
>>
>>
>>> *Iceberg's API can tell you what files were added or removed in any
>>> given snapshot. You can also use time travel to query the table at a given
>>> snapshot and use SQL to find the row-level changes. We don't currently
>>> support reading just the changes in a snapshot because there may be deletes
>>> as well as inserts.*
>>>
>>> 2. I would like to further clarify whether Iceberg supports an incremental
>>> query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>>> incremental reads to query data between snapshots, but I am confused by the
>>> above response and
>>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E
>>> where you said that incremental queries are not supported natively. If the
>>> latter way
>>> <http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E>
>>> is the only way to derive incremental data, does Iceberg use predicate
>>> pushdown to get the incremental data based on the file delta, since
>>> Iceberg's metadata contains file info for both snapshots?
>>>
>>
>> Iceberg can plan incremental scans to read data that was added since some
>> snapshot. This isn't exposed through Spark yet, but could be. I've
>> considered adding support for git-like `..` expressions: `SELECT * FROM
>> db.table.1234..5678`.
>>
>> One problem with this approach is that it is limited when it encounters
>> something other than an append. For example, Iceberg supports atomic
>> overwrites to rewrite data in a table. When the latest snapshot is an
>> overwrite, it isn't clear exactly what an incremental read should produce.
>> We're open to ideas here, like producing "delete" records as well as
>> "insert" records with an extra column for the operation. But this is
>> something we'd need to consider.
>>
>> I don't think Hudi has this problem because it only supports insert and
>> upsert, if I remember correctly.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Followup from iceberg newbie questions

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Sorry, I was mistaken about this. We have exposed the incremental read
functionality using DataFrame options
<https://github.com/apache/iceberg/blob/d8cc2a29364e57df95c4e50f4079bacd35e4a047/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java#L67-L68>.

You should be able to use it like this:

val df = spark.read.format("iceberg").option("start-snapshot-id", lastSnapshotId).load("db.table")
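
A bounded incremental read between two snapshots should also be possible by
providing an end snapshot as well (a sketch; both snapshot ids are
hypothetical):

val changes = spark.read.format("iceberg")
  .option("start-snapshot-id", olderSnapshotId)
  .option("end-snapshot-id", newerSnapshotId)
  .load("db.table")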

I hope that helps!

rb

On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:

> Replies inline.
>
> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kk...@gmail.com>
> wrote:
>
>> *If a file system does not support atomic renames, then you should use a
>> metastore to track tables. You can use Hive, Nessie, or Glue. We also are
>> working on a JDBC catalog.*
>>
>> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
>> Would we end up having data in GCS but missing the Iceberg metadata for
>> these files? Or would we just lose some snapshots during multiple parallel
>> transactions?
>>
>
> If you choose not to use a metastore and use a "Hadoop" table instead,
> then there isn't a guarantee that concurrent writers won't clobber each
> other. You'll probably lose some commits when two writers commit at the
> same time with the same base version.
>
>
>> *Iceberg's API can tell you what files were added or removed in any given
>> snapshot. You can also use time travel to query the table at a given
>> snapshot and use SQL to find the row-level changes. We don't currently
>> support reading just the changes in a snapshot because there may be deletes
>> as well as inserts.*
>>
>> 2. I would like to further clarify whether Iceberg supports an incremental
>> query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>> incremental reads to query data between snapshots, but I am confused by the
>> above response and
>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E
>> where you said that incremental queries are not supported natively. If the
>> latter way
>> <http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E>
>> is the only way to derive incremental data, does Iceberg use predicate
>> pushdown to get the incremental data based on the file delta, since
>> Iceberg's metadata contains file info for both snapshots?
>>
>
> Iceberg can plan incremental scans to read data that was added since some
> snapshot. This isn't exposed through Spark yet, but could be. I've
> considered adding support for git-like `..` expressions: `SELECT * FROM
> db.table.1234..5678`.
>
> One problem with this approach is that it is limited when it encounters
> something other than an append. For example, Iceberg supports atomic
> overwrites to rewrite data in a table. When the latest snapshot is an
> overwrite, it isn't clear exactly what an incremental read should produce.
> We're open to ideas here, like producing "delete" records as well as
> "insert" records with an extra column for the operation. But this is
> something we'd need to consider.
>
> I don't think Hudi has this problem because it only supports insert and
> upsert, if I remember correctly.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Followup from iceberg newbie questions

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Replies inline.

On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kk...@gmail.com>
wrote:

> *If a file system does not support atomic renames, then you should use a
> metastore to track tables. You can use Hive, Nessie, or Glue. We also are
> working on a JDBC catalog.*
>
> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
> Would we end up having data in GCS but missing the Iceberg metadata for
> these files? Or would we just lose some snapshots during multiple parallel
> transactions?
>

If you choose not to use a metastore and use a "Hadoop" table instead, then
there isn't a guarantee that concurrent writers won't clobber each other.
You'll probably lose some commits when two writers commit at the same time
with the same base version.
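
To avoid that, the table can be tracked by a metastore instead. A minimal
sketch of configuring a Hive-backed Iceberg catalog in Spark 3 follows; the
catalog name, metastore URI, and warehouse path are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-on-gcs")
  .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hive_cat.type", "hive")
  .config("spark.sql.catalog.hive_cat.uri", "thrift://metastore-host:9083")
  .config("spark.sql.catalog.hive_cat.warehouse", "gs://my-bucket/warehouse")
  .getOrCreate()

With this, commits go through the metastore's atomic metadata swap instead of
relying on a file system rename.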


> *Iceberg's API can tell you what files were added or removed in any given
> snapshot. You can also use time travel to query the table at a given
> snapshot and use SQL to find the row-level changes. We don't currently
> support reading just the changes in a snapshot because there may be deletes
> as well as inserts.*
>
> 2. I would like to further clarify whether Iceberg supports an incremental
> query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
> incremental reads to query data between snapshots, but I am confused by the
> above response and
> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E
> where you said that incremental queries are not supported natively. If the
> latter way
> <http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3CA237BB81-F4DA-45D9-9827-36203624F9D4@tencent.com%3E>
> is the only way to derive incremental data, does Iceberg use predicate
> pushdown to get the incremental data based on the file delta, since
> Iceberg's metadata contains file info for both snapshots?
>

Iceberg can plan incremental scans to read data that was added since some
snapshot. This isn't exposed through Spark yet, but could be. I've
considered adding support for git-like `..` expressions: `SELECT * FROM
db.table.1234..5678`.
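
Planning such an incremental scan directly with the table API could look like
the following sketch (the Table handle and the two snapshot ids are assumed):

val scan = table.newScan().appendsBetween(olderSnapshotId, newerSnapshotId)
scan.planFiles().forEach(task => println(task.file().path()))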

One problem with this approach is that it is limited when it encounters
something other than an append. For example, Iceberg supports atomic
overwrites to rewrite data in a table. When the latest snapshot is an
overwrite, it isn't clear exactly what an incremental read should produce.
We're open to ideas here, like producing "delete" records as well as
"insert" records with an extra column for the operation. But this is
something we'd need to consider.

I don't think Hudi has this problem because it only supports insert and
upsert, if I remember correctly.

-- 
Ryan Blue
Software Engineer
Netflix