Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/15 12:07:10 UTC

[GitHub] [iceberg] hililiwei opened a new pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

hililiwei opened a new pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749


   https://github.com/apache/iceberg/issues/3699




[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r773426925



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier`  can be any valid table identifier or table path. Refer [TableIdentifier](https://iceberg.apache.org/javadoc/0.12.1/org/apache/iceberg/catalog/TableIdentifier.html)

Review comment:
       I don't think that linking to `TableIdentifier` is helpful. That's a class used internally, but users are going to supply a string here. It would be better to remove this paragraph and replace `tableIdentifier` with an example string, like `load("db.table")`.
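
   Applied to the quoted snippet, that suggestion would look roughly like the following sketch (hypothetical example; `streamStartTimestamp` stands in for an epoch-millis `Long`, as in the quoted docs):

   ```scala
   val df = spark.readStream
       .format("iceberg")
       .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, streamStartTimestamp.toString)
       .load("db.table")
   ```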






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r770167693



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,28 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier` can be:
+
+* The fully-qualified path to a HDFS table, like `hdfs://nn:8020/path/to/table`
+* A table name if the table is tracked by a catalog, like `database.table_name`

Review comment:
       I can't find a reference anywhere in the document yet; how about I link it to the Javadoc (https://iceberg.apache.org/javadoc/0.12.1/org/apache/iceberg/catalog/TableIdentifier.html)?
   
   Maybe I could add its definition to the document in this issue? Or should we create a separate issue to track this?






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r773430186



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier`  can be any valid table identifier or table path. Refer [TableIdentifier](https://iceberg.apache.org/javadoc/0.12.1/org/apache/iceberg/catalog/TableIdentifier.html)
+
+!!! Note
+    Iceberg only supports read data from snapshot whose type of Data Operations is APPEND\REPLACE\DELETE. In particular if some of your snapshots are of DELETE type, you need to add 'streaming-skip-delete-snapshots' option to skip it, otherwise the task will fail.

Review comment:
       There are a few issues with this paragraph:
   * Typo: "only supports reading"
   * Typo: "from snapshots"
   * Change "snapshots whose type of Data Operations ..." to "append snapshots" because it is much shorter
   * As a separate sentence, add that delete and overwrite cannot be processed and will cause an exception
   * Then in the last sentence add that deletes can be ignored: "To ignore delete snapshots, add `streaming-skip-delete-snapshots=true`"
   
   Keep in mind that people reading the documentation probably don't know Iceberg internals. Referring to "Data Operations" is not very clear to most readers.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776579234



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -200,10 +201,13 @@ public void stop() {
   private boolean shouldProcess(Snapshot snapshot) {
     String op = snapshot.operation();
     Preconditions.checkState(!op.equals(DataOperations.DELETE) || skipDelete,
-        "Cannot process delete snapshot: %s", snapshot.snapshotId());
+        "Cannot process delete snapshot: %s, to ignore snapshots of type delete, set the config %s to true.",
+        snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);
     Preconditions.checkState(
         op.equals(DataOperations.DELETE) || op.equals(DataOperations.APPEND) || op.equals(DataOperations.REPLACE),
-        "Cannot process %s snapshot: %s", op.toLowerCase(Locale.ROOT), snapshot.snapshotId());
+        "Cannot process snapshot: %s, Structured Streaming does not currently support snapshots of type %s",

Review comment:
       Rolled back.






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776877285



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -200,7 +201,8 @@ public void stop() {
   private boolean shouldProcess(Snapshot snapshot) {
     String op = snapshot.operation();
     Preconditions.checkState(!op.equals(DataOperations.DELETE) || skipDelete,
-        "Cannot process delete snapshot: %s", snapshot.snapshotId());
+        "Cannot process delete snapshot: %s, to ignore deletes, set %s=true.",
+        snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);

Review comment:
       Looks good now.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776579033



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:

Review comment:
       Sorry for my carelessness.






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776876806



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:

Review comment:
       Not a problem! This is why we review.






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776876885



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier`  can be any valid table identifier or table path. Refer [TableIdentifier](https://iceberg.apache.org/javadoc/0.12.1/org/apache/iceberg/catalog/TableIdentifier.html)

Review comment:
       It may be useful in some of the API documentation.






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776877006



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -24,7 +24,21 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 
 | Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
 |--------------------------------------------------|----------|------------|------------------------------------------------|
-| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
+| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔         |                                                |

Review comment:
       What changed on this line? Can we roll that back?






[GitHub] [iceberg] rdblue merged pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749


   




[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r775315915



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:

Review comment:
       I don't quite understand. Delete the blank line? This blank line seems to be needed.






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r773426040



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:

Review comment:
       Typo: double space.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776579193



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -24,7 +24,21 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 
 | Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
 |--------------------------------------------------|----------|------------|------------------------------------------------|
-| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
+| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                               |
+
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load("database.table_name")
+```
+
+!!! Note
+    Iceberg only supports reading data from APPEND snapshots. DELETE\OVERWRITE snapshots cannot be processed and will cause an exception. To ignore delete snapshots, add `streaming-skip-delete-snapshots=true` to option.

Review comment:
       done, PTAL.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776945899



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -200,7 +201,8 @@ public void stop() {
   private boolean shouldProcess(Snapshot snapshot) {
     String op = snapshot.operation();
     Preconditions.checkState(!op.equals(DataOperations.DELETE) || skipDelete,
-        "Cannot process delete snapshot: %s", snapshot.snapshotId());
+        "Cannot process delete snapshot: %s, to ignore deletes, set %s=true.",
+        snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);

Review comment:
       thanks :) 






[GitHub] [iceberg] kbendick commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r769837962



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,28 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier` can be:
+
+* The fully-qualified path to a HDFS table, like `hdfs://nn:8020/path/to/table`
+* A table name if the table is tracked by a catalog, like `database.table_name`
+
+!!! Note
+    Iceberg only supports read data from snapshot whose type of Data Operations is APPEND\REPLACE\DELETE. In particular if some of your snapshots are of DELETE type, you need to add 'streaming-skip-delete-snapshots' option to skip it, otherwise the task will fail.

Review comment:
       Nit: In the rich diff, this note isn't coming up formatted. Have you verified using `mkdocs` that this formats like the other parts that use `!!!`?
   
   Also, we might want to just format this as any other config box vs using the `!!! Note` statement.
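
   For reference, a rough sketch of the admonition syntax being discussed (assuming the site uses the standard Python-Markdown `admonition` extension that mkdocs supports; the body must be indented under the `!!!` line for it to render as a box):

   ```markdown
   !!! Note
       Body text indented by four spaces renders inside the callout box;
       unindented text ends the admonition.
   ```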

##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,28 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier` can be:
+
+* The fully-qualified path to a HDFS table, like `hdfs://nn:8020/path/to/table`
+* A table name if the table is tracked by a catalog, like `database.table_name`

Review comment:
       Question / Comment: It might be better to just say that the table identifier can be any valid table identifier or table path and link to any existing docs we have on that, instead of repeating the definition here or hiding it within the Spark streaming reads section (if we don't have it defined somewhere else).
   
   Is there a place we can link to already?

##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -204,7 +204,9 @@ private boolean shouldProcess(Snapshot snapshot) {
         "Cannot process delete snapshot: %s", snapshot.snapshotId());
     Preconditions.checkState(
         op.equals(DataOperations.DELETE) || op.equals(DataOperations.APPEND) || op.equals(DataOperations.REPLACE),
-        "Cannot process %s snapshot: %s", op.toLowerCase(Locale.ROOT), snapshot.snapshotId());
+        "Cannot process snapshot: %s, Structured Streaming does not support snapshots of type %s",

Review comment:
       Nit: Can we say `.... does not currently support snapshots of type %s`? In the future, we will support reading more of them, like we do in Flink.
   
   Also, can we mention the config `streaming-skip-delete-snapshots` in the Preconditions check? That way, if users get this exception, they know the option to get past it if they'd like.
   
   Maybe like 
   ```java
       Preconditions.checkState(
           op.equals(DataOperations.DELETE) || op.equals(DataOperations.APPEND) || op.equals(DataOperations.REPLACE),
           "Cannot process snapshot: %s. Structured Streaming does not support snapshots of type %s. To ignore snapshots of type delete, set the config %s to true.",
          snapshot.snapshotId(),
          op.toLowerCase(Locale.ROOT),
          SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);
   ```






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r770165074



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,28 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier` can be:
+
+* The fully-qualified path to a HDFS table, like `hdfs://nn:8020/path/to/table`
+* A table name if the table is tracked by a catalog, like `database.table_name`
+
+!!! Note
+    Iceberg only supports read data from snapshot whose type of Data Operations is APPEND\REPLACE\DELETE. In particular if some of your snapshots are of DELETE type, you need to add 'streaming-skip-delete-snapshots' option to skip it, otherwise the task will fail.

Review comment:
       I just want it to show like this:
   
   https://iceberg.apache.org/#maintenance/
   
   







[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r773548703



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier`  can be any valid table identifier or table path. Refer [TableIdentifier](https://iceberg.apache.org/javadoc/0.12.1/org/apache/iceberg/catalog/TableIdentifier.html)

Review comment:
       Done. In addition, should we explain `TableIdentifier` etc. in the https://iceberg.apache.org/#terms/ section or elsewhere? This may be useful to developers.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776944095



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -24,7 +24,21 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 
 | Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
 |--------------------------------------------------|----------|------------|------------------------------------------------|
-| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
+| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔         |                                                |
+
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming jobs which starts from a historical timestamp:
+
+```scala
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load("database.table_name")
+```
+
+!!! Note
+    Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception, similarly, delete snapshots will cause an exception by default, but deletes may be ignored by setting `streaming-skip-delete-snapshots=true`.

Review comment:
       done






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776492592



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -24,7 +24,21 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 
 | Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
 |--------------------------------------------------|----------|------------|------------------------------------------------|
-| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
+| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                               |
+
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load("database.table_name")
+```
+
+!!! Note
+    Iceberg only supports reading data from APPEND snapshots. DELETE\OVERWRITE snapshots cannot be processed and will cause an exception. To ignore delete snapshots, add `streaming-skip-delete-snapshots=true` to option.

Review comment:
       * There is no need to capitalize append, delete, and overwrite
   * "DELETE\OVERWRITE" should be "Delete or overwrite" -- try to stick to plain language rather than shortcuts
   * The last two sentences conflict with one another because the first says that a delete will cause an exception. Instead, say that overwrite snapshots will cause an exception and address delete handling in a separate sentence: "By default, delete snapshots will cause an exception, but deletes may be ignored by setting ..."
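
   For readers following the thread, a minimal sketch of a streaming read that opts into ignoring delete snapshots (assuming, per the Java changes quoted in this thread, that `SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS` is the constant for the `streaming-skip-delete-snapshots` option string):

   ```scala
   val df = spark.readStream
       .format("iceberg")
       .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, streamStartTimestamp.toString)
       // skip snapshots that only delete data instead of failing the stream
       .option(SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS, "true")
       .load("database.table_name")
   ```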






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776491439



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:

Review comment:
       There are two spaces between "streaming" and "jobs"






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776877214



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -24,7 +24,21 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 
 | Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
 |--------------------------------------------------|----------|------------|------------------------------------------------|
-| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
+| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔         |                                                |
+
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming jobs which starts from a historical timestamp:
+
+```scala
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load("database.table_name")
+```
+
+!!! Note
+    Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception, similarly, delete snapshots will cause an exception by default, but deletes may be ignored by setting `streaming-skip-delete-snapshots=true`.

Review comment:
       The "overwrite" sentence should end after "cause an exception" and "similarly" should start a new sentence. There's a clean break there.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776579251



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -200,10 +201,13 @@ public void stop() {
   private boolean shouldProcess(Snapshot snapshot) {
     String op = snapshot.operation();
     Preconditions.checkState(!op.equals(DataOperations.DELETE) || skipDelete,
-        "Cannot process delete snapshot: %s", snapshot.snapshotId());
+        "Cannot process delete snapshot: %s, to ignore snapshots of type delete, set the config %s to true.",

Review comment:
       done






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r773427132



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,25 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...

Review comment:
       No need for these lines. I think it is clear that `spark` is a SparkSession.






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r770172059



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -204,7 +204,9 @@ private boolean shouldProcess(Snapshot snapshot) {
         "Cannot process delete snapshot: %s", snapshot.snapshotId());
     Preconditions.checkState(
         op.equals(DataOperations.DELETE) || op.equals(DataOperations.APPEND) || op.equals(DataOperations.REPLACE),
-        "Cannot process %s snapshot: %s", op.toLowerCase(Locale.ROOT), snapshot.snapshotId());
+        "Cannot process snapshot: %s, Structured Streaming does not support snapshots of type %s",

Review comment:
       I put the hint to skip deletes here:
   https://github.com/apache/iceberg/blob/6b4cd711cdb8aa913fd271e0e759d9a90ed7a842/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L204-L206
   
   I've also added the 'currently' keyword.






[GitHub] [iceberg] kbendick commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r770165470



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -26,6 +26,28 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 |--------------------------------------------------|----------|------------|------------------------------------------------|
 | [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
 
+## Streaming Reads
+
+Iceberg supports processing incremental data in spark structured streaming  jobs which starts from a historical timestamp:
+
+```scala
+val spark:SparkSession = ...
+val tableIdentifier: String = ...
+
+val df = spark.readStream
+    .format("iceberg")
+    .option(SparkReadOptions.STREAM_FROM_TIMESTAMP, Long.toString(streamStartTimestamp))
+    .load(tableIdentifier)
+```
+
+The `tableIdentifier` can be:
+
+* The fully-qualified path to a HDFS table, like `hdfs://nn:8020/path/to/table`
+* A table name if the table is tracked by a catalog, like `database.table_name`
+
+!!! Note
+    Iceberg only supports read data from snapshot whose type of Data Operations is APPEND\REPLACE\DELETE. In particular if some of your snapshots are of DELETE type, you need to add 'streaming-skip-delete-snapshots' option to skip it, otherwise the task will fail.

Review comment:
       Ahh ok. Yeah I would agree that makes sense here. 👍






[GitHub] [iceberg] hililiwei commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
hililiwei commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776944036



##########
File path: site/docs/spark-structured-streaming.md
##########
@@ -24,7 +24,21 @@ As of Spark 3.0, DataFrame reads and writes are supported.
 
 | Feature support                                  | Spark 3.0| Spark 2.4  | Notes                                          |
 |--------------------------------------------------|----------|------------|------------------------------------------------|
-| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔          |                                                |
+| [DataFrame write](#writing-with-streaming-query) | ✔        | ✔         |                                                |

Review comment:
       Just for alignment. Rolled back.






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776492905



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -200,10 +201,13 @@ public void stop() {
   private boolean shouldProcess(Snapshot snapshot) {
     String op = snapshot.operation();
     Preconditions.checkState(!op.equals(DataOperations.DELETE) || skipDelete,
-        "Cannot process delete snapshot: %s", snapshot.snapshotId());
+        "Cannot process delete snapshot: %s, to ignore snapshots of type delete, set the config %s to true.",
+        snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);
     Preconditions.checkState(
         op.equals(DataOperations.DELETE) || op.equals(DataOperations.APPEND) || op.equals(DataOperations.REPLACE),
-        "Cannot process %s snapshot: %s", op.toLowerCase(Locale.ROOT), snapshot.snapshotId());
+        "Cannot process snapshot: %s, Structured Streaming does not currently support snapshots of type %s",

Review comment:
       No need to change this.






[GitHub] [iceberg] rdblue commented on a change in pull request #3749: Spark: We should probably say why we cannot process the snapshot in SparkMicroBatchStream

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3749:
URL: https://github.com/apache/iceberg/pull/3749#discussion_r776493031



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -200,10 +201,13 @@ public void stop() {
   private boolean shouldProcess(Snapshot snapshot) {
     String op = snapshot.operation();
     Preconditions.checkState(!op.equals(DataOperations.DELETE) || skipDelete,
-        "Cannot process delete snapshot: %s", snapshot.snapshotId());
+        "Cannot process delete snapshot: %s, to ignore snapshots of type delete, set the config %s to true.",

Review comment:
       I think this should be `"to ignore deletes, set %s=true"`.



