Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/23 05:51:40 UTC

[GitHub] [iceberg] ajantha-bhat opened a new pull request #3796: Docs: update spark doc about incremental scan

ajantha-bhat opened a new pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796


   Some users on Slack are exploring incremental reads in Spark, and we don't have documentation for this feature. Hence this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#discussion_r777625761



##########
File path: site/docs/spark-queries.md
##########
@@ -104,6 +104,28 @@ spark.read
 
 Time travel is not yet supported by Spark's SQL syntax.
 
+### Incremental read
+
+To read incremental data between the snapshots, Configure below Spark read options:
+
+* `start-snapshot-id` Start snapshot ID used in incremental scans (exclusive)
+* `end-snapshot-id` End snapshot ID used in incremental scans (inclusive)

Review comment:
       This is optional. Omitting it will default to the current snapshot.
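
       As an illustration of the optional end bound, here is a minimal sketch. The `IncrementalReadOptions` object and its `incrementalReadOptions` helper are hypothetical names introduced for this example and are not part of Iceberg or Spark. Only the option keys `start-snapshot-id` and `end-snapshot-id` come from the diff above.

```scala
// Hypothetical helper (not an Iceberg or Spark API): builds the read options
// for an incremental scan as a plain Map.
object IncrementalReadOptions {
  // startSnapshotId is exclusive; endSnapshotId is inclusive and optional.
  def incrementalReadOptions(
      startSnapshotId: Long,
      endSnapshotId: Option[Long] = None): Map[String, String] = {
    val start = Map("start-snapshot-id" -> startSnapshotId.toString)
    // When no end bound is given, the key is left out entirely, so the scan
    // defaults to the current snapshot as described in the review comment.
    val end = endSnapshotId.map(id => "end-snapshot-id" -> id.toString).toList
    start ++ end
  }
}

// Assumed usage with an existing SparkSession named `spark`:
//   spark.read
//     .format("iceberg")
//     .options(IncrementalReadOptions.incrementalReadOptions(10963874102873L))
//     .load("path/to/table")
```

       Passing `Some(63874143573109L)` as the second argument would add the inclusive end bound shown in the PR's example.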






[GitHub] [iceberg] ajantha-bhat commented on pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#issuecomment-1003241521


   cc: @rdblue 




[GitHub] [iceberg] ajantha-bhat commented on pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#issuecomment-1004528846


   @rdblue: Thanks for the review. I have addressed the comments.




[GitHub] [iceberg] rdblue commented on a change in pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#discussion_r777624777



##########
File path: site/docs/spark-queries.md
##########
@@ -104,6 +104,28 @@ spark.read
 
 Time travel is not yet supported by Spark's SQL syntax.
 
+### Incremental read
+
+To read incremental data between the snapshots, Configure below Spark read options:
+
+* `start-snapshot-id` Start snapshot ID used in incremental scans (exclusive)
+* `end-snapshot-id` End snapshot ID used in incremental scans (inclusive)
+
+```scala
+// get the data added after start-snapshot-id (10963874102873L) till end-snapshot-id (63874143573109L)
+spark.read()
+  .format("iceberg")
+  .option("start-snapshot-id", "10963874102873")
+  .option("end-snapshot-id", "63874143573109")
+  .load("path/to/table")
+```
+
+!!! Note
+Currently gets only the data from `append` operation. Cannot support `replace`, `overwrite`, `delete` operations yet.
+Works with both V1 and V2 format-version.
+
+Incremental read is not yet supported by Spark's SQL syntax.

Review comment:
       Let's remove "yet" because it is unclear whether it will be supported.






[GitHub] [iceberg] rdblue commented on a change in pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#discussion_r777624931



##########
File path: site/docs/spark-queries.md
##########
@@ -104,6 +104,28 @@ spark.read
 
 Time travel is not yet supported by Spark's SQL syntax.
 
+### Incremental read
+
+To read incremental data between the snapshots, Configure below Spark read options:
+
+* `start-snapshot-id` Start snapshot ID used in incremental scans (exclusive)
+* `end-snapshot-id` End snapshot ID used in incremental scans (inclusive)
+
+```scala
+// get the data added after start-snapshot-id (10963874102873L) till end-snapshot-id (63874143573109L)
+spark.read()
+  .format("iceberg")
+  .option("start-snapshot-id", "10963874102873")
+  .option("end-snapshot-id", "63874143573109")
+  .load("path/to/table")
+```
+
+!!! Note
+Currently gets only the data from `append` operation. Cannot support `replace`, `overwrite`, `delete` operations yet.

Review comment:
       If you want this to be in a note box, it needs to be indented with 4 spaces.






[GitHub] [iceberg] rdblue commented on pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#issuecomment-1005042064


   Thanks, @ajantha-bhat!




[GitHub] [iceberg] rdblue commented on a change in pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#discussion_r777625163



##########
File path: site/docs/spark-queries.md
##########
@@ -104,6 +104,28 @@ spark.read
 
 Time travel is not yet supported by Spark's SQL syntax.
 
+### Incremental read
+
+To read incremental data between the snapshots, Configure below Spark read options:
+
+* `start-snapshot-id` Start snapshot ID used in incremental scans (exclusive)
+* `end-snapshot-id` End snapshot ID used in incremental scans (inclusive)
+
+```scala
+// get the data added after start-snapshot-id (10963874102873L) till end-snapshot-id (63874143573109L)
+spark.read()
+  .format("iceberg")
+  .option("start-snapshot-id", "10963874102873")
+  .option("end-snapshot-id", "63874143573109")
+  .load("path/to/table")
+```
+
+!!! Note
+Currently gets only the data from `append` operation. Cannot support `replace`, `overwrite`, `delete` operations yet.
+Works with both V1 and V2 format-version.

Review comment:
       Is this part of the note or a separate paragraph? Also, could you expand this to be a complete sentence?






[GitHub] [iceberg] rdblue commented on a change in pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#discussion_r777625300



##########
File path: site/docs/spark-queries.md
##########
@@ -104,6 +104,28 @@ spark.read
 
 Time travel is not yet supported by Spark's SQL syntax.
 
+### Incremental read
+
+To read incremental data between the snapshots, Configure below Spark read options:
+
+* `start-snapshot-id` Start snapshot ID used in incremental scans (exclusive)
+* `end-snapshot-id` End snapshot ID used in incremental scans (inclusive)
+
+```scala
+// get the data added after start-snapshot-id (10963874102873L) till end-snapshot-id (63874143573109L)

Review comment:
       Typo: "till" should be "until"






[GitHub] [iceberg] rdblue commented on a change in pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796#discussion_r777625530



##########
File path: site/docs/spark-queries.md
##########
@@ -104,6 +104,28 @@ spark.read
 
 Time travel is not yet supported by Spark's SQL syntax.
 
+### Incremental read
+
+To read incremental data between the snapshots, Configure below Spark read options:

Review comment:
       How about "To read appended data incrementally, use:"






[GitHub] [iceberg] rdblue merged pull request #3796: Docs: update spark doc about incremental scan

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3796:
URL: https://github.com/apache/iceberg/pull/3796


   

