Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/19 10:32:08 UTC

[GitHub] [iceberg] hililiwei opened a new pull request, #5585: Spark: Write on a snapshot branch ref

hililiwei opened a new pull request, #5585:
URL: https://github.com/apache/iceberg/pull/5585

   ## What is the purpose of the change
   
   Spark can write data to a branch in the following ways. With the DataFrame API:
   ```
   data.write
       .format("iceberg")
       .mode("append")
       .option("branch","branchName")
       .save("db.table")
   ```
   
   With Spark SQL:
   ```
   INSERT INTO prod.db.table.to_branch_{BranchName}
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] hililiwei commented on pull request #5585: Spark: Write on a snapshot branch ref

Posted by GitBox <gi...@apache.org>.

hililiwei commented on PR #5585:
URL: https://github.com/apache/iceberg/pull/5585#issuecomment-1221202956

   > @hililiwei, I don't think that we will support that syntax in SparkSQL. The option for `DataFrameWriter` looks fine, although we generally prefer to use `DataFrameWriterV2`, because there are so many problems with the original one.
   
   @rdblue Thanks for your feedback. Is there a more recommended way to write data to the branch in Spark SQL?
   
   Branch/Tag is a very useful feature that we urgently need. Our colleagues use Spark SQL more than the DataFrame API. We have tried `table.to_branch_{branchName}` internally for branch writes. Could you please briefly explain what the main problems with it are?
   
   Thanks 
   Liwei
   
   



[GitHub] [iceberg] hililiwei commented on pull request #5585: Spark: Write on a snapshot branch ref

Posted by GitBox <gi...@apache.org>.

hililiwei commented on PR #5585:
URL: https://github.com/apache/iceberg/pull/5585#issuecomment-1221714082

   > @hililiwei, we want to introduce standard syntax for this to avoid divergence across engines:
   > 
   > ```sql
   > -- Read from a branch
   > SELECT * FROM table BRANCH branch_name
   > -- Insert into a branch
   > INSERT INTO table BRANCH branch_name SELECT ...
   > -- Merge into a branch
   > MERGE INTO table BRANCH branch_name USING ...
   > -- Read from a tag
   > SELECT * FROM table AT TAG tag_name
   > ```
   > 
   > It's very unlikely that other engines will support the multipart identifier syntax. Right now, I think we should focus on getting dataframe operations working and then move on to SQL after that.
   
   Thank you for your answer. This syntax appears to depend on engine support. The alternative is to override the engine's grammar file in Iceberg, as Hudi did in Spark 3.2 to support time travel. Otherwise, we would not be able to use it on earlier engine versions. Of course, we could customize the engine's code and use it internally, but that is not realistic for many tech teams, who tend to prefer out-of-the-box solutions.
   
   I don't know if that's the case for other teams, but we are conservative about upgrading engine versions. For example, we still use Spark 3.1 at large scale (which is why I raised many PRs to backport code to Spark 3.1), and we will keep using it for the foreseeable future because we made many internal changes to it and upgrading is not easy. As a result, the syntax could remain unavailable to us in the short term even once the engine supports it.
   
   Either way, thanks again for your answer. I will remove the SQL syntax from this PR and keep only the option part.
   
   Thank you very much.
   Liwei



[GitHub] [iceberg] zinking commented on pull request #5585: Spark: Write on a snapshot branch ref

Posted by GitBox <gi...@apache.org>.

zinking commented on PR #5585:
URL: https://github.com/apache/iceberg/pull/5585#issuecomment-1282279633

   @hililiwei I also agree we should not create divergence in SQL syntax. Why bother if we have better choices?
   
   In this case, what about using table hints to achieve what you want?



[GitHub] [iceberg] rdblue commented on pull request #5585: Spark: Write on a snapshot branch ref

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #5585:
URL: https://github.com/apache/iceberg/pull/5585#issuecomment-1220824869

   @hililiwei, I don't think that we will support that syntax in SparkSQL. The option for `DataFrameWriter` looks fine, although we generally prefer to use `DataFrameWriterV2`, because there are so many problems with the original one.
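   
   As a rough illustration, the `DataFrameWriterV2` form of the append from the PR description might look like the sketch below. It assumes that the `branch` write option proposed in this PR is also honored for `writeTo` writes, which this thread does not confirm; `BranchWriteSketch` and `appendToBranch` are hypothetical names.
   
   ```scala
   import org.apache.spark.sql.DataFrame
   
   object BranchWriteSketch {
     // Appends `data` to `table` through the DataFrameWriterV2 API.
     // Assumption: the Iceberg writer reads the "branch" option proposed in
     // this PR and commits to that branch instead of the main branch.
     def appendToBranch(data: DataFrame, table: String, branch: String): Unit =
       data.writeTo(table)
         .option("branch", branch)
         .append()
   }
   ```
   
   For example, `BranchWriteSketch.appendToBranch(data, "db.table", "branchName")` mirrors the `DataFrameWriter` call in the PR description.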



[GitHub] [iceberg] rdblue commented on pull request #5585: Spark: Write on a snapshot branch ref

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #5585:
URL: https://github.com/apache/iceberg/pull/5585#issuecomment-1221415290

   @hililiwei, we want to introduce standard syntax for this to avoid divergence across engines:
   
   ```sql
   -- Read from a branch
   SELECT * FROM table BRANCH branch_name
   -- Insert into a branch
   INSERT INTO table BRANCH branch_name SELECT ...
   -- Merge into a branch
   MERGE INTO table BRANCH branch_name USING ...
   -- Read from a tag
   SELECT * FROM table AT TAG tag_name
   ```
   
   It's very unlikely that other engines will support the multipart identifier syntax. Right now, I think we should focus on getting dataframe operations working and then move on to SQL after that.

