You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "puchengy (via GitHub)" <gi...@apache.org> on 2023/04/25 16:31:29 UTC

[GitHub] [iceberg] puchengy opened a new pull request, #7430: Allow sparksql to override target split size with session property

puchengy opened a new pull request, #7430:
URL: https://github.com/apache/iceberg/pull/7430

   The test is mostly copied from TestSparkWriteConf class.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1526918436

   > We first find matching files and then plan splits so the split size can be dynamic, we just need a good way to estimate it correctly.
   
   +1 on @aokolnychyi's suggestion on dynamic split size generation, this would help in solving the problem from both sides small table with a large split size, and large table with a small split size, since iceberg is responsible for the scan planning it's best if it's takes the call itself on how to make these splits. 
   
   plz. ref  [ticket](https://github.com/apache/iceberg/issues/7071) similar discussion above 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1528036670

   @aokolnychyi Thanks for your understanding.
   
   > Split planning is a bit different. If we support one config, will we have to support others?
   Unfortunately, that is the case. But we don't have to proactively pull in other configs if no one need.
   
   > Is there a way to set this value correctly during the migration or is the split size different for different workloads?
   Yes, there is a way, but to intelligently automate this will need more work (which is why I am trying to explore this possibility). Also higher driver memory means more resource usage, this will lead to additional layer of complexity for user education (why it is ok to bump the memory and why the cost will not be high etc). 
   
   I haven't seen a case where split size is different for different workloads, but I am not surprised if there is since in our platform, customers are allowed to set any configs they would like. 
   
   > I assume there is no shuffle in that read-write job so that AQE cannot coalesce/split tasks during writes?
   Yes, you are correct.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1526936388

Hi @aokolnychyi,

> I am not going to oppose a SQL config but I don't think we should rely on an internal SQL property for built-in file sources.
Trying understand your stance here. Do you mean you are fine with the current change but against making it correlate to "spark.sql.files.maxPartitionBytes"? If so, I am fine with that.

> Can we identify exact scenarios when the default split size performs poorly and check if we can solve the underlying problem?
I can share two scenarios, they doesn't really lead to poor performance, but it made our platform team's life harder ("harder" means making migration work more challenging).

(1) as mentioned above, when SparkSQL used to consume a Hive table with a large "spark.sql.files.maxPartitionBytes" value (for example, 1GB), changing the underlying table to Iceberg (default to 128MB split size) will immediately increase the split count by 8x (in theory), this will lead to driver memory consumption increased and cause job driver OOM.

(2) we have a strict SLA we customer, this usually mean when we perform a change to a SparkSQL job, hopefully we make sure the output are the same (number of files and size of each files). In the case of Iceberg migration, when source table is changed from Hive to Iceberg, due to the split count changes, it will directly increase the SparkSQL job output files by 8x (in theory). While we can further make a case that the increase is OK, but this is making the surface of work larger thus slower down the innovation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1539359779

   @aokolnychyi @RussellSpitzer @szehon-ho @singhpk234 @rdblue Hi there, a gentle ping on this diff. Given that we have some discussions and a general agreement on this topic already, I am wondering what we need to do proceed and merge the change? Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1528008416

   > Do you mean you are fine with the current change but against making it correlate to "spark.sql.files.maxPartitionBytes"?
   
   Yeah, I don't mind adding an Iceberg SQL property if it benefits you and other folks also support it but I would like to think through other alternatives to make sure we are not overlooking a better approach. I don't think it is a good idea to support any properties for built-in sources, though. Split planning is a bit different. If we support one config, will we have to support others?
   
   > this will lead to driver memory consumption increased and cause job driver OOM. To fix that, we will have to increase driver memory manually for that job.
   
   Is there a way to set this value correctly during the migration or is the split size different for different workloads?
   
   > To fix that, we have to add explicit coalesce to match the behavior.
   
   I assume there is no shuffle in that read-write job so that AQE cannot coalesce/split tasks during writes?
   
   Let's hear what others think. I am OK to add this property to unblock you but it would be great to explore the automatic split configuration. I created #7465 for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#discussion_r1196718483


##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -63,4 +63,7 @@ private SparkSQLProperties() {}
   // Controls the WAP branch used for write-audit-publish workflow.
   // When set, new snapshots will be committed to this branch.
   public static final String WAP_BRANCH = "spark.wap.branch";
+
+  // Controls read split size for the individual table
+  public static final String TEMPLATED_SPLIT_SIZE = "spark.sql.iceberg.%s.read-split-size";

Review Comment:
   Also, will we be able to handle quoted identifiers?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1548346603

   @rdblue @aokolnychyi @singhpk234 @szehon-ho Hi, I updated the PR as what Ryan suggested. The current implementation allows `spark.sql.iceberg.db.tbl.read-split-size` to be set for each individual table. Can some of you take a look, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1544386218

   I thought this was already supported, but I don't see it. The way we did this at Netflix was to add a table-specific property to SQL, like `spark.sql.iceberg.db.table.split-size=...`.
   
   This is something that should be set per table because it is data-specific, which is why we don't have a global option for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1526199803

   Hi @aokolnychyi I think there is legit value for this.
   
   We are migrating hundreds of Hive tables to Iceberg. Ensuring the SparkSQL consumers of these tables don't fail is our top priorities. So the SparkSQL job used to read Hive table with some "spark.sql.files.maxPartitionBytes" values will fail if the Iceberg table split size is at huge difference causing more splits to be generated causing job failures.
   
   It is even more complicated if different downstream jobs have different "spark.sql.files.maxPartitionBytes" values (I am not sure if this really happens, but in theory it could).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1555353655

   @aokolnychyi Thanks for sharing this. I would to love to go for the generic way to override a table property via SQL approach and I would love to offer the implementation, If you and @rdblue @RussellSpitzer @danielcweeks thinks this is something that community can adopt. Otherwise, we can continue the discussion.
   
   If we are fine with the approach, then my next question is I don't know a good implementation to this, so if you know, please let me know. Otherwise I can investigate and propose one.
   
   Thanks all for participating this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1528384337

   I think one use case for different split size, is the more advanced GDPR use cases (row level delete).  
   
   If you have completely optimized the job to do a broadcast join (no shuffle), you can control the number of written files  by split size.  This may be different than doing a regular read where performance is the main concern.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#discussion_r1195524292


##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -63,4 +63,7 @@ private SparkSQLProperties() {}
   // Controls the WAP branch used for write-audit-publish workflow.
   // When set, new snapshots will be committed to this branch.
   public static final String WAP_BRANCH = "spark.wap.branch";
+
+  // Controls read split size for the individual table
+  public static final String TEMPLATED_SPLIT_SIZE = "spark.sql.iceberg.%s.read-split-size";

Review Comment:
   I wonder whether we should offer a generic templated version to override table properties rather than adding a SQL  property for each table property. Let us discuss this during the community sync tomorrow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1526779646

   Can we identify exact scenarios when the default split size performs poorly and check if we can solve the underlying problem? For instance, if the scheduler is FIFO, can we use the default cluster parallelism and the size of the data to be processed to come up with an optimal split size? We first find matching files and then plan splits so the split size can be dynamic, we just need a good way to estimate it correctly.
   
   I am not going to oppose a SQL config but I don't think we should rely on an internal SQL property for built-in file sources.
   
   Thoughts, @puchengy @RussellSpitzer @szehon-ho @singhpk234 @rdblue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on a diff in pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on code in PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#discussion_r1199220899


##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -63,4 +63,7 @@ private SparkSQLProperties() {}
   // Controls the WAP branch used for write-audit-publish workflow.
   // When set, new snapshots will be committed to this branch.
   public static final String WAP_BRANCH = "spark.wap.branch";
+
+  // Controls read split size for the individual table
+  public static final String TEMPLATED_SPLIT_SIZE = "spark.sql.iceberg.%s.read-split-size";

Review Comment:
   @aokolnychyi Hi, may I know what's the outcome of the community sync discussion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1555339375

   @puchengy, we have discussed this during the sync (it is unfortunate we missed you). The consensus was to invest into an algorithm to pick the split size automatically. We also converged that just offering a single SQL property to override the split is probably not a good idea as we plan to pick the split size automatically.
   
   That said, we should discuss whether offering a generic way to override a table property via SQL would make sense. In my view, this is the most reasonable short-term solution for this problem. The only concern I have is how to handle special characters.
   
   @rdblue @RussellSpitzer @danielcweeks, could you share your thoughts? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1555339630

   I am going to drop this PR from the 1.3.0 as it, probably, needs a bit of discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "puchengy (via GitHub)" <gi...@apache.org>.

puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1544410670

   @rdblue Hi Ryan, thanks for sharing this. I can implement this if it is not supported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#discussion_r1195523383


##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java:
##########
@@ -271,4 +280,8 @@ public boolean aggregatePushDownEnabled() {
         .defaultValue(SparkSQLProperties.AGGREGATE_PUSH_DOWN_ENABLED_DEFAULT)
         .parse();
   }
+
+  private static String tableNameWithoutCatalog(String tableName) {

Review Comment:
   What if I set a value for the same table name but in another catalog?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7430: Allow sparksql to override target split size with session property

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1526129392

   Why would we want to support this property at the session level? The split size is specific to data volume, which varies from table to table. Spark built-in tables rely solely on the session config but it is not the case for Iceberg. We do have SQL properties to override what's set in the table but I am not sure this is a good use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org