Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/02 09:58:04 UTC

[GitHub] [iceberg] fcvr1010 opened a new issue #3062: Report I/O metrics to Spark

fcvr1010 opened a new issue #3062:
URL: https://github.com/apache/iceberg/issues/3062


   Apologies if there's already an issue tracking this. I searched for both "metrics" and "statistics" and could not find a relevant one.
   
   I noticed that I/O statistics (e.g., output size in bytes) are missing in the Spark UI when using Iceberg tables. My setup:
   - Iceberg 0.11
   - Running on AWS with GlueCatalog and DynamoDB for locking, as per the Iceberg-AWS integration example.
   - Spark 3.0
   - Parquet data read/written with Spark 2.4-style APIs, e.g., `df.write.saveAsTable(table_name, mode=mode)`
   
   Is this expected? What would be needed in order to get such statistics? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #3062: Report I/O metrics to Spark

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #3062:
URL: https://github.com/apache/iceberg/issues/3062#issuecomment-983250944


   Thank you for the report @fcvr1010 and @kishendas!
   
   This is not unexpected behavior, depending on which FileIO implementation you're using. I believe S3FileIO doesn't emit the metrics, whereas they are emitted when using HDFS.
   
   I am a bit surprised it doesn't come up on Databricks, but that's only based on a hunch (not concrete experience). If I could talk to you a bit more, @kishendas, that would be appreciated!
   
   There is a push for more standardization around metrics coming soon (though I'm not sure exactly when), specifically to deal with issues like this.
   
   I'm not exactly sure how easily this specific issue can be addressed, as I haven't looked into the details much. It could be a simple fix, or it might not be.
   
   Perhaps others will have more insight than I can provide.




[GitHub] [iceberg] prodeezy commented on issue #3062: Report I/O metrics to Spark

Posted by GitBox <gi...@apache.org>.
prodeezy commented on issue #3062:
URL: https://github.com/apache/iceberg/issues/3062#issuecomment-991083894


   After some more investigation, we've tracked this down to Spark DSv2 API issues. Here are the relevant Spark JIRAs: [SPARK-37578](https://issues.apache.org/jira/browse/SPARK-37578) and [SPARK-37585](https://issues.apache.org/jira/browse/SPARK-37585).




[GitHub] [iceberg] danielcweeks commented on issue #3062: Report I/O metrics to Spark

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on issue #3062:
URL: https://github.com/apache/iceberg/issues/3062#issuecomment-983306313


   It's been quite a while since I looked at this (prior to Spark 3), but at the time, Spark relied entirely on Hadoop FileSystem metrics for tracking purposes. I believe we created a shim that pulls IO metrics from the S3FileIO and reports them via the Hadoop FileSystem in order to expose this information.
   
   I think it is possible to create such a shim in the Iceberg Spark project, but we need to be careful not to leak the Hadoop packages (this would mean creating a metric callback interface in the S3FileIO) so as not to introduce a Hadoop dependency.
   
   That may provide a workaround until the upstream Spark metrics framework is sorted out.
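
A minimal sketch of the callback idea described above, written in Python purely to illustrate the pattern. Every name here (`IOMetricsCallback`, `EngineCounters`, `InMemoryFileIO`) is hypothetical and is not part of Iceberg, S3FileIO, or Hadoop. The point is only that the file I/O layer depends on a small callback interface, while the engine-side adapter that implements it lives elsewhere, so the I/O layer never imports engine or Hadoop packages.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class IOMetricsCallback(Protocol):
    """Hypothetical callback interface: the only thing the I/O layer knows about."""

    def on_read(self, num_bytes: int) -> None: ...

    def on_write(self, num_bytes: int) -> None: ...


@dataclass
class EngineCounters:
    """Stand-in for the engine-side shim that would forward values to real counters."""

    bytes_read: int = 0
    bytes_written: int = 0

    def on_read(self, num_bytes: int) -> None:
        self.bytes_read += num_bytes

    def on_write(self, num_bytes: int) -> None:
        self.bytes_written += num_bytes


class InMemoryFileIO:
    """Hypothetical file I/O layer that reports byte counts through the callback."""

    def __init__(self, callback: Optional[IOMetricsCallback] = None) -> None:
        self._files: Dict[str, bytes] = {}
        self._callback = callback

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data
        if self._callback is not None:
            self._callback.on_write(len(data))

    def read(self, path: str) -> bytes:
        data = self._files[path]
        if self._callback is not None:
            self._callback.on_read(len(data))
        return data
```

With this shape, the callback is optional: an I/O layer constructed without one behaves exactly as before, and only the engine integration decides where the numbers go.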




[GitHub] [iceberg] kishendas commented on issue #3062: Report I/O metrics to Spark

Posted by GitBox <gi...@apache.org>.
kishendas commented on issue #3062:
URL: https://github.com/apache/iceberg/issues/3062#issuecomment-981921693


   We have also observed the same behavior with Iceberg 0.10 in a Databricks environment, where input/output metrics (bytesRead and bytesWritten) are not showing up.
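
One way to check the symptom outside the UI is Spark's monitoring REST API, which lists stages under `/api/v1/applications/[app-id]/stages`. The sketch below assumes the stage entries carry `inputBytes` and `outputBytes` fields (the stage-level totals, as opposed to the task-level bytesRead and bytesWritten); the helper names, base URL, and application ID are placeholders introduced for illustration.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.request import urlopen


def fetch_stages(base_url: str, app_id: str) -> List[Dict[str, Any]]:
    """Fetch the stage list from a running driver or history server (not called here)."""
    with urlopen(f"{base_url}/api/v1/applications/{app_id}/stages") as resp:
        return json.load(resp)


def total_io_bytes(stages: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Sum stage-level input and output bytes; missing fields count as zero."""
    total_in = sum(stage.get("inputBytes", 0) for stage in stages)
    total_out = sum(stage.get("outputBytes", 0) for stage in stages)
    return total_in, total_out
```

For example, `total_io_bytes(fetch_stages("http://localhost:4040", app_id))` against a live driver; totals of zero after a job that clearly read or wrote data would be consistent with the behavior reported in this issue.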




[GitHub] [iceberg] kishendas commented on issue #3062: Report I/O metrics to Spark

Posted by GitBox <gi...@apache.org>.
kishendas commented on issue #3062:
URL: https://github.com/apache/iceberg/issues/3062#issuecomment-983403586


   @kbendick and @danielcweeks, thank you for your responses. Just FYI, we use Azure Databricks, so it's Azure Data Lake Storage Gen2 and not S3FileIO.




[GitHub] [iceberg] fcvr1010 commented on issue #3062: Report I/O metrics to Spark

Posted by GitBox <gi...@apache.org>.
fcvr1010 commented on issue #3062:
URL: https://github.com/apache/iceberg/issues/3062#issuecomment-1002627516


   Thank you @danielcweeks, @prodeezy, @kbendick for the feedback (and sorry for my very late reply; we had to pause our investigation into Iceberg for a bit).
   
   If I understand correctly, [SPARK-37578](https://issues.apache.org/jira/browse/SPARK-37578) is about output metrics and should be fixed in Spark 3.3. I noticed that input metrics were missing too; I'm not sure whether this would be covered by [SPARK-37585](https://issues.apache.org/jira/browse/SPARK-37585), which seems to be about a corner case. Were you able to obtain input metrics when using S3?

