
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/21 00:04:58 UTC

[GitHub] [hudi] rahil-c opened a new pull request, #6154: Disable EmrFS file metadata caching and EMR Spark's data prefetcher f…

rahil-c opened a new pull request, #6154:
URL: https://github.com/apache/hudi/pull/6154

   …eature
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] hudi-bot commented on pull request #6154: Disable EmrFS file metadata caching and EMR Spark's data prefetcher f…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6154:
URL: https://github.com/apache/hudi/pull/6154#issuecomment-1190920270

   ## CI report:
   
   * 473be87aa5d71939c2e8a367851b0e3b96744bc0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10114) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] rahil-c commented on pull request #6154: Disable EmrFS file metadata caching and EMR Spark's data prefetcher f…

Posted by GitBox <gi...@apache.org>.

rahil-c commented on PR #6154:
URL: https://github.com/apache/hudi/pull/6154#issuecomment-1191700735

   Azure CI IT seems to be flaky: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10114&view=results



[GitHub] [hudi] vinothchandar commented on a diff in pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on code in PR #6154:
URL: https://github.com/apache/hudi/pull/6154#discussion_r928788187


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -56,6 +56,9 @@ class DefaultSource extends RelationProvider
       // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
       spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")
     }
+    // Revisit EMR Spark and EMRFS incompatibilities, for now disable
+    spark.conf.set("spark.sql.dataPrefetch.enabled", "false")
+    spark.sparkContext.hadoopConfiguration.set("fs.s3.metadata.cache.expiration.seconds", "0")

Review Comment:
   Instead of hard-coding these values, shall we also allow overriding them using `--conf`? I wonder if it's better to do this in the EMR Hudi configs rather than in code.
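
   One way to honor such an override is to apply each value only when nothing has been set for its key. The following is a minimal sketch of that idea, not the code that was merged: `EmrDefaults`, `setIfAbsent`, and `applyEmrDefaults` are hypothetical names, and the configuration stores are modeled as plain getter/setter functions so the snippet runs without Spark on the classpath. Only the two keys and their values are taken from the diff above.

```scala
// Sketch only: EmrDefaults and its helpers are hypothetical, not part of Hudi.
object EmrDefaults {
  // Keys and values as they appear in the diff above.
  val PrefetchKey = "spark.sql.dataPrefetch.enabled"
  val MetadataCacheKey = "fs.s3.metadata.cache.expiration.seconds"

  /** Writes `default` for `key` only when the lookup finds no existing value. */
  def setIfAbsent(get: String => Option[String],
                  set: (String, String) => Unit,
                  key: String,
                  default: String): Unit =
    if (get(key).isEmpty) set(key, default)

  /** Applies both defaults against separate Spark-side and Hadoop-side stores. */
  def applyEmrDefaults(sparkGet: String => Option[String],
                       sparkSet: (String, String) => Unit,
                       hadoopGet: String => Option[String],
                       hadoopSet: (String, String) => Unit): Unit = {
    setIfAbsent(sparkGet, sparkSet, PrefetchKey, "false")
    setIfAbsent(hadoopGet, hadoopSet, MetadataCacheKey, "0")
  }
}
```

   With a real session, the Spark-side functions could be backed by `spark.conf.getOption` and `spark.conf.set`, and the Hadoop-side ones by `k => Option(hadoopConf.get(k))` and `hadoopConf.set`, where `hadoopConf` stands for `spark.sparkContext.hadoopConfiguration`. One caveat: if the cluster's own defaults already define a key, this check cannot tell that apart from a user override, so the approach may not be sufficient on its own.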




[GitHub] [hudi] hudi-bot commented on pull request #6154: Disable EmrFS file metadata caching and EMR Spark's data prefetcher f…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6154:
URL: https://github.com/apache/hudi/pull/6154#issuecomment-1191055392

   ## CI report:
   
   * 473be87aa5d71939c2e8a367851b0e3b96744bc0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10114) 
   



[GitHub] [hudi] rahil-c commented on pull request #6154: Disable EmrFS file metadata caching and EMR Spark's data prefetcher f…

Posted by GitBox <gi...@apache.org>.

rahil-c commented on PR #6154:
URL: https://github.com/apache/hudi/pull/6154#issuecomment-1191075870

   cc @umehrot2 @zhedoubushishi 



[GitHub] [hudi] umehrot2 commented on a diff in pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on code in PR #6154:
URL: https://github.com/apache/hudi/pull/6154#discussion_r927913811


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -56,6 +56,9 @@ class DefaultSource extends RelationProvider
       // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
       spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")
     }
+    // Revisit EMR Spark and EMRFS incompatibilities, for now disable
+    spark.conf.set("spark.sql.dataPrefetch.enabled", "false")
+    spark.sparkContext.hadoopConfiguration.set("fs.s3.metadata.cache.expiration.seconds", "0")

Review Comment:
   @codope Yes, the property is specific to `EmrFS` and it's not present in open source implementations. This is a temporary fix to make sure Hudi 0.12 works out of the box in the EMR environment. The property should act as a no-op in other environments.
   
   We are also working internally on a fix on the EmrFS side, and should be able to get rid of this in the next release. The change was tested using EMR's internal integration tests: the current Hudi master was breaking several of our tests, and with this fix we get past them.




[GitHub] [hudi] rahil-c commented on pull request #6154: Disable EmrFS file metadata caching and EMR Spark's data prefetcher f…

Posted by GitBox <gi...@apache.org>.

rahil-c commented on PR #6154:
URL: https://github.com/apache/hudi/pull/6154#issuecomment-1190889612

   cc @umehrot2 



[GitHub] [hudi] hudi-bot commented on pull request #6154: Disable EmrFS file metadata caching and EMR Spark's data prefetcher f…

Posted by GitBox <gi...@apache.org>.

hudi-bot commented on PR #6154:
URL: https://github.com/apache/hudi/pull/6154#issuecomment-1190893130

   ## CI report:
   
   * 473be87aa5d71939c2e8a367851b0e3b96744bc0 UNKNOWN
   



[GitHub] [hudi] zhedoubushishi merged pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

zhedoubushishi merged PR #6154:
URL: https://github.com/apache/hudi/pull/6154



[GitHub] [hudi] umehrot2 commented on pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on PR #6154:
URL: https://github.com/apache/hudi/pull/6154#issuecomment-1194484290

   Jira for reverting this by Hudi 0.13 => https://issues.apache.org/jira/browse/HUDI-4470



[GitHub] [hudi] codope commented on a diff in pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

codope commented on code in PR #6154:
URL: https://github.com/apache/hudi/pull/6154#discussion_r927021771


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -56,6 +56,9 @@ class DefaultSource extends RelationProvider
       // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
       spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")
     }
+    // Revisit EMR Spark and EMRFS incompatibilities, for now disable
+    spark.conf.set("spark.sql.dataPrefetch.enabled", "false")
+    spark.sparkContext.hadoopConfiguration.set("fs.s3.metadata.cache.expiration.seconds", "0")

Review Comment:
   Shouldn't this be guarded by a Hudi config, since it is an issue on EmrFS?
   Also, I did not find any documentation for the `fs.s3.metadata.cache.expiration.seconds` config in https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/s3guard.html
   Have we tested this change?
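
   A guard of the kind asked about here could look roughly like the sketch below. It is an illustration under stated assumptions, not a proposal from this thread: the option name `hoodie.emr.workaround.enabled` and the `EmrWorkaroundGuard` object are invented for the example, and the guard is assumed to default to enabled so that behavior stays unchanged unless a user opts out. The two settings it returns are the ones from the diff above.

```scala
// Sketch only: the guard key below is hypothetical; Hudi defines no such option in this thread.
object EmrWorkaroundGuard {
  val GuardKey = "hoodie.emr.workaround.enabled"

  /** Returns the settings to apply, or an empty map when the guard is switched off. */
  def settingsFor(options: Map[String, String]): Map[String, String] = {
    // An absent key counts as enabled; only an explicit non-"true" value disables the workaround.
    val enabled = options.get(GuardKey).forall(_.trim.equalsIgnoreCase("true"))
    if (enabled)
      Map(
        "spark.sql.dataPrefetch.enabled" -> "false",
        "fs.s3.metadata.cache.expiration.seconds" -> "0")
    else
      Map.empty[String, String]
  }
}
```

   The caller would then write the first returned entry to the Spark session conf and the second to the Hadoop configuration, as the diff does today.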




[GitHub] [hudi] umehrot2 commented on a diff in pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on code in PR #6154:
URL: https://github.com/apache/hudi/pull/6154#discussion_r929195103


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -56,6 +56,9 @@ class DefaultSource extends RelationProvider
       // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
       spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")
     }
+    // Revisit EMR Spark and EMRFS incompatibilities, for now disable
+    spark.conf.set("spark.sql.dataPrefetch.enabled", "false")
+    spark.sparkContext.hadoopConfiguration.set("fs.s3.metadata.cache.expiration.seconds", "0")

Review Comment:
   Well, the only reason we did this is to reduce the noise for customers, who would otherwise have to pass additional configurations just to make things work on EMR. We cannot store these in the EMR Hudi configs because, as of now, the global Hudi confs that we support only work for Hudi-related configurations. We cannot pass Spark/Hadoop configs in them.
   
   If you guys have concerns about this, we can revert it and instead state in the documentation that customers should explicitly pass these settings when running the open source bundle on EMR. It's just that it is not a good experience.
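
   For reference, the documentation-only alternative described above would have users pass the two settings themselves. The sketch below renders them as `spark-submit` arguments. It assumes the standard Spark convention that a key prefixed with `spark.hadoop.` is copied into the Hadoop configuration; `ExplicitEmrSettings` and `toSubmitArgs` are hypothetical names used only for this illustration.

```scala
// Sketch only: ExplicitEmrSettings is a hypothetical helper, not part of Hudi or EMR.
object ExplicitEmrSettings {
  // The Spark SQL setting is passed as-is. The Hadoop setting carries the
  // "spark.hadoop." prefix so that Spark forwards it to the Hadoop configuration.
  val settings: Seq[(String, String)] = Seq(
    "spark.sql.dataPrefetch.enabled" -> "false",
    "spark.hadoop.fs.s3.metadata.cache.expiration.seconds" -> "0")

  /** Renders each pair as the two arguments `--conf` and `key=value`. */
  def toSubmitArgs(pairs: Seq[(String, String)]): Seq[String] =
    pairs.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    println(toSubmitArgs(settings).mkString(" "))
}
```

   Running `main` prints the flags on one line, ready to paste into a `spark-submit` command.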




[GitHub] [hudi] codope commented on a diff in pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

codope commented on code in PR #6154:
URL: https://github.com/apache/hudi/pull/6154#discussion_r929488952


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -56,6 +56,9 @@ class DefaultSource extends RelationProvider
       // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
       spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")
     }
+    // Revisit EMR Spark and EMRFS incompatibilities, for now disable
+    spark.conf.set("spark.sql.dataPrefetch.enabled", "false")
+    spark.sparkContext.hadoopConfiguration.set("fs.s3.metadata.cache.expiration.seconds", "0")

Review Comment:
   Let's prioritize user experience and keep it in this release for now. @umehrot2 already created a tracking issue to revert it in 0.13.0 HUDI-4470




[GitHub] [hudi] codope commented on a diff in pull request #6154: [HUDI-4434] Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

Posted by GitBox <gi...@apache.org>.

codope commented on code in PR #6154:
URL: https://github.com/apache/hudi/pull/6154#discussion_r929487909


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -56,6 +56,9 @@ class DefaultSource extends RelationProvider
       // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
       spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")
     }
+    // Revisit EMR Spark and EMRFS incompatibilities, for now disable
+    spark.conf.set("spark.sql.dataPrefetch.enabled", "false")
+    spark.sparkContext.hadoopConfiguration.set("fs.s3.metadata.cache.expiration.seconds", "0")

Review Comment:
   Let's prioritize user experience and keep it in this release for now. Tracking the removal of these configs after proper fix in HUDI-4473


