Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/23 09:54:18 UTC

[GitHub] [hudi] christoph-wmt opened a new issue #1758: [SUPPORT]

christoph-wmt opened a new issue #1758:
URL: https://github.com/apache/hudi/issues/1758


   **Describe the problem you faced**
   
   We are using Spark to write Hudi tables to ADLSv2 and GCS. For append tables, each batch takes longer to complete as more partitions are added. The actual work of writing data stays the same, but the driver spends a disproportionately growing amount of time building an InMemoryFileIndex. This becomes an obvious problem after only 2 days' worth of hourly partitions, i.e. 48 directories and very little actual data. It takes a few seconds to do the work and several minutes to build the InMemoryFileIndex recursively (see stacktrace below).
   
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Spark job that writes to a partitioned append table on ADLSv2 or GCS
   2. Let the job ingest data for several hours
   3. Observe the time spent on the driver while the executors are idling
   4. Take thread dumps to observe the stack below
   
   **Expected behavior**
   
   My expectation is that the cost of a write does not grow with the number of partitions in the target table.
   The job only writes to a constant, small number of recent partitions. I would not expect the write operation to have to build an index of all the table's partitions.
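
To make the expectation concrete, here is a toy sketch (plain Python on a local temp directory, not Spark or Hudi code) that counts directory-listing calls for a recursive listing of the whole table versus listing only the partitions a batch touches. The `hour=NNNN` layout and all function names are hypothetical, chosen only for illustration; on an object store each listing would typically be a remote request, which is presumably why the growth hurts.

```python
import os
import tempfile


def make_table(root, num_partitions, files_per_partition=2):
    """Create hypothetical hourly partition directories with a few tiny files each."""
    for i in range(num_partitions):
        part = os.path.join(root, f"hour={i:04d}")
        os.makedirs(part)
        for j in range(files_per_partition):
            with open(os.path.join(part, f"part-{j}.parquet"), "w") as f:
                f.write("x")


def count_list_calls_full(root):
    """Recursive listing of the whole table: one listing per directory visited."""
    return sum(1 for _ in os.walk(root))


def count_list_calls_recent(root, recent):
    """List only the named partitions (the ones the batch writes to)."""
    calls = 0
    for name in recent:
        os.listdir(os.path.join(root, name))
        calls += 1
    return calls


def demo(num_partitions, recent_count=2):
    """Return (full-listing calls, recent-only calls) for a freshly built table."""
    with tempfile.TemporaryDirectory() as root:
        make_table(root, num_partitions)
        # The writer already knows which partitions it targets, so no listing is needed to find them.
        recent = [f"hour={i:04d}" for i in range(num_partitions - recent_count, num_partitions)]
        return count_list_calls_full(root), count_list_calls_recent(root, recent)


if __name__ == "__main__":
    for n in (48, 96):
        full, recent = demo(n)
        print(f"{n} partitions: full listing = {full} calls, recent only = {recent} calls")
```

In this sketch the full listing grows linearly with the partition count while the recent-only listing stays constant, which is the behavior the write path is expected to have.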
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4-1.0.5
   
   * Hive version : 
   
   * Hadoop version : 2.7
   
   * Storage (HDFS/S3/GCS..) : ADLSv2 and GCS
   
   * Running on Docker? (yes/no) : yes and no, both
   
   
   **Additional context**
   
   
   **Stacktrace**
   
   org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:263)
   org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:130)
   org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
   org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:70)
   org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$createInMemoryFileIndex(DataSource.scala:585)
   org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:415)
   org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:81)
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] christoph-wmt commented on issue #1758: [SUPPORT] building InMemoryFileIndex slow with increase target table partitions

Posted by GitBox <gi...@apache.org>.
christoph-wmt commented on issue #1758:
URL: https://github.com/apache/hudi/issues/1758#issuecomment-648045079


   Sorry, I realize this might be a duplicate of https://github.com/apache/hudi/issues/1552
   I'll build off master and give it a shot.





[GitHub] [hudi] vinothchandar commented on issue #1758: [SUPPORT] building InMemoryFileIndex slow with increase target table partitions

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1758:
URL: https://github.com/apache/hudi/issues/1758#issuecomment-648872791


   @christoph-wmt Good to know.. My guess is EMR picked up 0.5.0 even though you put in 0.5.3?
   
   We are all working on the 0.6.0 release, where all of this is going to be much better.





[GitHub] [hudi] christoph-wmt commented on issue #1758: [SUPPORT] building InMemoryFileIndex slow with increase target table partitions

Posted by GitBox <gi...@apache.org>.
christoph-wmt commented on issue #1758:
URL: https://github.com/apache/hudi/issues/1758#issuecomment-648698394


   Oh really? My assumption was that it would only make it into 0.6.0. It turns out we are on 0.5.0, and adding the above fix to our build resolved the issue. We still spend significant time on the driver in every batch, now in the region of tens of seconds rather than minutes, and we'll dig deeper into that. The immediate issue is resolved though - thank you!





[GitHub] [hudi] vinothchandar commented on issue #1758: [SUPPORT] building InMemoryFileIndex slow with increase target table partitions

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1758:
URL: https://github.com/apache/hudi/issues/1758#issuecomment-648509194


   0.5.3 has the fix already IIUC.. So please let me know how things go on master.. 





[GitHub] [hudi] christoph-wmt closed issue #1758: [SUPPORT] building InMemoryFileIndex slow with increase target table partitions

Posted by GitBox <gi...@apache.org>.
christoph-wmt closed issue #1758:
URL: https://github.com/apache/hudi/issues/1758


   

