You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/01/02 12:50:00 UTC

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

    [ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467642#comment-17467642 ] 

sivabalan narayanan commented on HUDI-3066:
-------------------------------------------

[~h7kanna] : I tried to reproduce myself. If I am not wrong, you are using glob pattern to load the hudi table. In 0.8.0 we introduced efficient way to load hudi table w/o the glob path. With and without glob path makes a lot of difference locally in my case. Can you try w/o glob pattern. 

 

Instead of 
{code:java}
val hudiDf = spark.read.format("hudi").load("hudi_base_path/*/*/*/*")
{code}
can you try the below: 
{code:java}
val hudiDf = spark.read.format("hudi").load("hudi_base_path/") {code}
I assume, to use metadata, you would do something like this 
{code:java}
val hudiDf = spark.read.option("hoodie.metadata.enable","true").format("hudi").load("hudi_base_path/")  {code}
To give you a comparison, without glob path, my table loading took 28 s, while with glob path, it took 6 mins. (metadata enabled in both) 

And w/o glob path, w/ and w/o metadata gave me similar number in some cases metadata performed better. Something to keep in mind regarding metadata table. Metadata table is known to perform better in larger datasets, where S3 throttling might kick in and hence using metadata based file listing will definitely be better. But for smaller datasets, we haven't done detailed benchmarking to see where it might perform better hands down or are there scenarios where metadata may not give better results. 

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
> ------------------------------------------------------------------------------------
>
>                 Key: HUDI-3066
>                 URL: https://issues.apache.org/jira/browse/HUDI-3066
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>         Environment: EMR 6.4.0
> Hudi version : 0.10.0
>            Reporter: Harsha Teja Kanna
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: performance, pull-request-available
>             Fix For: 0.11.0
>
>         Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 PM.png, metadata_files.txt, metadata_files_compacted.txt, metadata_timeline.txt, metadata_timeline_archived.txt, metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.76_0-20-515 at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613', fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.62_0-34-377 at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362', fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590', fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613 at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519', fileLen=0}
> 2021-12-18 23:37:46,122 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362', fileLen=0}
> 2021-12-18 23:37:46,122 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663 at instant 20211217090337935
> 2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,127 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590', fileLen=0}
> 2021-12-18 23:37:46,127 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,141 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795', fileLen=0}
> 2021-12-18 23:37:46,143 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519', fileLen=0}
> 2021-12-18 23:37:46,143 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362 at instant 20211217045206861
> 2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590 at instant 20211217035058289
> 2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795', fileLen=0}
> 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519 at instant 20211217022306173
> 2021-12-18 23:37:46,162 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,170 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.87_0-35-433', fileLen=0}
> 2021-12-18 23:37:46,171 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.78_0-56-705', fileLen=0}
> 2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795 at instant 20211216184106961
> 2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of log files scanned => 437
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: MaxMemoryInBytes allowed for compaction => 1073741824
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of entries in MemoryBasedMap in ExternalSpillableMap => 165
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 259380
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
> 2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Size of file spilled to disk => 0
> 2021-12-18 23:38:35,380 INFO metadata.HoodieBackedTableMetadata: Opened 437 metadata log files (dataset instant=20211218233649435, metadata instant=20211218233649435) in 22935 ms
> 2021-12-18 23:38:37,193 INFO metadata.HoodieBackedTableMetadata: Opened 437 metadata log files (dataset instant=20211218233649435, metadata instant=20211218233649435) in 22802 ms
> Sample file listing tasks
> !Screen Shot 2021-12-18 at 6.16.29 PM.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)