
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/06 09:39:39 UTC

[GitHub] [hudi] zherenyu831 opened a new issue #1798: Question reading partition path with less level is more faster than what document mentioned

zherenyu831 opened a new issue #1798:
URL: https://github.com/apache/hudi/issues/1798


   The documentation gives this example:
   ```
   val hudiIncQueryDF = spark
        .read()
        .format("org.apache.hudi")
        .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
        .load(tablePath + "/*")
   ```
   
   Our table path looks like data/YYYY/MM/DD. When we load it as the documentation describes:
   ```
   spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*/*")
   // 4000+ files cost 60s
   scala> res8.count
   res9: Long = 313589086
   ```
   
   But when we test with one fewer wildcard level:
   ```
   spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   // 600+ files cost 10s
   scala> res10.count
   res11: Long = 313589086
   ```
   The result is the same, but `s3://test/data/*/*/*` is much faster.
   In general, the more files the path matches, the larger the difference in time becomes.
   
   Is there any concern with using a path that has fewer wildcard levels and does not reach down to the parquet files?
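
For reference, the documentation example quoted later in this thread carries a comment saying the number of wildcards must be one greater than the number of partition levels. A minimal sketch of that rule follows; `HudiGlobPath` and `globFor` are hypothetical names used only for illustration and are not part of Hudi or Spark.

```scala
object HudiGlobPath {
  // Hypothetical helper (not a Hudi API): one wildcard segment per partition
  // level, plus one more segment for the data files themselves.
  def globFor(tablePath: String, partitionLevels: Int): String = {
    require(partitionLevels >= 0, "partitionLevels must be non-negative")
    tablePath.stripSuffix("/") + ("/*" * (partitionLevels + 1))
  }

  def main(args: Array[String]): Unit = {
    // data/YYYY/MM/DD has three partition levels, so four wildcards.
    println(globFor("s3://test/data", 3)) // prints s3://test/data/*/*/*/*
  }
}
```

With three partition levels this yields the four-wildcard path from the first experiment above; the faster three-wildcard path stops at the day folders.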


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] zherenyu831 commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-657193426


   @umehrot2 @vinothchandar 
   Sorry for the late reply.
   Here is a snapshot of my Spark UI.
   
   For the first query, resolveRelation processed 950 files:
   ```
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*").count()
   ```
   
   For the second query, shown below, resolveRelation processed 4750 files:
   ```
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*/*").count()
   ```
   
   Both of them return the same result for me.
   
   <img width="1665" alt="Screenshot 2020-07-12 17 51 22" src="https://user-images.githubusercontent.com/52404525/87242515-61857300-c468-11ea-9e23-a874afed66b8.png">
   <img width="1666" alt="Screenshot 2020-07-12 17 51 28" src="https://user-images.githubusercontent.com/52404525/87242522-6a764480-c468-11ea-89f1-f865875783fe.png">
   
   



[GitHub] [hudi] umehrot2 commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-656320283


   @zherenyu831 yes, I am also confused by the difference in the number of files in the two experiments you have provided. Are both of these queries on the same dataset, with the same number of files underneath?
   
   Regardless, the listing happens internally through Spark's `parquet` data source. The only difference is that Hudi passes `HoodieROTablePathFilter` to Spark's implementation to list only the latest files. At this point I don't understand why that would cause a difference between the two queries you mentioned, but we would be happy to look into it.
   
   Can you provide a snapshot of your Spark history server showing the difference in time in Spark's listing for these two queries on the same table?
   



[GitHub] [hudi] bvaradar closed issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

bvaradar closed issue #1798:
URL: https://github.com/apache/hudi/issues/1798


   



[GitHub] [hudi] vinothchandar commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-657313523


   @zherenyu831 this seems to be caused by the contents of `.aux` being listed as well, rather than anything to do with the actual reading of data. cc @bvaradar to confirm whether we made any fixes around this recently.



[GitHub] [hudi] bvaradar commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-668661892


   https://issues.apache.org/jira/browse/HUDI-1144 to address optimization in HoodieROPathFilter



[GitHub] [hudi] vinothchandar commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-655880193


   @zherenyu831 one thing I don't understand from your original description is what you mean by 4000+ files vs 600+ files. If it's the same result, then how can the files be different when you are just loading the entire table?
   
   I suspect the filtering is happening in one case and not in the other. Your query is on the Hudi commit_time, which will be the same regardless.
   
   Can you confirm that you can do df.count() with both paths and that the result is the same?



[GitHub] [hudi] vinothchandar commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-655878475


   @umehrot2  any ideas?



[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-657193426


   @umehrot2 @vinothchandar 
   Thank you both, and sorry for the late reply.
   
   Here is a snapshot of my Spark UI.
   
   For the first query, resolveRelation processed 950 files and took 31 seconds:
   ```
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*").count()
   ```
   
   For the second query, shown below, resolveRelation processed 4750 files and took 2.5 minutes:
   ``` 
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*/*").count()
   ```
   
   Since we are using Spark streaming to write data into the table, the file sizes will have changed a little by the time the second query ran.
   
   <img width="1665" alt="Screenshot 2020-07-12 17 51 22" src="https://user-images.githubusercontent.com/52404525/87242515-61857300-c468-11ea-9e23-a874afed66b8.png">
   <img width="1666" alt="Screenshot 2020-07-12 17 51 28" src="https://user-images.githubusercontent.com/52404525/87242522-6a764480-c468-11ea-89f1-f865875783fe.png">
   
   



[GitHub] [hudi] zherenyu831 commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654575639


   @bhasudha 
   It is a very simple query:
   ```
   val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   // The s prefix is needed for Scala to interpolate the ${...} placeholders.
   val updatedDf = df.filter(s"_hoodie_commit_time between '${_hoodieCommitTimeStart}' and '${_hoodieCommitTimeEnd}'")
   ```



[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-657193426


   @umehrot2 @vinothchandar 
   Thank you both, and sorry for the late reply.
   
   Here is a snapshot of my Spark UI.
   
   For the first query, resolveRelation processed 950 files:
   ```
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*").count()
   ```
   
   For the second query, shown below, resolveRelation processed 4750 files:
   ```
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*/*").count()
   ```
   
   Since we are using Spark streaming to write data into the table, the file sizes will have changed a little by the time the second query ran.
   
   <img width="1665" alt="Screenshot 2020-07-12 17 51 22" src="https://user-images.githubusercontent.com/52404525/87242515-61857300-c468-11ea-9e23-a874afed66b8.png">
   <img width="1666" alt="Screenshot 2020-07-12 17 51 28" src="https://user-images.githubusercontent.com/52404525/87242522-6a764480-c468-11ea-89f1-f865875783fe.png">
   
   



[GitHub] [hudi] bhasudha commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654251026


   > The documentation gives this example:
   > 
   > ```
   > val hudiIncQueryDF = spark
   >      .read()
   >      .format("org.apache.hudi")
   >      .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
   >      .load(tablePath + "/*") // The number of wildcard asterisks here must be one greater than the number of partition levels
   > ```
   > 
   > Our table path looks like data/YYYY/MM/DD. When we load it as the documentation describes:
   > 
   > ```
   > spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*/*")
   > // 4000+ files cost 60s
   > scala> res8.count
   > res9: Long = 313589086
   > ```
   > 
   > But when we test with one fewer wildcard level:
   > 
   > ```
   > spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   > // 600+ files cost 10s
   > scala> res10.count
   > res11: Long = 313589086
   > ```
   > 
   > The result is the same, but `s3://test/data/*/*/*` is much faster.
   > In general, the more files the path matches, the larger the difference in time becomes.
   > 
   > Is there any concern with using a path that has fewer wildcard levels and does not reach down to the parquet files?
   
   @zherenyu831 Thanks for reaching out. Would you mind sharing the query you ran on the table?



[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654575639


   @bhasudha 
   It is a very simple query for testing
   ```
   //val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*/*")
   val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   // The s prefix is needed for Scala to interpolate the ${...} placeholders.
   val updatedDf = df.filter(s"_hoodie_commit_time between '${_hoodieCommitTimeStart}' and '${_hoodieCommitTimeEnd}'")
   ```
   We found that it spent a lot of time resolving the relation for the parquet files,
   so we ran the test I described.



[GitHub] [hudi] umehrot2 commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-658385297


   Like @bvaradar mentioned, in the first query the glob pattern matches 950 folders, which are then listed in parallel across the cluster using the Spark context. In the second query the glob pattern matches 4750 files because of the extra *, and now Spark has to list 4750 paths in parallel using the Spark context. This is most likely the cause of the performance difference. In addition, I think the time taken by **HoodieROTablePathFilter**, which is applied per file, might somehow be amplifying it further.
   
   Can you run similar test queries on a simple parquet table (non-Hudi table) and observe the performance difference in listing? I think you may see somewhat similar behavior.
   
   ```
   spark.read.parquet(globPath)
   ```
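
The folders-versus-files effect described above can be reproduced on a local filesystem. The sketch below is only a model using `java.nio`, not Spark's or Hadoop's glob code; the directory layout, `GlobDepthDemo`, `buildDemoTree`, and `expandStars` are all made up for illustration.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object GlobDepthDemo {
  // Hypothetical layout: 2020/07/{01,02}, each day holding three parquet files.
  def buildDemoTree(): Path = {
    val root = Files.createTempDirectory("glob-depth-demo")
    for (day <- Seq("01", "02"); i <- 1 to 3) {
      val dir = Files.createDirectories(root.resolve("2020").resolve("07").resolve(day))
      Files.createFile(dir.resolve(s"part-$i.parquet"))
    }
    root
  }

  // Expand a pattern made only of wildcard segments: every level keeps
  // all children of the directories matched at the previous level.
  def expandStars(root: Path, levels: Int): Seq[Path] =
    (1 to levels).foldLeft(Seq(root)) { (paths, _) =>
      paths.filter(p => Files.isDirectory(p)).flatMap { dir =>
        val stream = Files.list(dir)
        try stream.iterator().asScala.toList finally stream.close()
      }
    }

  def main(args: Array[String]): Unit = {
    val root = buildDemoTree()
    val folders = expandStars(root, 3) // three wildcards: the day folders
    val files = expandStars(root, 4)   // four wildcards: the parquet files
    println(s"3 wildcards -> ${folders.size} paths, 4 wildcards -> ${files.size} paths")
  }
}
```

In this toy layout three wildcards match the 2 day folders and four wildcards match the 6 files, so the deeper glob hands the reader three times as many root paths to list. The ratio in a real table depends on how many files each partition holds.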



[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654575639


   @bhasudha 
   It is a very simple query:
   ```
   //val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*/*")
   val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   // The s prefix is needed for Scala to interpolate the ${...} placeholders.
   val updatedDf = df.filter(s"_hoodie_commit_time between '${_hoodieCommitTimeStart}' and '${_hoodieCommitTimeEnd}'")
   ```
   We found that it spent a lot of time resolving the relation for the parquet files,
   so we ran the test I described.



[GitHub] [hudi] bvaradar commented on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-658244220


   The related code (HoodieROTablePathFilter) does not seem to have any relevant recent changes. 
   
   @zherenyu831 From the control flow, since Spark resolves the glob path, it first lists all matching entities, and this is where I think it is slower, because it tries to list files under .aux. One option to try (for experimentation) is to skip the ".hoodie" folder in the glob pattern and see if it is faster.
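
To illustrate why skipping ".hoodie" could matter, here is a small local-filesystem model of an all-wildcard glob that also descends into a dot-prefixed metadata folder. It is a sketch under assumptions: the layout and file names are invented, `SkipHoodieDemo` and its methods are hypothetical, and it does not reproduce how Spark or Hadoop expand globs.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object SkipHoodieDemo {
  // Hypothetical layout: one data partition plus four files under .hoodie/.aux.
  def buildDemoTree(): Path = {
    val root = Files.createTempDirectory("skip-hoodie-demo")
    val day = Files.createDirectories(root.resolve("2020").resolve("07").resolve("01"))
    Files.createFile(day.resolve("part-1.parquet"))
    val aux = Files.createDirectories(root.resolve(".hoodie").resolve(".aux"))
    for (i <- 1 to 4) Files.createFile(aux.resolve(s"meta-$i"))
    root
  }

  def children(dir: Path): List[Path] = {
    val stream = Files.list(dir)
    try stream.iterator().asScala.toList finally stream.close()
  }

  // Expand `levels` wildcard segments; `keepTop` decides which top-level
  // entries are followed into the deeper levels.
  def expand(root: Path, levels: Int, keepTop: Path => Boolean): List[Path] = {
    val top = children(root).filter(keepTop)
    (2 to levels).foldLeft(top) { (paths, _) =>
      paths.filter(p => Files.isDirectory(p)).flatMap(children)
    }
  }

  def main(args: Array[String]): Unit = {
    val root = buildDemoTree()
    val everything = expand(root, 3, _ => true)
    val dataOnly = expand(root, 3, p => !p.getFileName.toString.startsWith("."))
    println(s"all entries: ${everything.size}, skipping dot folders: ${dataOnly.size}")
  }
}
```

In this model the unfiltered expansion returns 5 paths (the day folder plus the four files under .aux), while skipping dot-prefixed top-level entries returns only the day folder.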



[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654575639


   @bhasudha 
   It is a very simple query for testing
   ```
   //val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*/*")
   val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   // The s prefix is needed for Scala to interpolate the ${...} placeholders.
   val updatedDf = df.filter(s"_hoodie_commit_time between '${_hoodieCommitTimeStart}' and '${_hoodieCommitTimeEnd}'")
   ```
   We found that it spent a lot of time resolving the relation for the parquet files,
   so we ran the test I described.



[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-654575639


   @bhasudha 
   It is a very simple query 
   ```
   //val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*/*")
   val df = spark.read.format("org.apache.hudi").load("s3://test/data/*/*/*")
   // The s prefix is needed for Scala to interpolate the ${...} placeholders.
   val updatedDf = df.filter(s"_hoodie_commit_time between '${_hoodieCommitTimeStart}' and '${_hoodieCommitTimeEnd}'")
   ```
   We found that it spent a lot of time resolving the relation for the parquet files,
   so we ran the test I described.



[GitHub] [hudi] zherenyu831 edited a comment on issue #1798: Question reading partition path with less level is more faster than what document mentioned

Posted by GitBox <gi...@apache.org>.

zherenyu831 edited a comment on issue #1798:
URL: https://github.com/apache/hudi/issues/1798#issuecomment-657193426


   @umehrot2 @vinothchandar 
   Sorry for the late reply.
   Here is a snapshot of my Spark UI.
   
   For the first query, resolveRelation processed 950 files:
   ```
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*").count()
   ```
   
   For the second query, shown below, resolveRelation processed 4750 files:
   ```
   spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/paylite_payment_read/orders_v6/data/*/*/*/*").count()
   ```
   
   Since we are using Spark streaming to write data into the table, the file sizes will have changed a little by the time the second query ran.
   
   <img width="1665" alt="Screenshot 2020-07-12 17 51 22" src="https://user-images.githubusercontent.com/52404525/87242515-61857300-c468-11ea-9e23-a874afed66b8.png">
   <img width="1666" alt="Screenshot 2020-07-12 17 51 28" src="https://user-images.githubusercontent.com/52404525/87242522-6a764480-c468-11ea-89f1-f865875783fe.png">
   
   

