Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/18 22:39:39 UTC

[GitHub] [hudi] rubenssoto opened a new issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

rubenssoto opened a new issue #1981:
URL: https://github.com/apache/hudi/issues/1981


   Hi, how are you?
   
   I have two tables in my data lake. The bigger one holds 300 GB of regular Parquet, and a simple count on it in Athena takes 8 seconds:
   select count(1) from table
   
   The other table is smaller (a Hudi dataset, 47 GB), yet the same simple count takes 1 minute and 37 seconds in Athena. Both tables are partitioned by date. The first table has a lot of small files and the second has one file per partition; the biggest file is 600 MB.
   
   I really don't understand why Athena performance differs so much between these tables.
   
   The first table was created by a Glue Crawler and the second by Apache Hudi. I saw some differences in the Glue Catalog:
   
   These are screenshots of the table crawled by Glue; you can see some hints, such as the row count.
   <img width="1324" alt="Captura de Tela 2020-08-18 às 19 35 40" src="https://user-images.githubusercontent.com/36298331/90572360-3807e780-e18a-11ea-9ae4-47f51fa28eb0.png">
   <img width="1439" alt="Captura de Tela 2020-08-18 às 19 36 12" src="https://user-images.githubusercontent.com/36298331/90572366-3b02d800-e18a-11ea-84d6-55621cff6f9b.png">
   
   These are screenshots of the Hudi table; the same hints are not there.
   <img width="1412" alt="Captura de Tela 2020-08-18 às 19 37 57" src="https://user-images.githubusercontent.com/36298331/90572450-6dacd080-e18a-11ea-9800-001871ee7d4f.png">
   <img width="1371" alt="Captura de Tela 2020-08-18 às 19 37 34" src="https://user-images.githubusercontent.com/36298331/90572455-71405780-e18a-11ea-9a89-c3e0afc206fa.png">
   
   
   Could this be the reason for the performance difference? And how can I solve it?
   
   Thank you so much!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678824833


   @umehrot2 @vinothchandar 
   Could the Path Filter improvements be obtained by updating some Hudi library in Presto? EMR Presto is on 0.232, and these improvements were made in 0.233.


[GitHub] [hudi] tooptoop4 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

tooptoop4 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677200382


   I use open source Presto on EC2 and find native Parquet tables much faster than Hoodie tables.


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677672123


   I ran more tests, this time on the same table; the only difference is the partitioning strategy.
   I use Athena.
   
   Table01 with regular parquet
   
   query:
   select city,origin, count(1) from 
   parquet_demand_coverage where created_date_brt >= '2020-01-01'
   group by city,origin
   order by count(1) desc
   limit 20
   
   Time to Execute: 6.19 seconds
   Table Size: 35.7 GB
   Number of Partitions: 693
   Number of Files: 916
   Partition by: day
   Data Scanned by Athena: 512 MB
   
   
   Table02 with Hudi
   
   
   query:
   select city,origin, count(1) from 
   demand_coverage where created_year_month_brt >= '2020-01-01'
   group by city,origin
   order by count(1) desc
   limit 20
   
   Time to Execute: 18.77 seconds
   Table Size: 59 GB (the size is bigger because Hudi keeps commit files, but the original size is almost the same)
   Number of Partitions: 24
   Number of Files: 124
   Data Scanned by Athena: 480 MB
   
   It's a big performance difference.
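
For reference, the ratios implied by the two runs above can be worked out in a few lines of Python. This is only a sketch over the figures quoted in this comment; the dictionary and variable names are made up for illustration.

```python
# Figures copied from the two Athena runs reported in this comment.
parquet = {"seconds": 6.19, "scanned_mb": 512, "partitions": 693, "files": 916}
hudi = {"seconds": 18.77, "scanned_mb": 480, "partitions": 24, "files": 124}

# How much longer the Hudi query ran, and how much data it scanned in comparison.
slowdown = hudi["seconds"] / parquet["seconds"]
scan_ratio = hudi["scanned_mb"] / parquet["scanned_mb"]

print(f"Hudi query time / Parquet query time: {slowdown:.2f}x")
print(f"Hudi data scanned / Parquet data scanned: {scan_ratio:.2f}x")

# Average number of files per partition for each layout.
for name, table in (("parquet", parquet), ("hudi", hudi)):
    print(f"{name}: {table['files'] / table['partitions']:.2f} files per partition")
```

The Hudi query scans slightly less data yet takes about three times as long, so the amount of data scanned does not by itself account for the gap.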


[GitHub] [hudi] nsivabalan commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-767206596


   @vinothchandar @umehrot2 : can either of you respond here regarding metadata support (RFC-15) in Athena? When can we expect it?


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677647272


   Yeah, I could try.
   
   I ran some tests. The smaller table was partitioned by day, so I repartitioned it by year-month and now have larger files. My simple count improved a lot: it used to take 1 minute and 30 seconds and now takes 17 seconds, but the count on the bigger table takes only 7 seconds.
   
   I tried on EMR, but I get this error:
   
   Query 20200820_125020_00004_h9eb5 failed: Not valid Parquet file: s3://datalake/raw/courier_api/demand_coverage/created_year_month_brt=2020-06-01/b89ad14e-8cf2-446b-934a-b27107e88e20-0_26-8-4880_20200819200116.parquet expected magic number: [80, 65, 82, 49] got: [51, -66, -112, 88] 
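
The `expected magic number: [80, 65, 82, 49]` in the error above is the ASCII string `PAR1`, which a Parquet file carries in its first and last four bytes. Below is a minimal Python sketch for checking a local copy of a file; the helper names are hypothetical, and this is not how Presto itself performs the check.

```python
import os

PARQUET_MAGIC = b"PAR1"  # bytes 80, 65, 82, 49


def as_java_signed(data: bytes) -> list:
    """Render bytes as Java signed bytes, the form used in the error message."""
    return [b - 256 if b > 127 else b for b in data]


def has_parquet_magic(path: str) -> bool:
    """Return True if the file both starts and ends with the Parquet magic."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False  # too short to hold a header and a footer magic
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


print(as_java_signed(PARQUET_MAGIC))
```

The `got` values in the message are Java signed bytes, which is why two of them are negative.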


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678008043


   I think querying through Spark SQL is not an option for us: we use Redash, which doesn't connect to Spark, and my users are not tech experts.
   
   I think the only viable option is to use EMR with Presto, but I think this is not only an Athena problem; it is a Presto problem in general.


[GitHub] [hudi] tooptoop4 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

tooptoop4 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678066269


   > I understand that recently we made changes in Presto to use `Path Filter` instead. 
   
   @umehrot2 was that fix made in PrestoSQL too, or just in PrestoDB? I heard the new EMR 6 release in September will use PrestoSQL instead of PrestoDB.


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-680284122


   @bvaradar was this problem solved in 0.6? I read that RFC-15 is experimental.
   And does Athena already support it?


[GitHub] [hudi] vinothchandar commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-675757583


   cc @bschell @umehrot2 


[GitHub] [hudi] n3nash commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

n3nash commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-824570179


   @umehrot2 Do you know when the 0.7 metadata table will be supported in Athena?


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678582396


   Q. Which version of Presto does Athena use? Is it PrestoDB or PrestoSQL?
   Ans:
   --- Athena is based on Presto (PrestoDB) 0.172 


[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-676855516


   @vinothchandar @rubenssoto I am thinking this could just be the difference between Presto's performance over regular Parquet, where it relies entirely on its native Parquet readers, and Presto's performance for Hudi, where it needs to at least use the splits/listing logic from Hoodie's Input Format. Is it possible for you to try the queries on an EMR cluster and observe the difference in performance through Presto?
   
   cc @bhasudha as well
   
   @rubenssoto have you tried cutting a ticket to AWS Support regarding this? They should at least help rule out whether it's something specific to Athena or just a performance bottleneck with Hudi.


[GitHub] [hudi] umehrot2 edited a comment on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

umehrot2 edited a comment on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-693142002


   @rubenssoto No, this is not solved in 0.6.0. RFC 15 is still under development. As @bvaradar had mentioned, it is being targeted in a 1 - 2 month timeframe.


[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677974115


   @vinothchandar `native parquet readers` are used only in `COW` use-case, but even then splits are fetched through `InputFormat` which also in the process does `listing`. For `MOR` use-case and going forward with `bootstrap`, the `native readers` will not be used and reading will happen through `record reader`.
   
   My hunch is that this is related to https://github.com/apache/hudi/issues/1829


[GitHub] [hudi] vinothchandar closed issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

vinothchandar closed issue #1981:
URL: https://github.com/apache/hudi/issues/1981


   


[GitHub] [hudi] n3nash commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

n3nash commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-853641371


   @umehrot2 gentle reminder on this one


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-680284508


   @umehrot2 


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678249155


   It's strange, @tooptoop4, because AWS officially supports Hudi, I think.


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678031526


   Don't you see a solution for this in the near future?


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678550642


   Hi guys, AWS Support answered me; it's the same topic that we discussed here.
   
   Hello,
   
   Thank you for your patience. I have heard back from the Service team, and here's why such behavior has been observed when querying Apache Hudi tables:
   
   When running 'SELECT COUNT(1)' queries on Hudi tables using HoodieParquetInputFormat, Athena has to bypass its own implementation of S3 file listing. Thus Hudi tables can be much less efficient in a query where the bottleneck is the speed at which files are listed. The Apache Hudi community is already aware of there being a performance impact caused by their S3 listing logic[1], as also has been rightly suggested on the thread you created.
   
   Further, 'SELECT COUNT(1)' queries over either format are nearly instantaneous to process on the Query Engine and measure how quickly the S3 listing completes. If you instead compare performance on more complex queries (that require meaningful work on both sides), you should see a less pronounced difference in the results.
   
   I hope this information helps. Feel free to reach out to me with any additional queries you may have on this topic. I will be glad to assist you!
   
   References:
   [1]. S3 slow file listing (Hudi) - https://github.com/apache/hudi/issues/1829 


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677988455


   Hi guys, thank you so much for helping me.
   
   I really want to use Hudi in my production environment, and I have migrated almost all my datasets to Hudi. Until now I had migrated only the smaller ones; in the last few days I started migrating the bigger ones, and I realized that with more partitions, queries are slower.
   
   Could you help me figure out which path I should follow?
   
   My plan is to use Hudi across my whole data lake to deduplicate data and control file size, and once all my datasets are migrated to Hudi I will allow my users to query them through Athena.
   
   Thank you


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-761832251


   Hey guys, will this issue be solved in version 0.7?
   
   Another question: for the improvement to take effect, will the AWS Athena folks have to make changes to Athena?
   
   Thank you
   
   @umehrot2 


[GitHub] [hudi] vinothchandar commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-855191697


   Closing this issue. 0.7 is at least in EMR now. But ultimately, this issue boils down to Athena, and we have little idea which exact PrestoDB (or Trino) version they are running. 


[GitHub] [hudi] vinothchandar commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677813495


   The Athena question could boil down to which version of Presto it's running internally. That is really for the AWS folks to answer. 
   
   But on open source Presto, I want to clear up a few things. 
   - Hudi tables do use Presto's native Parquet readers.
   - The timeline version filtering for COW/MOR RO queries adds no overhead. This has been verified at Uber many times. 
   
   


[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677977073


   I understand that recently we made changes in Presto to use `Path Filter` instead. Athena is on an older version and does not have the `Path Filter` patch in Presto. So I am not sure what kind of a difference that will have here.
   
   But even besides that, for all other use-cases like `MOR` and `Bootstrap`, which will not follow the path filter approach, we will still have to solve #1829.


[GitHub] [hudi] bvaradar commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678253508


   @rubenssoto : I am not aware of this 


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-677971682


   @vinothchandar I opened a ticket with AWS. But my perception is that the more partitions you have, the longer the query takes.
   On the same dataset, a count takes more than one minute with 600 partitions and 15 seconds with 20 partitions, but with regular Parquet I don't have this problem.
   
   I hope we can solve this problem :)


[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678000764


   @rubenssoto until this is fixed, would you be okay querying through `spark-sql` instead?
   
   Since you are using COW, you can make your spark-sql queries use spark's listing mechanism and just pass the Hoodie path filter to it. I think this is going to give you better query performance. Here is how you should start `spark-sql`:
   
   ```
   spark-sql --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter" --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
   ```
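
The settings in the command above can also be kept in one place and assembled programmatically. Below is a sketch in Python using only the standard library. The jar paths are the ones from the command above and may differ on other installations, and `build_spark_sql_command` is a hypothetical helper, not part of Hudi or Spark.

```python
import shlex

# Spark settings taken from the spark-sql command above.
SPARK_CONF = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.hadoop.mapreduce.input.pathFilter.class":
        "org.apache.hudi.hadoop.HoodieROTablePathFilter",
}

# Jar locations as given above; adjust for your own cluster.
JARS = [
    "/usr/lib/hudi/hudi-spark-bundle.jar",
    "/usr/lib/spark/external/lib/spark-avro.jar",
]


def build_spark_sql_command(conf, jars):
    """Return the spark-sql invocation as an argument list."""
    args = ["spark-sql"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args += ["--jars", ",".join(jars)]
    return args


command = build_spark_sql_command(SPARK_CONF, JARS)
print(shlex.join(command))
```

The list form can be passed to `subprocess.run(command)` on a node where `spark-sql` is installed.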


[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

rubenssoto commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678244175


   @bvaradar It's a good timeframe. Do you know if what @tooptoop4 said is true? It could be a problem.


[GitHub] [hudi] bvaradar commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-678134623


   > Don't you see a solution for this in the near future?
   
   @rubenssoto : We do. We have been working to avoid listing with the consolidated metadata (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements). This is the comprehensive way to fix the listing issues in S3. We are aiming to have this feature in the next major release, in a 1 or 2 month timeframe. 
   
   cc @prashantwason @n3nash 
   


[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

Posted by GitBox <gi...@apache.org>.

umehrot2 commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-679408763


   @rubenssoto yes, currently EMR Presto is on 0.232, but upcoming releases will ship later versions of Presto where you will be able to use this patch.
   
   If you want to manually give it a shot on the current EMR version, you can try building Presto 0.233, replacing the presto-hive jar (I believe on all nodes of the cluster), and restarting presto-server on all nodes.

