Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/22 10:09:33 UTC

[GitHub] [hudi] developerwxl opened a new issue, #5938: Why Hudi publish data size much more than the input file size when publish to hive

developerwxl opened a new issue, #5938:
URL: https://github.com/apache/hudi/issues/5938

   I use the insert operation to publish data to Hive.
   The table I insert into has 541 partitions.
   The input data size is 153 GB, but the UpsertPartitioner job shows 2.6 TB.
   Here is the code:
   <img width="1138" alt="image" src="https://user-images.githubusercontent.com/4186869/175003745-65c0767a-eb05-45e2-9314-5fd3b8e36ddf.png">
   
   
   Here is the Spark UI:
   <img width="1635" alt="image" src="https://user-images.githubusercontent.com/4186869/175002927-0c7221d8-2072-4dd1-b246-bd9b851d88a3.png">
   
   <img width="1638" alt="image" src="https://user-images.githubusercontent.com/4186869/175003033-3ca82130-78da-4955-a99d-9cc9f9efe283.png">
   
   <img width="1787" alt="image" src="https://user-images.githubusercontent.com/4186869/175003079-348f558c-7c64-40e0-aa58-4423ff1551ba.png">
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1164535154

   Hudi does small file handling, so it has to read existing data from the table to find small file groups.
   You can read more on this here:
   https://hudi.apache.org/blog/2021/03/01/hudi-file-sizing
   https://hudi.apache.org/learn/faq/#how-do-i-to-avoid-creating-tons-of-small-files
   
   If you wish to avoid this small file handling, you can set [this](https://hudi.apache.org/docs/configurations/#hoodieparquetsmallfilelimit) config to 0.
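
A minimal sketch of what setting that config to 0 could look like for a Spark datasource write. The linked anchor corresponds to the key `hoodie.parquet.small.file.limit`. The helper name `hudi_insert_options`, the table name `my_table`, and the `df` and `base_path` names in the commented-out write are hypothetical and not taken from this issue. A real write also needs record key and partition path options that this thread does not show.

```python
def hudi_insert_options(table_name, skip_small_file_handling=True):
    """Build Hudi write options for an insert (hypothetical helper)."""
    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "insert",
    }
    if skip_small_file_handling:
        # A limit of 0 means no existing file counts as "small",
        # so new records are not packed into existing file groups.
        opts["hoodie.parquet.small.file.limit"] = "0"
    return opts


options = hudi_insert_options("my_table")  # "my_table" is a placeholder

# Assumed usage with an existing Spark DataFrame `df` and table path `base_path`:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Disabling small file handling this way trades write-time cost for potentially more small files in the table, so file sizing may need to be handled separately.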
   




[GitHub] [hudi] nsivabalan commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1211356103

   @developerwxl: any update on this? If the issue is resolved, feel free to close it.




[GitHub] [hudi] nsivabalan commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1163897410

   The "Getting small files from partitions" stage reads existing data from Hudi to fetch the list of small file groups. So the size shown could reflect your Hudi table size and not your incoming data size. Is your existing Hudi table 2.6 TB in size?




[GitHub] [hudi] nsivabalan commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1229357390

   @developerwxl: any updates, please?




[GitHub] [hudi] xushiyan commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1296325168

   > The "Getting small files from partitions" stage reads existing data from Hudi to fetch the list of small file groups. So the size shown could reflect your Hudi table size and not your incoming data size. Is your existing Hudi table 2.6 TB in size?
   
   @nsivabalan has clarified the original question. Closing this due to inactivity.




[GitHub] [hudi] xushiyan closed issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
xushiyan closed issue #5938: Why Hudi publish data size much more than the input file size when publish to hive
URL: https://github.com/apache/hudi/issues/5938




[GitHub] [hudi] developerwxl commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
developerwxl commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1164136304

   So you mean that even if I use the "insert" operation, Hudi still reads the existing Hudi table data? Could we make the publish action skip reading the existing Hudi table data? I just want to add a new partition and write the new files to it.




[GitHub] [hudi] developerwxl commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

Posted by GitBox <gi...@apache.org>.
developerwxl commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1166775992

   I'm going to try it. Thanks for your comment. I will comment on the issue once I fix it.

