Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/23 15:15:44 UTC

[GitHub] [hudi] ranjani1993 opened a new issue, #6775: HUDI taking longer time for update

ranjani1993 opened a new issue, #6775:
URL: https://github.com/apache/hudi/issues/6775

   Hi Team,
   
   We are trying to implement HUDI for one of the workflows in our project.
   
   The problem we are facing is that the source does not give us only the updated/changed records. We get the entire data set (unchanged + updated + new records) from the source.
   
   Example:
   
   Source table has 1 billion records per partition
   Our target HUDI table has 1 billion records per partition
   
   Out of those 1 billion records in the source, a few records were updated. We don't know which records were updated.
   
   So when we perform a HUDI upsert of the 1 billion source records against the 1 billion records in the target, HUDI takes longer than a regular overwrite operation (in which we overwrite the entire partition in the target table).
   
   We tried to optimise this by changing the index type to SIMPLE and by tuning other parallelism and Spark configs, but we could not achieve the expected result.
   
   We just wanted to check with you whether HUDI would be suitable for our use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] yihua commented on issue #6775: [Support] Is HUDI suitable for a usecase without incremental data ?

Posted by GitBox <gi...@apache.org>.

yihua commented on issue #6775:
URL: https://github.com/apache/hudi/issues/6775#issuecomment-1258120078

   @ranjani1993 Hudi also provides the INSERT_OVERWRITE write operation, which is the same as the overwrite operation on partitioned parquet tables.  Hudi can still be suitable for your use case.
   
   If you do want to use the UPSERT operation, note that since 0.11.0 the SIMPLE index is used as the default.  You may want to try BLOOM if your record key field is ordered in some way.
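
The index choice discussed above is made through a write option. A minimal sketch of what that might look like from PySpark follows; it only builds the option dictionary and does not call Spark. The table name, field names (`id`, `dt`, `updated_at`), and the helper `upsert_options` are hypothetical and chosen for illustration only.

```python
# Sketch only: builds Hudi write options for an upsert with a chosen index type.
# The helper and all table/field names are hypothetical; the option keys are
# standard Hudi datasource write configs.
ALLOWED_INDEX_TYPES = {"SIMPLE", "BLOOM", "GLOBAL_SIMPLE", "GLOBAL_BLOOM"}


def upsert_options(table_name, record_key, partition_field, precombine_field,
                   index_type="SIMPLE"):
    """Return a dict of Hudi options for an upsert using the given index type."""
    index_type = index_type.upper()
    if index_type not in ALLOWED_INDEX_TYPES:
        raise ValueError(f"unsupported index type for this sketch: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.index.type": index_type,
    }


upsert_opts = upsert_options("target_table", "id", "dt", "updated_at",
                             index_type="bloom")

# With a Spark DataFrame `df` and the Hudi bundle available, the options
# would typically be passed like this (not executed here):
# df.write.format("hudi").options(**upsert_opts).mode("append").save(base_path)
```

Switching between SIMPLE and BLOOM then becomes a one-argument change, which makes it easier to time both on the same partition.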



[GitHub] [hudi] ranjani1993 commented on issue #6775: [Support] Is HUDI suitable for a usecase with no incremental data from source?

Posted by GitBox <gi...@apache.org>.

ranjani1993 commented on issue #6775:
URL: https://github.com/apache/hudi/issues/6775#issuecomment-1261713740

   In this scenario we would just want to blindly overwrite the target partition without the indexing overhead.
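
For a blind partition overwrite of this kind, the INSERT_OVERWRITE operation mentioned earlier in the thread is selected through `hoodie.datasource.write.operation`. The sketch below only assembles the options; the helper `insert_overwrite_options` and the table and field names are hypothetical, and whether a precombine field is required depends on the Hudi version in use.

```python
# Sketch only: Hudi write options for replacing the partitions present in the
# incoming batch, instead of upserting record by record.
# Helper name, table name, and field names are hypothetical.
def insert_overwrite_options(table_name, record_key, partition_field,
                             precombine_field):
    """Return a dict of Hudi options for an insert_overwrite write."""
    return {
        "hoodie.table.name": table_name,
        # insert_overwrite rewrites the partitions found in the incoming data.
        "hoodie.datasource.write.operation": "insert_overwrite",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


overwrite_opts = insert_overwrite_options("target_table", "id", "dt",
                                          "updated_at")

# Assumed usage with a Spark DataFrame `df` (not executed here). Hudi's
# examples pair this operation with save mode "append"; Spark's "overwrite"
# save mode would replace the whole table rather than just these partitions.
# df.write.format("hudi").options(**overwrite_opts).mode("append").save(base_path)
```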



[GitHub] [hudi] ranjani1993 closed issue #6775: [Support] Is HUDI suitable for a usecase with no incremental data from source?

Posted by GitBox <gi...@apache.org>.

ranjani1993 closed issue #6775: [Support] Is HUDI suitable for a usecase with no incremental data from source?
URL: https://github.com/apache/hudi/issues/6775



[GitHub] [hudi] ranjani1993 commented on issue #6775: [Support] Is HUDI suitable for a usecase with no incremental data from source?

Posted by GitBox <gi...@apache.org>.

ranjani1993 commented on issue #6775:
URL: https://github.com/apache/hudi/issues/6775#issuecomment-1259635055

   @yihua 
   Our record key is not ordered. We used the "SIMPLE" index instead of "BLOOM". SIMPLE gave better performance than BLOOM in our use case.
   
   **run time statistics:**
   SIMPLE INDEX - 45 mins to update single partition
   BLOOM index - more than 2 hours to update single partition
   HUDI Bulk insert - 15 mins to load single partition
   Regular insert overwrite - 15 mins to load single partition
   
   Can the HUDI "upsert" operation provide better performance than HUDI bulk insert or a regular insert overwrite in this scenario?
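
To put the reported run times side by side, the short calculation below expresses each one as a multiple of the 15-minute regular insert overwrite. The BLOOM figure was reported as "more than 2 hours", so 120 minutes is used as a lower bound; the dictionary labels are illustrative.

```python
# Run times in minutes, taken from the statistics reported above.
# "More than 2 hours" for BLOOM is represented by its lower bound of 120.
BASELINE_MINUTES = 15  # regular insert overwrite of a single partition

runtimes_minutes = {
    "SIMPLE index upsert": 45,
    "BLOOM index upsert (lower bound)": 120,
    "HUDI bulk insert": 15,
}

# Slowdown factor of each run relative to the overwrite baseline.
slowdown = {name: minutes / BASELINE_MINUTES
            for name, minutes in runtimes_minutes.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.1f}x the regular overwrite time")
```

On these numbers the SIMPLE-index upsert takes three times as long as the overwrite, and the BLOOM-index upsert at least eight times as long.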
   
   

