Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/06 10:30:46 UTC

[GitHub] [hudi] gunjdesai opened a new issue, #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

gunjdesai opened a new issue, #6610:
URL: https://github.com/apache/hudi/issues/6610

   **Environment Description**
   
   * Hudi version : `0.11.0`
   
   * Spark version : `3.2.0`
   
   * Hive Metastore version : `3.1.0`
   
   * Storage (HDFS/S3/GCS..) : `Minio`
   
   * Running on Docker? (yes/no) : `yes`
   
   Hi Folks,
   We are using Hudi via Spark to push data into Trino. We started the pipeline recently and data accuracy is as expected.
   Since we also want to backfill older data, we are pushing it into the same topic as the realtime data.
   This works up to a point, but once the topic grows large enough, the **_tagging task runs on only one thread at a time, takes more than 12 hours, and eventually fails_**.
   
   Can you suggest a better approach to backfilling that does not require shutting down the real-time pipelines?
   
   Any guidance is deeply appreciated.
   




[GitHub] [hudi] gunjdesai commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

Posted by GitBox <gi...@apache.org>.
gunjdesai commented on issue #6610:
URL: https://github.com/apache/hudi/issues/6610#issuecomment-1237974275

   Sharing images of task status for reference
   <img width="1791" alt="Screenshot 2022-09-06 at 4 05 04 PM" src="https://user-images.githubusercontent.com/7438622/188614309-da5900dc-b2a2-4332-aeb8-a44204633c0d.png">
   <img width="1792" alt="Screenshot 2022-09-06 at 4 05 34 PM" src="https://user-images.githubusercontent.com/7438622/188614338-767d1f5e-47a6-4da3-9006-fd70d6738cf8.png">
   
   




[GitHub] [hudi] gunjdesai closed issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

Posted by GitBox <gi...@apache.org>.
gunjdesai closed issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines
URL: https://github.com/apache/hudi/issues/6610




[GitHub] [hudi] gunjdesai commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

Posted by GitBox <gi...@apache.org>.
gunjdesai commented on issue #6610:
URL: https://github.com/apache/hudi/issues/6610#issuecomment-1247777136

   @xushiyan yes, this is a Spark Structured Streaming job. We run it on K8s spot instances, and there are cases where the driver gets evicted, so we can't use the multi-writer approach as it can mess with the locks.
   Yes, the job does scale up when backfill traffic increases.
   
   The original idea was to avoid stopping the real-time pipeline while backfilling, but I think our setup won't allow that.
   
   On further reading, I was thinking of stopping the real-time pipeline, doing a **_bulk_insert_** into the table, and then restarting the real-time pipeline in **_upsert_** mode.
   Would you say this is a good approach?
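   For reference, a minimal sketch of that flow with the Spark DataSource writer; the paths, table name, and key fields below are placeholders rather than values from our actual job:
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("hudi-backfill-bulk-insert").getOrCreate()
   
   # One-off backfill: stop the streaming writer, bulk_insert the historical data,
   # then restart the streaming writer in upsert mode.
   backfill_df = spark.read.parquet("s3a://warehouse/raw/historical/")  # placeholder source
   
   hudi_opts = {
       "hoodie.table.name": "events",                        # placeholder table name
       "hoodie.datasource.write.recordkey.field": "id",      # placeholder record key
       "hoodie.datasource.write.precombine.field": "ts",     # placeholder precombine field
       "hoodie.datasource.write.partitionpath.field": "dt",  # placeholder partition field
   }
   
   # bulk_insert skips the index lookup (tagging) step, so it avoids the long tagging stage.
   (backfill_df.write.format("hudi")
       .options(**hudi_opts)
       .option("hoodie.datasource.write.operation", "bulk_insert")
       .mode("append")
       .save("s3a://warehouse/hudi/events"))  # placeholder table path
   
   # When the real-time pipeline is restarted, it writes with:
   #   .option("hoodie.datasource.write.operation", "upsert")
   ```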




[GitHub] [hudi] xushiyan commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #6610:
URL: https://github.com/apache/hudi/issues/6610#issuecomment-1247472941

   Is this a Spark streaming job you're running? Does it scale accordingly when backfill traffic spikes? The OOM also hints that you may need to tune the Spark configs, like `spark.memory.fraction` and `spark.memory.storageFraction`, to give execution more memory.
   Looks like record order does not matter here, since you pump everything into the same topic. Why not start a batch job just for the backfill? That's how backfill jobs are usually run.
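   A minimal sketch of those memory settings on the session builder; the values are illustrative, not tuned recommendations:
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = (SparkSession.builder
       .appName("hudi-backfill")
       # Fraction of the heap used for execution + storage (default 0.6).
       .config("spark.memory.fraction", "0.7")
       # Share of that fraction reserved for storage; lowering it leaves more room for execution (default 0.5).
       .config("spark.memory.storageFraction", "0.3")
       .getOrCreate())
   ```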




[GitHub] [hudi] gunjdesai commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

Posted by GitBox <gi...@apache.org>.
gunjdesai commented on issue #6610:
URL: https://github.com/apache/hudi/issues/6610#issuecomment-1238051026

   On further inspection, I found that the query keeps picking up more rows and fails with an OOM once the row volume exceeds the available memory.
   
   <img width="679" alt="Screenshot 2022-09-06 at 5 24 57 PM" src="https://user-images.githubusercontent.com/7438622/188628948-7ff0132b-4c8d-4e42-92a1-5e3d8ad69747.png">
   
   At the 10 min interval, the number of output rows is at 1.6M
   <img width="1792" alt="Screenshot 2022-09-06 at 5 25 27 PM" src="https://user-images.githubusercontent.com/7438622/188629007-7488ba83-1e98-4126-ba52-38290d2bfac6.png">
   
   At the 12 min interval, the number of output rows is at 2.1M
   <img width="528" alt="Screenshot 2022-09-06 at 5 28 22 PM" src="https://user-images.githubusercontent.com/7438622/188629212-9035e64a-01d2-4cd3-8544-caadf815cd16.png">
   
   The processed output keeps increasing, even beyond 23M rows.




[GitHub] [hudi] gunjdesai commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

Posted by GitBox <gi...@apache.org>.
gunjdesai commented on issue #6610:
URL: https://github.com/apache/hudi/issues/6610#issuecomment-1278758071

   This issue was resolved by setting the `maxOffsetsPerTrigger` option on the Kafka source; without it, the batch size kept growing until we hit an OOM.
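   For reference, a minimal sketch of where the option is set on the Structured Streaming Kafka source; the bootstrap servers, topic, and cap value are placeholders:
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("hudi-streaming-ingest").getOrCreate()
   
   stream_df = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
       .option("subscribe", "events")                      # placeholder topic
       .option("startingOffsets", "earliest")
       # Upper bound on records read per micro-batch across all partitions,
       # so the backlog cannot be pulled into one huge batch.
       .option("maxOffsetsPerTrigger", 500000)             # illustrative cap
       .load())
   ```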


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org