Posted to commits@hudi.apache.org by "pushpavanthar (via GitHub)" <gi...@apache.org> on 2023/01/27 10:19:20 UTC

[GitHub] [hudi] pushpavanthar commented on issue #7757: [SUPPORT] missing records when HoodieDeltaStreamer run in continuous mode

pushpavanthar commented on issue #7757:
URL: https://github.com/apache/hudi/issues/7757#issuecomment-1406298898

   Thanks for looking into this issue @codope. Below is a brief explanation of the points you mentioned; I hope it throws more light on our setup.
   1. The data in the raw table is written by the s3-sink connector, which rolls files every 15 minutes and partitions by a date derived from the Kafka metadata timestamp. I'm checking the count of unique primary keys per **created_at hour** (at a minimum, the number of create records should match) for the last 3 days, excluding the current hour (to avoid inconsistencies in the current hour due to the differing nature of the two pipelines). I still scan a buffer of 7+ days to account for outliers when comparing the last 3 days of data.
   2. We have provided sufficient resources for this pipeline and are constantly monitoring for lag; we haven't noticed anything strange w.r.t. the application or cluster. Similar to your observation, I suspect the `hoodie.deltastreamer.source.kafka.enable.commit.offset: true` config, which lets Kafka consumer groups manage offsets. There might be a situation where consumer offsets are committed to Kafka and some failure later in the cycle triggers a rollback. The next `deltasync` cycle would then pick up from the next set of offsets and hence miss the entire batch of old records.
   I'll try running a few pipelines with this config disabled.
   3. For now I've replayed the events to correct the inconsistencies, since they impact our reports. I have seen similar issues in the past on other tables and will do this analysis when I come across the issue again.
   4. The count of unique records across the entire table is lower in the Hudi table than in the raw table. To dig deeper, I ran a query for an hourly comparison.
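   The offset-handling suspicion in point 2 comes down to a config change. Below is an illustrative properties fragment, a sketch only: the property names are the usual DeltaStreamer Kafka source configs, but they should be verified against the Hudi version in use.
   
   ```properties
   # Let DeltaStreamer track consumed offsets in Hudi commit metadata (checkpoints)
   # instead of committing them back to the Kafka consumer group. With this set to
   # false, a rolled-back commit is retried from the same offsets on the next
   # deltasync cycle, so a failed batch is not silently skipped.
   hoodie.deltastreamer.source.kafka.enable.commit.offset=false
   
   # On a fresh start with no checkpoint, control where consumption begins.
   auto.offset.reset=earliest
   ```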
   
   Regarding the notes on configurations:
   1. Verified that all records have unique `id`s. The hourly distinct count of `id`s on the raw table matches the source DB but doesn't match the Hudi table.
   2. We apply a transformation to drop `__op` and `__source_ts_ms` and explicitly set `_hoodie_is_deleted` to false, to make sure we also retain deleted records.
   3. Will try out disabling dynamic allocation.
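   For reference, the hourly comparison described above can be expressed as a query along these lines. The table and column names (`raw_table`, `hudi_table`, `id`, `created_at`) are assumptions based on the description, not the actual schema, and date-function syntax varies by SQL dialect (Spark SQL shown here):
   
   ```sql
   -- Hourly distinct-id counts, raw vs Hudi, last 3 days, excluding the current hour.
   SELECT r.hr,
          r.raw_distinct_ids,
          h.hudi_distinct_ids
   FROM (
     SELECT date_trunc('hour', created_at) AS hr,
            COUNT(DISTINCT id) AS raw_distinct_ids
     FROM raw_table
     WHERE created_at >= current_timestamp() - INTERVAL 3 DAYS
       AND created_at < date_trunc('hour', current_timestamp())
     GROUP BY 1
   ) r
   LEFT JOIN (
     SELECT date_trunc('hour', created_at) AS hr,
            COUNT(DISTINCT id) AS hudi_distinct_ids
     FROM hudi_table
     WHERE created_at >= current_timestamp() - INTERVAL 3 DAYS
       AND created_at < date_trunc('hour', current_timestamp())
     GROUP BY 1
   ) h ON r.hr = h.hr
   ORDER BY r.hr;
   ```
   
   Any hour where the two counts diverge narrows down which commit window lost records.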
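   As a sketch, the transformation in point 2 could be expressed with Hudi's `SqlQueryBasedTransformer`, where `<SRC>` refers to the incoming batch. The column list is hypothetical; the real query would enumerate the actual schema columns, omitting `__op` and `__source_ts_ms`:
   
   ```properties
   # Illustrative transformer config: select everything except the CDC metadata
   # columns and force _hoodie_is_deleted to false so deletes are retained as rows.
   hoodie.deltastreamer.transformer.class=org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
   # col1, col2 stand in for the real column list.
   hoodie.deltastreamer.transformer.sql=SELECT col1, col2, false AS _hoodie_is_deleted FROM <SRC>
   ```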
   
   

