Posted to commits@hudi.apache.org by "kazdy (via GitHub)" <gi...@apache.org> on 2023/01/28 09:51:37 UTC

[GitHub] [hudi] kazdy opened a new issue, #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

kazdy opened a new issue, #7778:
URL: https://github.com/apache/hudi/issues/7778

   **Describe the problem you faced**
   
   I have a Hudi table that is read by a Spark Structured Streaming job with checkpointing enabled; the checkpoint is saved to S3.
   When the table is updated and older commits are cleaned, the instant recorded in the saved checkpoint no longer exists and cannot be read, and Hudi throws an NPE.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create hudi table 
   2. Insert data to the table
   3. Consume the table using spark structured streaming and use checkpoint
   4. Insert more data to the hudi table (create a few commits)
   5. Clean commits (leave last one)
   6. Start streaming read again using previously saved checkpoint
   7. Streaming read fails with NPE
   
   **Expected behavior**
   
   Following the Kafka structured streaming source, it would be good to have a "fail on data loss" config for Spark streaming jobs.
   if failOnDataLoss is true -> throw an error warning about potential data loss
   else -> start reading from the earliest available instant.
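
The proposed behavior can be sketched in a few lines. This is a toy model in plain Python, not Hudi code: `resolve_start_instant`, `DataLossError` and the `fail_on_data_loss` argument are hypothetical names, and the timeline is assumed to be a list of fixed-width instant-time strings that sort chronologically.

```python
from typing import List


class DataLossError(RuntimeError):
    """Hypothetical error: the checkpointed instant is gone from the timeline."""


def resolve_start_instant(checkpoint_instant: str,
                          available_instants: List[str],
                          fail_on_data_loss: bool) -> str:
    """Pick the instant a restarted streaming read should resume from."""
    # The checkpointed instant is still on the timeline: resume as usual.
    if checkpoint_instant in available_instants:
        return checkpoint_instant
    # The instant was cleaned; either surface the problem or fall back.
    if fail_on_data_loss:
        raise DataLossError(
            f"instant {checkpoint_instant} is no longer available; "
            "commits after it may have been cleaned")
    if not available_instants:
        raise ValueError("timeline has no instants to fall back to")
    # Assumes fixed-width timestamp strings, so min() is the oldest instant.
    return min(available_instants)
```

With `fail_on_data_loss=False`, a checkpoint that points at a cleaned instant resolves to the earliest instant still present; with `True`, the caller gets an explicit error instead of an NPE.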
   
   **Environment Description**
   
   * Hudi version : 0.12.1 amzn
   
   * Spark version : 3.3.1 amzn
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : emr serverless
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan closed issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists
URL: https://github.com/apache/hudi/issues/7778




[GitHub] [hudi] kazdy commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1412143207

   @codope I think this PR addresses a different issue. I reset the identifier manually anyways before running into this issue.
   
   Here the issue is that the start offset provided by the Spark checkpoint no longer exists in the Hudi table's timeline, so the incremental query cannot read from that instant (?), regardless of these identifiers. This is the streaming source side, and the PR only touches the streaming sink.
   
   The scenario is
   one streaming job consumes data from kafka and saves to hudi table raw
   second streaming job consumes hudi table raw and writes to hudi table processed
   
   first job is running for a while and second is stopped, so commits are cleaned from table raw
   after a while I start second job and it fails with NPE
   
   if I remove spark checkpoint and reset batchId in hudi commit file, then it starts consuming table from earliest available instant
   




[GitHub] [hudi] nsivabalan commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1413050154

   thanks @kazdy. will leave it open. 




[GitHub] [hudi] codope closed issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists
URL: https://github.com/apache/hudi/issues/7778




[GitHub] [hudi] kazdy commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1415716090

   @nsivabalan
   I added the stack trace I got during execution.
   I will be happy to take a look at it and try to come up with a fix.




[GitHub] [hudi] nsivabalan commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1418452730

   Sure @kazdy, that would be awesome. I am curious how you plan to fix the issue, though. In a streaming read, the user might want to get all incremental changes. From what I see, this is nothing but an incremental query on a Hudi table. With incremental queries, we do have a fallback mechanism via `hoodie.datasource.read.incr.fallback.fulltablescan.enable`.
   
   But in a streaming read, the amount of data read might spike (if we plan to do the same), and the user may not have provisioned enough resources for the job.
   
   I am wondering whether we should add something like the `auto.offset.reset` config that Kafka has. If Spark's streaming read already offers something similar, we can leverage that; otherwise we can add a new config in Hudi.
   
   So, users can configure what they want to do in such cases:
   1. Resume reading from the earliest valid commit in Hudi.
      // The implementation might be involved, since we need to detect the earliest commit that hasn't been cleaned by the cleaner yet.
   2. Or do a snapshot query with the latest table state.
   3. Fail the streaming read.
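
The three options above could be modelled as a single policy setting. The sketch below is illustrative only and is not an existing Hudi or Spark API: `ResetPolicy`, `ReadPlan` and `plan_read` are hypothetical names, and instant times are assumed to be fixed-width strings that sort chronologically.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ResetPolicy(Enum):
    """Hypothetical setting: what to do when the checkpointed instant is gone."""
    EARLIEST = "earliest"  # resume from the oldest commit not yet cleaned
    SNAPSHOT = "snapshot"  # read the latest table state in full
    FAIL = "fail"          # stop the streaming read


@dataclass(frozen=True)
class ReadPlan:
    query_type: str               # "incremental" or "snapshot"
    begin_instant: Optional[str]  # only set for incremental reads


def plan_read(checkpoint_instant: str,
              retained_instants: List[str],
              policy: ResetPolicy) -> ReadPlan:
    # Normal case: the checkpointed instant has not been cleaned.
    if checkpoint_instant in retained_instants:
        return ReadPlan("incremental", checkpoint_instant)
    if policy is ResetPolicy.FAIL:
        raise RuntimeError(
            f"instant {checkpoint_instant} has been cleaned from the timeline")
    if policy is ResetPolicy.SNAPSHOT:
        return ReadPlan("snapshot", None)
    # EARLIEST: fall back to the oldest instant still on the timeline.
    if not retained_instants:
        raise RuntimeError("no retained instants to resume from")
    return ReadPlan("incremental", min(retained_instants))
```

Keeping the decision in one function makes the cost of each option visible: `SNAPSHOT` is the choice that can cause the read spike mentioned earlier, while `EARLIEST` depends on correctly identifying which commits the cleaner has left in place.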
   




[GitHub] [hudi] codope commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1411963836

   This should be fixed by #7783 which will be released in 0.13.0.




[GitHub] [hudi] nsivabalan commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1419932843

   feel free to loop me in once you have the patch up. 




[GitHub] [hudi] nsivabalan commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1419932442

   awesome! 
   




[GitHub] [hudi] nsivabalan commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1418454827

   feel free to close the issue and put up a patch w/ the fix. we can continue the discussion over there.
   appreciate your help, thanks!




[GitHub] [hudi] nsivabalan commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1418454235

   have created a jira here https://issues.apache.org/jira/browse/HUDI-5707
   




[GitHub] [hudi] kazdy commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1418741003

   Hi @nsivabalan thanks for your hints, I'll take these into consideration.
   
   Regarding this one:
   > But in streaming read, the amount of data read might spike up(if we plan to do the same) and the user may not have provisioned higher resources for the job.
   
   I think we can support rate limiting in streaming reads. Spark has an interface for it, so I can implement it. This would be similar to what we already have in deltastreamer (limiting the number of instants per micro-batch).
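
A minimal sketch of limiting by number of instants per micro-batch, written as plain Python rather than against the Spark interface (which the thread does not name). `next_batch_end` and `max_instants` are hypothetical names, and instant times are assumed to be strings that sort chronologically.

```python
from typing import List, Optional


def next_batch_end(instants: List[str],
                   last_consumed: Optional[str],
                   max_instants: int) -> Optional[str]:
    """Return the end instant of the next micro-batch, or None if nothing is new."""
    if max_instants < 1:
        raise ValueError("max_instants must be at least 1")
    # Instants strictly after the last consumed one, oldest first.
    pending = sorted(i for i in instants
                     if last_consumed is None or i > last_consumed)
    if not pending:
        return None
    # Cap the batch at max_instants; its end is the last instant admitted.
    return pending[:max_instants][-1]


# Usage: drain a backlog of five commits, two instants per micro-batch.
timeline = ["001", "002", "003", "004", "005"]
offset = None
batch_ends = []
while (end := next_batch_end(timeline, offset, max_instants=2)) is not None:
    batch_ends.append(end)
    offset = end
```

Capping each batch this way means a job that restarts far behind the table catches up over several small micro-batches instead of one large read.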

