Posted to dev@hudi.apache.org by Daniel Kaźmirski <d....@gmail.com> on 2022/04/25 23:39:39 UTC

Spark structured streaming and Spark SQL improvements

Hi,

I would like to propose a few additions to Spark Structured Streaming in
Hudi, as well as some Spark SQL improvements. These would make my life
easier as a Hudi user, so this is from the user perspective; I'm not sure
about the implementation side :)

Spark Structured Streaming:
1. As a user, I would like to be able to specify the starting instant
position when reading a Hudi table in a streaming query. This is not
possible in Structured Streaming right now; it starts streaming data from
the earliest available instant or from the instant saved in the checkpoint.
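For context, Hudi's batch incremental queries already accept a begin-instant option; a minimal sketch of what the proposed streaming equivalent could look like, assuming the streaming reader simply reuses the same option name (today, as far as I know, it is only honored by batch incremental reads):

```python
# Existing: a batch incremental read can start from a chosen instant.
batch_read_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20220425000000",
}

# Proposed (hypothetical): let the streaming reader honor the same
# begin-instant option instead of always starting from the earliest
# available instant or from the checkpointed one.
stream_read_options = {
    "hoodie.datasource.read.begin.instanttime": "20220425000000",
}
```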

2. In Hudi 0.11 it is possible, AFAIK, to fall back to a full table scan in
the absence of commits; this is used in DeltaStreamer. I would like to have
the same functionality in a Structured Streaming query.
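DeltaStreamer's incremental source exposes a flag for this fallback (if I recall the config name correctly); a sketch of how a Structured Streaming reader could mirror it, where the streaming option name below is purely hypothetical:

```python
# Existing DeltaStreamer-side flag: if the checkpointed instant is no
# longer in the timeline (e.g. archived/cleaned), read the latest data
# with a full scan instead of failing.
deltastreamer_options = {
    "hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt": "true",
}

# Hypothetical Structured Streaming counterpart with the same semantics:
stream_options = {
    "hoodie.datasource.streaming.fallback.fulltablescan.enable": "true",
}
```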

3. I would like to be able to limit the input rate when reading a stream
from a Hudi table. I'm thinking of adding maxInstantsPerTrigger/
maxBytesPerTrigger, e.g. so that each micro-batch processes at most 100
instants per trigger.
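This would parallel what Spark's Kafka source already does with maxOffsetsPerTrigger; a sketch of the proposed knobs (the Hudi option names below are hypothetical, derived from the suggestion above):

```python
# Spark's Kafka source already caps per-trigger intake this way:
kafka_options = {"maxOffsetsPerTrigger": "10000"}

# Hypothetical Hudi counterparts: cap each micro-batch by instant count
# or by bytes read.
hudi_stream_options = {
    "hoodie.datasource.streaming.maxInstantsPerTrigger": "100",
    "hoodie.datasource.streaming.maxBytesPerTrigger": str(512 * 1024 * 1024),
}
```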

Spark SQL:
Since 0.11 we get very flexible schema evolution. Could we, as users, have
the schema evolve automatically on MERGE INTO operations?
I guess this should only be supported when we use UPDATE SET * and INSERT *
in the merge operation.
For missing columns, the reconcile-schema functionality could be used.
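To make the suggestion concrete, this is the MERGE INTO shape the automatic evolution would apply to: the source carries columns the target lacks, and the statement uses the star forms (table and column names below are made up for illustration):

```python
# The statement a user would issue; with automatic schema evolution,
# any new columns present in source_updates but missing from
# hudi_target would be added to the target before the merge applies.
merge_sql = """
MERGE INTO hudi_target t
USING source_updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
```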


Best Regards,
Daniel Kaźmirski

Re: Spark structured streaming and Spark SQL improvements

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks for the thoughtful note, Daniel!

All of 1-3 look good to me. Yann/Raymond or other Spark regulars here, any
thoughts on adding these for 0.12?

For 0.12 we want to get schema evolution to GA, so that's also a very
useful suggestion. Tao (author of schema evolution), any thoughts?
