Posted to user@flink.apache.org by Илья Соин <il...@gmail.com> on 2023/04/21 09:59:27 UTC

State bootstrapping for Flink SQL / Table API jobs

Hi Flink community,

We have a fairly complex SQL job: it unions 5 topics, deduplicates by key, and does some daily aggregations. The state TTL is 40 days. We want to be able to bootstrap its state from S3 or ClickHouse, and we want a general solution that works for our other SQL jobs as well.

So far I haven’t found a working solution. I’d like to discuss the best approach to take here and possibly contribute it to Flink.

I think a good solution would be to bring HybridSource to the Table / SQL API.
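
To make the idea concrete, here is a minimal sketch of the behaviour I have in mind: read a bounded historical input to the end, then switch to the live input. It is plain Java, not Flink's API; the class name, method names and sample records are hypothetical, and the iterators only stand in for sources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration: plain iterators stand in for Flink sources.
final class HybridIteratorSketch {

    // Drains `historical` completely, then switches to `live` exactly once.
    static <T> Iterator<T> hybrid(final Iterator<T> historical, final Iterator<T> live) {
        return new Iterator<T>() {
            private Iterator<T> current = historical;
            private boolean switched = false;

            @Override
            public boolean hasNext() {
                if (!switched && !current.hasNext()) {
                    current = live; // the bounded part is exhausted: switch over
                    switched = true;
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    // Collects everything an iterator yields into a list.
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up records: a historical backfill followed by the live tail.
        Iterator<String> backfill = Arrays.asList("k1:day1", "k2:day2").iterator();
        Iterator<String> liveTail = Arrays.asList("k1:day41", "k3:day41").iterator();
        System.out.println(drain(hybrid(backfill, liveTail)));
    }
}
```

Because the downstream operators see one continuous input, their state would be built from the historical records before the live ones arrive, which is the bootstrapping effect we are after.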

Another thought was to take the SQL, replace the unbounded sources with bounded ones, and run the job, then take a savepoint at the end and use it to bootstrap the streaming job. The problem I see here:
- We have no control over operator UIDs or the final table plan, so the plan of the batch job may differ slightly from that of the streaming job.
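
A toy model of why this matters: Flink maps the state in a savepoint back to operators by operator ID, so if the two plans do not produce the same operators and IDs, part of the state is not picked up. The sketch below is plain Java, not Flink's API; the class, the method names and the operator IDs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: a savepoint is reduced to a map from operator ID to a state label.
final class SavepointMatchSketch {

    // Operators of the new job that find no state under their ID.
    static List<String> operatorsWithoutState(Map<String, String> savepoint,
                                              List<String> newJobOperatorIds) {
        List<String> missing = new ArrayList<>();
        for (String id : newJobOperatorIds) {
            if (!savepoint.containsKey(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    // State entries in the savepoint that no operator of the new job claims.
    static List<String> orphanedState(Map<String, String> savepoint,
                                      List<String> newJobOperatorIds) {
        List<String> orphaned = new ArrayList<>();
        for (String id : savepoint.keySet()) {
            if (!newJobOperatorIds.contains(id)) {
                orphaned.add(id);
            }
        }
        return orphaned;
    }

    public static void main(String[] args) {
        // Made-up IDs: the batch plan and the streaming plan agree on the
        // deduplication operator but not on the aggregation operator.
        Map<String, String> batchSavepoint = new LinkedHashMap<>();
        batchSavepoint.put("dedup-a1", "deduplication state");
        batchSavepoint.put("agg-b2", "daily aggregation state");
        List<String> streamingPlan = Arrays.asList("dedup-a1", "agg-c3");

        System.out.println("no state found for: "
                + operatorsWithoutState(batchSavepoint, streamingPlan));
        System.out.println("state nobody claims: "
                + orphanedState(batchSavepoint, streamingPlan));
    }
}
```

In this model the aggregation operator of the streaming job would start empty even though the batch run computed its state, which is exactly the situation we cannot rule out when the planner chooses the IDs.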


-- 
Sincerely,
Ilya Soin

Re: State bootstrapping for Flink SQL / Table API jobs

Posted by Flavio Pompermaier <po...@okkam.it>.

This feature would be an awesome addition! I'm looking forward to it.

On Mon, Apr 24, 2023 at 3:59 PM Илья Соин <il...@gmail.com> wrote:

> Thank you, Shammon FY
>
> --
> *Sincerely,*
> *Ilya Soin*
>
> On 24 Apr 2023, at 15:19, Shammon FY <zj...@gmail.com> wrote:
>
> 
> Thanks Илья, there's already a FLIP [1] and a discussion thread [2] about
> hybrid source. You can follow the progress, and you are welcome to
> participate in the discussion.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235836225
> [2] https://lists.apache.org/thread/nbf3skopy3trtj37jcovmt6ktcgst4w8
>
> Best,
> Shammon FY
>
>
> On Mon, Apr 24, 2023 at 3:30 PM Илья Соин <il...@gmail.com> wrote:
>
>> Hi Shammon FY,
>>
>> I haven’t tried it because AFAIK it’s only available in the DataStream
>> API, while our job is in SQL. I’m thinking of writing a custom
>> HybridDynamicTableSource that uses HybridSource under the hood. This
>> should make it possible to bootstrap any SQL / Table API job. Maybe it’s
>> something worth adding to the Flink distribution?
>>
>> --
>> *Sincerely,*
>>
>> *Ilya Soin*
>>
>> On 24 Apr 2023, at 03:37, Shammon FY <zj...@gmail.com> wrote:
>>
>> 
>> Hi Илья
>>
>> I think HybridSource may be a good way. Have you tried it before? Or have
>> you encountered any problems?
>>
>> Best,
>> Shammon FY
>>
>> On Fri, Apr 21, 2023 at 5:59 PM Илья Соин <il...@gmail.com> wrote:
>>
>>> Hi Flink community,
>>>
>>> We have a fairly complex SQL job: it unions 5 topics, deduplicates by key,
>>> and does some daily aggregations. The state TTL is 40 days. We want to be
>>> able to bootstrap its state from S3 or ClickHouse, and we want a general
>>> solution that works for our other SQL jobs as well.
>>>
>>> So far I haven’t found a working solution. I’d like to discuss the best
>>> approach to take here and possibly contribute it to Flink.
>>>
>>> I think a good solution would be to bring HybridSource to the Table / SQL
>>> API.
>>>
>>> Another thought was to take the SQL, replace the unbounded sources with
>>> bounded ones, and run the job, then take a savepoint at the end and use it
>>> to bootstrap the streaming job. The problem I see here:
>>> - We have no control over operator UIDs or the final table plan, so the
>>> plan of the batch job may differ slightly from that of the streaming job.
>>>
>>>
>>> --
>>> *Sincerely,*
>>> *Ilya Soin*
>>>
>>

Re: State bootstrapping for Flink SQL / Table API jobs

Posted by Илья Соин <il...@gmail.com>.

Thank you, Shammon FY

-- 
Sincerely,
Ilya Soin


Re: State bootstrapping for Flink SQL / Table API jobs

Posted by Shammon FY <zj...@gmail.com>.

Thanks Илья, there's already a FLIP [1] and a discussion thread [2] about
hybrid source. You can follow the progress, and you are welcome to
participate in the discussion.

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235836225
[2] https://lists.apache.org/thread/nbf3skopy3trtj37jcovmt6ktcgst4w8

Best,
Shammon FY



Re: State bootstrapping for Flink SQL / Table API jobs

Posted by Илья Соин <il...@gmail.com>.

Hi Shammon FY,

I haven’t tried it because AFAIK it’s only available in the DataStream API,
while our job is in SQL. I’m thinking of writing a custom
HybridDynamicTableSource that uses HybridSource under the hood. This
should make it possible to bootstrap any SQL / Table API job. Maybe it’s
something worth adding to the Flink distribution?

-- 
Sincerely,
Ilya Soin


Re: State bootstrapping for Flink SQL / Table API jobs

Posted by Shammon FY <zj...@gmail.com>.

Hi Илья

I think HybridSource may be a good way. Have you tried it before? Or have
you encountered any problems?

Best,
Shammon FY
