Posted to user@flink.apache.org by Satyam Shekhar <sa...@gmail.com> on 2020/05/29 22:18:57 UTC

Sorting Bounded Streams

Hello,

I am using Flink as the streaming execution engine for building a
low-latency alerting application. The use case also requires ad-hoc
querying on batch data, which I also plan to serve using Flink to avoid the
complexity of maintaining two separate engines.

My current understanding is that the ORDER BY operator in the Blink planner
(on DataStream) requires a time attribute as the primary sort column. This is
quite limiting for ad-hoc querying. It seems I can use the DataSet API to
obtain a globally sorted output on an arbitrary column but that will force
me to use the older Flink planner.

Specifically, I am looking for guidance from the community on the following
questions -

   1. Is it possible to obtain a globally sorted output on DataStreams on
   an arbitrary sort column?
   2. What are the tradeoffs in using DataSet vs DataStream in performance,
   long term support, etc?
   3. Is there any other way to address this issue?

Regards,
Satyam

Re: Sorting Bounded Streams

Posted by Benchao Li <li...@gmail.com>.
Hi Satyam,

You are correct. The Blink planner is built on top of DataStream, for both
batch and streaming. Hence you cannot convert a Table into a DataSet if you
are using the Blink planner.

AFAIK, the community is working on the unification of batch and streaming.
The unification will be Table API/SQL on top of BoundedStream, and the
DataSet/DataStream APIs will then be unified as the BoundedStream API. Hence
the DataSet API is not the recommended approach for the long term.


On Sat, May 30, 2020 at 3:34 PM Satyam Shekhar <sa...@gmail.com> wrote:

> Thanks for your reply, Benchao Li.
>
> While I can use the Blink planner in batch mode, I'd still have to work
> with DataSet. Based on my limited reading it appears to me that DataStream
> is being extended to support both batch and streaming use-cases with the
> `isBounded` method in the StreamTableSource interface. Is that correct?
>
> Is working with DataSet the recommended approach for the long term? Are
> there any performance implications for that decision?
>
> Regards,
> Satyam

-- 

Best,
Benchao Li

Re: Sorting Bounded Streams

Posted by Satyam Shekhar <sa...@gmail.com>.
Thanks for your reply, Benchao Li.

While I can use the Blink planner in batch mode, I'd still have to work
with DataSet. Based on my limited reading it appears to me that DataStream
is being extended to support both batch and streaming use-cases with the
`isBounded` method in the StreamTableSource interface. Is that correct?

Is working with DataSet the recommended approach for the long term? Are
there any performance implications for that decision?

Regards,
Satyam


On Fri, May 29, 2020 at 9:01 PM Benchao Li <li...@gmail.com> wrote:

> Hi Satyam,
>
> Are you using blink planner in streaming mode? AFAIK, blink planner in
> batch mode can sort on arbitrary columns.

Re: Sorting Bounded Streams

Posted by Benchao Li <li...@gmail.com>.
Hi Satyam,

Are you using the Blink planner in streaming mode? AFAIK, the Blink planner
in batch mode can sort on arbitrary columns.
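
For context on why bounded input makes a sort on an arbitrary column
possible: once the whole input is known, a parallel engine can split the
rows into key ranges, sort each range locally, and concatenate the ranges in
order. The sketch below shows that pattern in plain Java. It is not Flink
API and does not claim to be how the Blink planner implements ORDER BY; the
Alert type, the severity column, and the boundary values are hypothetical.
In the DataSet API the corresponding building blocks are partitionByRange
and sortPartition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GlobalSortSketch {

    /** Hypothetical record with a non-time column (severity) to sort on. */
    static final class Alert {
        final String host;
        final int severity;

        Alert(String host, int severity) {
            this.host = host;
            this.severity = severity;
        }

        @Override
        public String toString() {
            return host + ":" + severity;
        }
    }

    /**
     * Globally sorts a bounded input by severity.
     * boundaries must be ascending. Partition i receives keys that are
     * greater than boundaries[i - 1] and at most boundaries[i]; the last
     * partition receives every key greater than the last boundary.
     */
    static List<Alert> globalSortBySeverity(List<Alert> input, int[] boundaries) {
        List<List<Alert>> partitions = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            partitions.add(new ArrayList<>());
        }

        // Step 1: range-partition, so every key in partition i is smaller
        // than every key in partition i + 1.
        for (Alert a : input) {
            int p = 0;
            while (p < boundaries.length && a.severity > boundaries[p]) {
                p++;
            }
            partitions.get(p).add(a);
        }

        // Step 2: sort each partition locally (independent work that a
        // parallel engine could run on separate tasks).
        // Step 3: concatenate the partitions in range order.
        List<Alert> out = new ArrayList<>();
        for (List<Alert> part : partitions) {
            part.sort(Comparator.comparingInt((Alert a) -> a.severity));
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Alert> input = Arrays.asList(
                new Alert("h1", 7), new Alert("h2", 2), new Alert("h3", 9),
                new Alert("h4", 4), new Alert("h5", 1));
        // Boundaries 3 and 6 yield three ranges: <= 3, 4..6, and > 6.
        System.out.println(globalSortBySeverity(input, new int[] {3, 6}));
    }
}
```

The range step needs boundaries chosen with knowledge of the complete data
set, which is why this approach fits bounded input and not an unbounded
stream, where new rows can always arrive with keys that belong earlier in
the order.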

On Sat, May 30, 2020 at 6:19 AM Satyam Shekhar <sa...@gmail.com> wrote:

> Hello,
>
> I am using Flink as the streaming execution engine for building a
> low-latency alerting application. The use case also requires ad-hoc
> querying on batch data, which I also plan to serve using Flink to avoid the
> complexity of maintaining two separate engines.
>
> My current understanding is that the ORDER BY operator in the Blink planner
> (on DataStream) requires a time attribute as the primary sort column. This is
> quite limiting for ad-hoc querying. It seems I can use the DataSet API to
> obtain a globally sorted output on an arbitrary column but that will force
> me to use the older Flink planner.
>
> Specifically, I am looking for guidance from the community on the
> following questions -
>
>    1. Is it possible to obtain a globally sorted output on DataStreams on
>    an arbitrary sort column?
>    2. What are the tradeoffs in using DataSet vs DataStream in
>    performance, long term support, etc?
>    3. Is there any other way to address this issue?
>
> Regards,
> Satyam
>


-- 

Best,
Benchao Li