You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by kant kodali <ka...@gmail.com> on 2020/02/18 07:10:10 UTC

Can Connected Components run on a streaming dataset using iterate delta?

Hi All,

I am wondering if connected components
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/examples.html#connected-components>
can run on a streaming data? or say incremental batch?

I see that with delta iteration not all vertices need to participate at
every iteration which is great but in my case the graph is evolving over
time other words new edges are getting added over time. If so, does the
algorithm needs to run on the entire graph or can it simply run on the new
batch of edges?

Finally, What is the best way to compute connected components on Graphs
evolving over time? Should I use streaming or batch or any custom
incremental approach? Also, the documentation take maxIterations as an
input. How can I come up with a good number for max iterations? and once I
come up with a good number for max Iterations is the algorithm guaranteed
to converge?


Thanks,
Kant

Re: Can Connected Components run on a streaming dataset using iterate delta?

Posted by Arvid Heise <ar...@ververica.com>.
Hi Kant,

there has not been high demand yet and it's always a matter of priority for
the scarce manpower.
I'd probably get inspired by gelly and implement it on DataStream in your
stead.

On Sat, Feb 22, 2020 at 11:23 AM kant kodali <ka...@gmail.com> wrote:

> Hi,
>
> Thanks for that but Looks like it is already available
> https://github.com/vasia/gelly-streaming in streaming but I wonder why
> this is not part of Flink? there are no releases either.
>
> Thanks!
>
> On Tue, Feb 18, 2020 at 9:13 AM Yun Gao <yu...@aliyun.com> wrote:
>
>>        Hi Kant,
>>
>>               As far as I know, I think the current example connected
>> components implementation based on DataSet API could not be extended to
>> streaming data or incremental batch directly.
>>
>>               From the algorithm's perspective, if the graph only add
>> edge and never remove edge, I think the connected components should be able
>> to be updated incrementally when the graph changes: When some edges are
>> added, a new search should be started from the sources of the added edges
>> to propagate its component ID. This will trigger a new pass of update of
>> the following vertices, and the updates continues until no vertices'
>> component ID get updated. However, if there are also edge removes, I think
>> the incremental computation should not be easily achieved.
>>
>>               To implement the above logic on Flink, I think currently
>> there should be two possible methods:
>>                     1) Use DataSet API and DataSet iteration, maintains
>> the graph structure and the latest computation result in a storage, and
>> whenever there are enough changes to the graph, submits a new DataSet job
>> to recompute the result. The job should load the edges, the latest
>> component id and whether it is the source of the newly added edges for each
>> graph vertex, and then start the above incremental computation logic.
>>                     2) Flink also provide DataStream iteration API[1]
>> that enables iterating on the unbounded data. In this case the graph
>> modification should be modeled as a datastream, and some operators inside
>> the iteration should maintain the graph structure and current component id.
>> whenever there are enough changes, it starts a new pass of computation.
>>
>>     Best,
>>      Yun
>>
>>         [1] Flink DataStream iteration,
>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#iterations
>>
>> ------------------------------------------------------------------
>> From:kant kodali <ka...@gmail.com>
>> Send Time:2020 Feb. 18 (Tue.) 15:11
>> To:user <us...@flink.apache.org>
>> Subject:Can Connected Components run on a streaming dataset using iterate
>> delta?
>>
>> Hi All,
>>
>> I am wondering if connected components
>> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/examples.html#connected-components>
>> can run on a streaming data? or say incremental batch?
>>
>> I see that with delta iteration not all vertices need to participate at
>> every iteration which is great but in my case the graph is evolving over
>> time other words new edges are getting added over time. If so, does the
>> algorithm needs to run on the entire graph or can it simply run on the new
>> batch of edges?
>>
>> Finally, What is the best way to compute connected components on Graphs
>> evolving over time? Should I use streaming or batch or any custom
>> incremental approach? Also, the documentation take maxIterations as an
>> input. How can I come up with a good number for max iterations? and once I
>> come up with a good number for max Iterations is the algorithm guaranteed
>> to converge?
>>
>>
>> Thanks,
>> Kant
>>
>>
>>

Re: Can Connected Components run on a streaming dataset using iterate delta?

Posted by kant kodali <ka...@gmail.com>.
Hi,

Thanks for that but Looks like it is already available
https://github.com/vasia/gelly-streaming in streaming but I wonder why this
is not part of Flink? there are no releases either.

Thanks!

On Tue, Feb 18, 2020 at 9:13 AM Yun Gao <yu...@aliyun.com> wrote:

>        Hi Kant,
>
>               As far as I know, I think the current example connected
> components implementation based on DataSet API could not be extended to
> streaming data or incremental batch directly.
>
>               From the algorithm's perspective, if the graph only add edge
> and never remove edge, I think the connected components should be able to
> be updated incrementally when the graph changes: When some edges are added,
> a new search should be started from the sources of the added edges to
> propagate its component ID. This will trigger a new pass of update of the
> following vertices, and the updates continues until no vertices' component
> ID get updated. However, if there are also edge removes, I think the
> incremental computation should not be easily achieved.
>
>               To implement the above logic on Flink, I think currently
> there should be two possible methods:
>                     1) Use DataSet API and DataSet iteration, maintains
> the graph structure and the latest computation result in a storage, and
> whenever there are enough changes to the graph, submits a new DataSet job
> to recompute the result. The job should load the edges, the latest
> component id and whether it is the source of the newly added edges for each
> graph vertex, and then start the above incremental computation logic.
>                     2) Flink also provide DataStream iteration API[1] that
> enables iterating on the unbounded data. In this case the graph
> modification should be modeled as a datastream, and some operators inside
> the iteration should maintain the graph structure and current component id.
> whenever there are enough changes, it starts a new pass of computation.
>
>     Best,
>      Yun
>
>         [1] Flink DataStream iteration,
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#iterations
>
> ------------------------------------------------------------------
> From:kant kodali <ka...@gmail.com>
> Send Time:2020 Feb. 18 (Tue.) 15:11
> To:user <us...@flink.apache.org>
> Subject:Can Connected Components run on a streaming dataset using iterate
> delta?
>
> Hi All,
>
> I am wondering if connected components
> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/examples.html#connected-components>
> can run on a streaming data? or say incremental batch?
>
> I see that with delta iteration not all vertices need to participate at
> every iteration which is great but in my case the graph is evolving over
> time other words new edges are getting added over time. If so, does the
> algorithm needs to run on the entire graph or can it simply run on the new
> batch of edges?
>
> Finally, What is the best way to compute connected components on Graphs
> evolving over time? Should I use streaming or batch or any custom
> incremental approach? Also, the documentation take maxIterations as an
> input. How can I come up with a good number for max iterations? and once I
> come up with a good number for max Iterations is the algorithm guaranteed
> to converge?
>
>
> Thanks,
> Kant
>
>
>

Re: Can Connected Components run on a streaming dataset using iterate delta?

Posted by Yun Gao <yu...@aliyun.com>.
       Hi Kant, 

              As far as I know, I think the current example connected components implementation based on DataSet API could not be extended to streaming data or incremental batch directly. 
              From the algorithm's perspective, if the graph only add edge and never remove edge, I think the connected components should be able to be updated incrementally when the graph changes: When some edges are added, a new search should be started from the sources of the added edges to propagate its component ID. This will trigger a new pass of update of the following vertices, and the updates continues until no vertices' component ID get updated. However, if there are also edge removes, I think the incremental computation should not be easily achieved. 
              To implement the above logic on Flink, I think currently there should be two possible methods: 
                    1) Use DataSet API and DataSet iteration, maintains the graph structure and the latest computation result in a storage, and whenever there are enough changes to the graph, submits a new DataSet job to recompute the result. The job should load the edges, the latest component id and whether it is the source of the newly added edges for each graph vertex, and then start the above incremental computation logic. 
                    2) Flink also provide DataStream iteration API[1] that enables iterating on the unbounded data. In this case the graph modification should be modeled as a datastream, and some operators inside the iteration should maintain the graph structure and current component id. whenever there are enough changes, it starts a new pass of computation.

    Best,
     Yun

        [1] Flink DataStream iteration, https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#iterations


------------------------------------------------------------------
From:kant kodali <ka...@gmail.com>
Send Time:2020 Feb. 18 (Tue.) 15:11
To:user <us...@flink.apache.org>
Subject:Can Connected Components run on a streaming dataset using iterate delta?

Hi All,

I am wondering if connected components can run on a streaming data? or say incremental batch?

I see that with delta iteration not all vertices need to participate at every iteration which is great but in my case the graph is evolving over time other words new edges are getting added over time. If so, does the algorithm needs to run on the entire graph or can it simply run on the new batch of edges?

Finally, What is the best way to compute connected components on Graphs evolving over time? Should I use streaming or batch or any custom incremental approach? Also, the documentation take maxIterations as an input. How can I come up with a good number for max iterations? and once I come up with a good number for max Iterations is the algorithm guaranteed to converge?


Thanks,
Kant