You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Tarek Auel <ta...@gmail.com> on 2015/06/01 17:54:17 UTC

GraphX: New graph operator

Hello,

Someone proposed in a Jira issue to implement new graph operations. Sean
Owen recommended to check first with the mailing list, if this is
interesting or not.

So I would like to know, if it is interesting for GraphX to implement the
operators like:
http://en.wikipedia.org/wiki/Graph_operations and/or
http://techieme.in/complex-graph-operations/

If yes, should they be integrated into GraphImpl (like mask, subgraph etc.)
or as external library? My feeling is that they are similar to mask.
Because of consistency they should be part of the graph implementation
itself.

What do you guys think? I really would like to bring GraphX forward and
help to implement some of these.

Looking forward to hear your opinions
Tarek

Re: GraphX: New graph operator

Posted by Reynold Xin <rx...@databricks.com>.
I'd think id is the unique identifier by default.


On Wed, Jun 3, 2015 at 12:13 AM, Tarek Auel <ta...@gmail.com> wrote:

> Hi,
>
> The graph is already there (GraphX) and has the two RDDs you described. My
> question tries to get an idea, if the community thinks that it's a benefit
> and would be a plus or not. If yes, I would like to contribute it to GraphX
> (either as part of GraphOpts or as external library).
>
> An interesting question is for me, if these operators take the attributes
> into account (and ignore the ids) or if the id has to match too. I believe
> that ignoring the id and focus on the attributes is harder but more
> powerful. If I think of a graph with a node for the country US for instance
> and I want to merge a second graph, which contains a U.S. node too, I would
> expect that these two nodes will be merged (ignoring the id). Is this
> thought valid, or does it more sense to merge based on the id?
>
> Thanks for your inputs !
>
> Tarek
> On Tue 2 Jun 2015 at 11:58 pm Reynold Xin <rx...@databricks.com> wrote:
>
>> Hi Tarek,
>>
>> I took a quick look at the materials you shared. It actually seems to me
>> it'd be super easy to express a graph as two DataFrames: one for edges
>> (srcid, dstid, and other edge attributes) and one for vertices (vid, and
>> other vertex attributes).
>>
>> Then
>>
>> intersection is just
>>
>> edges1.intersect(edges2)
>>
>>
>> "join" is just
>>
>> edges1.union(edges2).distinct
>>
>>
>>
>>
>> On Tue, Jun 2, 2015 at 12:12 AM, Tarek Auel <ta...@gmail.com> wrote:
>>
>>> Okay thanks for your feedback.
>>>
>>> What is the expected behavior of union? Like Union and/or union all of
>>> SQL? Union all would be more or less trivial if we just concatenate the
>>> vertices and edges (vertex Id conflicts have to be resolved). Should union
>>> look for duplicates on the actual attribute (VD) or just the vertex Id? If
>>> it compares the attribute it might be necessary to change the id of some
>>> vertices in order to resolve conflicts.
>>>
>>> Already a big thanks for your inputs !
>>>
>>> On Mon 1 Jun 2015 at 11:55 pm Ankur Dave <an...@gmail.com> wrote:
>>>
>>>> I think it would be good to have more basic operators like union or
>>>> difference, as long as they have an efficient distributed implementation
>>>> and are plausibly useful.
>>>>
>>>> If they can be written in terms of the existing GraphX API, it would be
>>>> best to put them into GraphOps to keep the core GraphX implementation
>>>> small. The `mask` operation should actually be in GraphOps -- it's only in
>>>> GraphImpl for historical reasons. On the other hand, `subgraph` needs to be
>>>> in GraphImpl for performance: it accesses EdgeRDDImpl#filter(epred, vpred),
>>>> which can't be a public EdgeRDD method because its semantics rely on an
>>>> implementation detail (vertex replication).
>>>>
>>>> Ankur <http://www.ankurdave.com/>
>>>>
>>>> On Mon, Jun 1, 2015 at 8:54 AM, Tarek Auel <ta...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Someone proposed in a Jira issue to implement new graph operations.
>>>>> Sean Owen recommended to check first with the mailing list, if this is
>>>>> interesting or not.
>>>>>
>>>>> So I would like to know, if it is interesting for GraphX to implement
>>>>> the operators like:
>>>>> http://en.wikipedia.org/wiki/Graph_operations and/or
>>>>> http://techieme.in/complex-graph-operations/
>>>>>
>>>>> If yes, should they be integrated into GraphImpl (like mask, subgraph
>>>>> etc.) or as external library? My feeling is that they are similar to mask.
>>>>> Because of consistency they should be part of the graph implementation
>>>>> itself.
>>>>>
>>>>> What do you guys think? I really would like to bring GraphX forward
>>>>> and help to implement some of these.
>>>>>
>>>>> Looking forward to hear your opinions
>>>>> Tarek
>>>>>
>>>>>
>>>>
>>

Re: GraphX: New graph operator

Posted by Tarek Auel <ta...@gmail.com>.
Hi,

The graph is already there (GraphX) and has the two RDDs you described. My
question tries to get an idea, if the community thinks that it's a benefit
and would be a plus or not. If yes, I would like to contribute it to GraphX
(either as part of GraphOpts or as external library).

An interesting question is for me, if these operators take the attributes
into account (and ignore the ids) or if the id has to match too. I believe
that ignoring the id and focus on the attributes is harder but more
powerful. If I think of a graph with a node for the country US for instance
and I want to merge a second graph, which contains a U.S. node too, I would
expect that these two nodes will be merged (ignoring the id). Is this
thought valid, or does it more sense to merge based on the id?

Thanks for your inputs !

Tarek
On Tue 2 Jun 2015 at 11:58 pm Reynold Xin <rx...@databricks.com> wrote:

> Hi Tarek,
>
> I took a quick look at the materials you shared. It actually seems to me
> it'd be super easy to express a graph as two DataFrames: one for edges
> (srcid, dstid, and other edge attributes) and one for vertices (vid, and
> other vertex attributes).
>
> Then
>
> intersection is just
>
> edges1.intersect(edges2)
>
>
> "join" is just
>
> edges1.union(edges2).distinct
>
>
>
>
> On Tue, Jun 2, 2015 at 12:12 AM, Tarek Auel <ta...@gmail.com> wrote:
>
>> Okay thanks for your feedback.
>>
>> What is the expected behavior of union? Like Union and/or union all of
>> SQL? Union all would be more or less trivial if we just concatenate the
>> vertices and edges (vertex Id conflicts have to be resolved). Should union
>> look for duplicates on the actual attribute (VD) or just the vertex Id? If
>> it compares the attribute it might be necessary to change the id of some
>> vertices in order to resolve conflicts.
>>
>> Already a big thanks for your inputs !
>>
>> On Mon 1 Jun 2015 at 11:55 pm Ankur Dave <an...@gmail.com> wrote:
>>
>>> I think it would be good to have more basic operators like union or
>>> difference, as long as they have an efficient distributed implementation
>>> and are plausibly useful.
>>>
>>> If they can be written in terms of the existing GraphX API, it would be
>>> best to put them into GraphOps to keep the core GraphX implementation
>>> small. The `mask` operation should actually be in GraphOps -- it's only in
>>> GraphImpl for historical reasons. On the other hand, `subgraph` needs to be
>>> in GraphImpl for performance: it accesses EdgeRDDImpl#filter(epred, vpred),
>>> which can't be a public EdgeRDD method because its semantics rely on an
>>> implementation detail (vertex replication).
>>>
>>> Ankur <http://www.ankurdave.com/>
>>>
>>> On Mon, Jun 1, 2015 at 8:54 AM, Tarek Auel <ta...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Someone proposed in a Jira issue to implement new graph operations.
>>>> Sean Owen recommended to check first with the mailing list, if this is
>>>> interesting or not.
>>>>
>>>> So I would like to know, if it is interesting for GraphX to implement
>>>> the operators like:
>>>> http://en.wikipedia.org/wiki/Graph_operations and/or
>>>> http://techieme.in/complex-graph-operations/
>>>>
>>>> If yes, should they be integrated into GraphImpl (like mask, subgraph
>>>> etc.) or as external library? My feeling is that they are similar to mask.
>>>> Because of consistency they should be part of the graph implementation
>>>> itself.
>>>>
>>>> What do you guys think? I really would like to bring GraphX forward and
>>>> help to implement some of these.
>>>>
>>>> Looking forward to hear your opinions
>>>> Tarek
>>>>
>>>>
>>>
>

Re: GraphX: New graph operator

Posted by "jinkui.sjk" <ji...@alibaba-inc.com>.
i have a pull request about this issue. https://github.com/apache/spark/pull/6685 <https://github.com/apache/spark/pull/6685>
the union operation of two graph is useful in practice. And it’s necessary to provide operation on the Graph level.  

> On 3 Jun, 2015, at 2:58 pm, Reynold Xin <rx...@databricks.com> wrote:
> 
> Hi Tarek,
> 
> I took a quick look at the materials you shared. It actually seems to me it'd be super easy to express a graph as two DataFrames: one for edges (srcid, dstid, and other edge attributes) and one for vertices (vid, and other vertex attributes).
> 
> Then 
> 
> intersection is just
> 
> edges1.intersect(edges2)
> 
> 
> "join" is just
> 
> edges1.union(edges2).distinct
> 
> 
> 
> 
> On Tue, Jun 2, 2015 at 12:12 AM, Tarek Auel <tarek.auel@gmail.com <ma...@gmail.com>> wrote:
> Okay thanks for your feedback. 
> 
> What is the expected behavior of union? Like Union and/or union all of SQL? Union all would be more or less trivial if we just concatenate the vertices and edges (vertex Id conflicts have to be resolved). Should union look for duplicates on the actual attribute (VD) or just the vertex Id? If it compares the attribute it might be necessary to change the id of some vertices in order to resolve conflicts. 
> 
> Already a big thanks for your inputs !
> 
> On Mon 1 Jun 2015 at 11:55 pm Ankur Dave <ankurdave@gmail.com <ma...@gmail.com>> wrote:
> I think it would be good to have more basic operators like union or difference, as long as they have an efficient distributed implementation and are plausibly useful.
> 
> If they can be written in terms of the existing GraphX API, it would be best to put them into GraphOps to keep the core GraphX implementation small. The `mask` operation should actually be in GraphOps -- it's only in GraphImpl for historical reasons. On the other hand, `subgraph` needs to be in GraphImpl for performance: it accesses EdgeRDDImpl#filter(epred, vpred), which can't be a public EdgeRDD method because its semantics rely on an implementation detail (vertex replication).
> 
> Ankur <http://www.ankurdave.com/>
> 
> On Mon, Jun 1, 2015 at 8:54 AM, Tarek Auel <tarek.auel@gmail.com <ma...@gmail.com>> wrote:
> Hello,
> 
> Someone proposed in a Jira issue to implement new graph operations. Sean Owen recommended to check first with the mailing list, if this is interesting or not.
> 
> So I would like to know, if it is interesting for GraphX to implement the operators like:
> http://en.wikipedia.org/wiki/Graph_operations <http://en.wikipedia.org/wiki/Graph_operations> and/or
> http://techieme.in/complex-graph-operations/ <http://techieme.in/complex-graph-operations/> 
> 
> If yes, should they be integrated into GraphImpl (like mask, subgraph etc.) or as external library? My feeling is that they are similar to mask. Because of consistency they should be part of the graph implementation itself.
> 
> What do you guys think? I really would like to bring GraphX forward and help to implement some of these.
> 
> Looking forward to hear your opinions
> Tarek
> 
> 
> 


Re: GraphX: New graph operator

Posted by Reynold Xin <rx...@databricks.com>.
Hi Tarek,

I took a quick look at the materials you shared. It actually seems to me
it'd be super easy to express a graph as two DataFrames: one for edges
(srcid, dstid, and other edge attributes) and one for vertices (vid, and
other vertex attributes).

Then

intersection is just

edges1.intersect(edges2)


"join" is just

edges1.union(edges2).distinct




On Tue, Jun 2, 2015 at 12:12 AM, Tarek Auel <ta...@gmail.com> wrote:

> Okay thanks for your feedback.
>
> What is the expected behavior of union? Like Union and/or union all of
> SQL? Union all would be more or less trivial if we just concatenate the
> vertices and edges (vertex Id conflicts have to be resolved). Should union
> look for duplicates on the actual attribute (VD) or just the vertex Id? If
> it compares the attribute it might be necessary to change the id of some
> vertices in order to resolve conflicts.
>
> Already a big thanks for your inputs !
>
> On Mon 1 Jun 2015 at 11:55 pm Ankur Dave <an...@gmail.com> wrote:
>
>> I think it would be good to have more basic operators like union or
>> difference, as long as they have an efficient distributed implementation
>> and are plausibly useful.
>>
>> If they can be written in terms of the existing GraphX API, it would be
>> best to put them into GraphOps to keep the core GraphX implementation
>> small. The `mask` operation should actually be in GraphOps -- it's only in
>> GraphImpl for historical reasons. On the other hand, `subgraph` needs to be
>> in GraphImpl for performance: it accesses EdgeRDDImpl#filter(epred, vpred),
>> which can't be a public EdgeRDD method because its semantics rely on an
>> implementation detail (vertex replication).
>>
>> Ankur <http://www.ankurdave.com/>
>>
>> On Mon, Jun 1, 2015 at 8:54 AM, Tarek Auel <ta...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Someone proposed in a Jira issue to implement new graph operations. Sean
>>> Owen recommended to check first with the mailing list, if this is
>>> interesting or not.
>>>
>>> So I would like to know, if it is interesting for GraphX to implement
>>> the operators like:
>>> http://en.wikipedia.org/wiki/Graph_operations and/or
>>> http://techieme.in/complex-graph-operations/
>>>
>>> If yes, should they be integrated into GraphImpl (like mask, subgraph
>>> etc.) or as external library? My feeling is that they are similar to mask.
>>> Because of consistency they should be part of the graph implementation
>>> itself.
>>>
>>> What do you guys think? I really would like to bring GraphX forward and
>>> help to implement some of these.
>>>
>>> Looking forward to hear your opinions
>>> Tarek
>>>
>>>
>>

Re: GraphX: New graph operator

Posted by Tarek Auel <ta...@gmail.com>.
Okay thanks for your feedback.

What is the expected behavior of union? Like Union and/or union all of SQL?
Union all would be more or less trivial if we just concatenate the vertices
and edges (vertex Id conflicts have to be resolved). Should union look for
duplicates on the actual attribute (VD) or just the vertex Id? If it
compares the attribute it might be necessary to change the id of some
vertices in order to resolve conflicts.

Already a big thanks for your inputs !
On Mon 1 Jun 2015 at 11:55 pm Ankur Dave <an...@gmail.com> wrote:

> I think it would be good to have more basic operators like union or
> difference, as long as they have an efficient distributed implementation
> and are plausibly useful.
>
> If they can be written in terms of the existing GraphX API, it would be
> best to put them into GraphOps to keep the core GraphX implementation
> small. The `mask` operation should actually be in GraphOps -- it's only in
> GraphImpl for historical reasons. On the other hand, `subgraph` needs to be
> in GraphImpl for performance: it accesses EdgeRDDImpl#filter(epred, vpred),
> which can't be a public EdgeRDD method because its semantics rely on an
> implementation detail (vertex replication).
>
> Ankur <http://www.ankurdave.com/>
>
> On Mon, Jun 1, 2015 at 8:54 AM, Tarek Auel <ta...@gmail.com> wrote:
>
>> Hello,
>>
>> Someone proposed in a Jira issue to implement new graph operations. Sean
>> Owen recommended to check first with the mailing list, if this is
>> interesting or not.
>>
>> So I would like to know, if it is interesting for GraphX to implement the
>> operators like:
>> http://en.wikipedia.org/wiki/Graph_operations and/or
>> http://techieme.in/complex-graph-operations/
>>
>> If yes, should they be integrated into GraphImpl (like mask, subgraph
>> etc.) or as external library? My feeling is that they are similar to mask.
>> Because of consistency they should be part of the graph implementation
>> itself.
>>
>> What do you guys think? I really would like to bring GraphX forward and
>> help to implement some of these.
>>
>> Looking forward to hear your opinions
>> Tarek
>>
>>
>

Re: GraphX: New graph operator

Posted by Ankur Dave <an...@gmail.com>.
I think it would be good to have more basic operators like union or
difference, as long as they have an efficient distributed implementation
and are plausibly useful.

If they can be written in terms of the existing GraphX API, it would be
best to put them into GraphOps to keep the core GraphX implementation
small. The `mask` operation should actually be in GraphOps -- it's only in
GraphImpl for historical reasons. On the other hand, `subgraph` needs to be
in GraphImpl for performance: it accesses EdgeRDDImpl#filter(epred, vpred),
which can't be a public EdgeRDD method because its semantics rely on an
implementation detail (vertex replication).

Ankur <http://www.ankurdave.com/>

On Mon, Jun 1, 2015 at 8:54 AM, Tarek Auel <ta...@gmail.com> wrote:

> Hello,
>
> Someone proposed in a Jira issue to implement new graph operations. Sean
> Owen recommended to check first with the mailing list, if this is
> interesting or not.
>
> So I would like to know, if it is interesting for GraphX to implement the
> operators like:
> http://en.wikipedia.org/wiki/Graph_operations and/or
> http://techieme.in/complex-graph-operations/
>
> If yes, should they be integrated into GraphImpl (like mask, subgraph
> etc.) or as external library? My feeling is that they are similar to mask.
> Because of consistency they should be part of the graph implementation
> itself.
>
> What do you guys think? I really would like to bring GraphX forward and
> help to implement some of these.
>
> Looking forward to hear your opinions
> Tarek
>
>