Posted to dev@spark.apache.org by Mats Rydberg <ma...@neo4j.org> on 2019/10/04 08:32:19 UTC

SparkGraph review process

Hello dear Spark community

We are the developers behind the SparkGraph SPIP, which is a project
created out of our work on openCypher Morpheus (
https://github.com/opencypher/morpheus). During this year we have
collaborated with mainly Xiangrui Meng of Databricks to define and develop
a new SparkGraph module based on our experience from working on Morpheus.
Morpheus - formerly known as "Cypher for Apache Spark" - has been in
development for over 3 years and has matured in both its API and its
implementation.

The SPIP work has been on hold for some time now, as changed priorities at
Databricks (among other things) have occupied Xiangrui's time. As you may
know, the latest API PR (https://github.com/apache/spark/pull/24851) is
blocking us from moving forward with the implementation.

So as not to lose track of this project, we are reaching out to ask whether
any Spark committers in the community would be prepared to commit to
helping us review and merge our code contributions to Apache Spark. We are
not asking for much direct development support, as we believe the
implementation has been more or less complete since early this year. There
is a proof-of-concept PR (https://github.com/apache/spark/pull/24297) that
contains the functionality.

If you could offer such aid it would be greatly appreciated. None of us are
Spark committers, which is hindering our ability to deliver this project in
time for Spark 3.0.

Sincerely
the Neo4j Graph Analytics team
Mats, Martin, Max, Sören, Jonatan

Re: SparkGraph review process

Posted by kant kodali <ka...@gmail.com>.
Hi Sean,

In that case, can we have GraphFrames as part of the Spark release? A
separate release would also be fine. Currently, I don't see any releases of
GraphFrames.

Thanks


On Fri, Feb 14, 2020 at 9:06 AM Sean Owen <sr...@gmail.com> wrote:

> This will not be Spark 3.0, no.

Re: SparkGraph review process

Posted by Sean Owen <sr...@gmail.com>.
This will not be Spark 3.0, no.

On Fri, Feb 14, 2020 at 1:12 AM kant kodali <ka...@gmail.com> wrote:
>
> any update on this? Is spark graph going to make it into Spark or no?


Re: SparkGraph review process

Posted by kant kodali <ka...@gmail.com>.
Any update on this? Is SparkGraph going to make it into Spark or not?

On Mon, Oct 14, 2019 at 12:26 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> Maybe let’s ask the folks from Lightbend who helped with the previous
> Scala upgrade for their thoughts?

Re: SparkGraph review process

Posted by Holden Karau <ho...@pigscanfly.ca>.
Maybe let’s ask the folks from Lightbend who helped with the previous Scala
upgrade for their thoughts?

On Mon, Oct 14, 2019 at 8:24 PM Xiao Li <ga...@gmail.com> wrote:

>> 1. On the technical side, my main concern is the runtime dependency on
>> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
>> came up with the solution of shading a few Scala libraries to avoid
>> pollution. However, I'm not super confident that the approach is
>> sustainable, for two reasons: a) there exist no proper shading libraries
>> for Scala, b) we will have to wait for upgrades from those Scala libraries
>> before we can upgrade Spark to use a newer Scala version. So it would be
>> great if some Scala experts could help review the current implementation
>> and help assess the risk.
>
>
> This concern is valid. I think we should start the vote to ensure the
> whole community is aware of the risk and takes responsibility for
> maintaining this in the long term.
>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: SparkGraph review process

Posted by Xiao Li <ga...@gmail.com>.
>
> 1. On the technical side, my main concern is the runtime dependency on
> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
> came up with the solution of shading a few Scala libraries to avoid
> pollution. However, I'm not super confident that the approach is
> sustainable, for two reasons: a) there exist no proper shading libraries
> for Scala, b) we will have to wait for upgrades from those Scala libraries
> before we can upgrade Spark to use a newer Scala version. So it would be
> great if some Scala experts could help review the current implementation
> and help assess the risk.


This concern is valid. I think we should start the vote to ensure the whole
community is aware of the risk and takes responsibility for maintaining this
in the long term.

Cheers,

Xiao



Re: SparkGraph review process

Posted by Xiangrui Meng <me...@gmail.com>.
Hi all,

I want to clarify my role first to avoid misunderstanding. I'm an
individual contributor here. My work on the graph SPIP, as well as on
other Spark features I have contributed to, is not associated with my
employer. It became quite challenging for me to keep track of the graph
SPIP work because I have less time available at home.

In retrospect, we should have involved more Spark devs and committers
early on so that there would be no single point of failure, i.e., me.
Hopefully it is not too late to fix that. I summarize my thoughts here to
help onboard other reviewers:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
came up with the solution of shading a few Scala libraries to avoid
pollution. However, I'm not super confident that the approach is
sustainable, for two reasons: a) there exist no proper shading libraries
for Scala, b) we will have to wait for upgrades from those Scala libraries
before we can upgrade Spark to use a newer Scala version. So it would be
great if some Scala experts could help review the current implementation
and help assess the risk.

2. Overloading helper methods. MLlib used to have several overloaded helper
methods for each algorithm, which later became a major maintenance burden.
Builders and setters/getters are more maintainable. I will comment again on
the PR.

3. The proposed API partitions the graph into sub-graphs, as described in
the property graph model. It is unclear to me how this would affect query
performance, because it requires the SQL optimizer to correctly recognize
data from the same source and make execution efficient.

4. The feature, although originally targeted for Spark 3.0, should not be a
Spark 3.0 release blocker because it doesn't require breaking changes. If
we miss the code freeze deadline, we can introduce a build flag to exclude
the module from the official release/distribution, and then include it by
default once the module is ready.

5. If, unfortunately, we still don't see sufficient committer reviews, I
think the best option would be to submit the work to the Apache Incubator
instead to unblock it. But maybe it is too early to discuss this option.

It would be great if other committers can offer help on the review! Really
appreciated!

Best,
Xiangrui
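
Point 2 above contrasts overloaded helper methods with builders and
setters/getters. The following is only a minimal sketch of that contrast,
not code from the SparkGraph PR or from MLlib: the class GraphLoaderSketch,
its Loader builder, and the "format" and "validate" options are all
hypothetical names invented for illustration.

```java
// Hypothetical illustration only; none of these names exist in Spark.
public class GraphLoaderSketch {

    // Style 1: overloaded static helpers.
    // Every new optional parameter needs another overload, and all of
    // them have to be kept consistent with each other.
    public static String load(String path) {
        return load(path, "parquet");
    }

    public static String load(String path, String format) {
        return load(path, format, false);
    }

    public static String load(String path, String format, boolean validate) {
        return describe(path, format, validate);
    }

    // Style 2: a builder with setters/getters.
    // A new option is one field plus one setter/getter pair; existing
    // call sites do not change.
    public static final class Loader {
        private String format = "parquet";
        private boolean validate = false;

        public Loader setFormat(String format) {
            this.format = format;
            return this;
        }

        public Loader setValidate(boolean validate) {
            this.validate = validate;
            return this;
        }

        public String getFormat() {
            return format;
        }

        public boolean getValidate() {
            return validate;
        }

        public String load(String path) {
            return describe(path, format, validate);
        }
    }

    // Stand-in for real loading work: just reports the effective settings.
    private static String describe(String path, String format, boolean validate) {
        return path + " [format=" + format + ", validate=" + validate + "]";
    }

    public static void main(String[] args) {
        System.out.println(load("/data/graph", "csv", true));
        System.out.println(
            new Loader().setFormat("csv").setValidate(true).load("/data/graph"));
    }
}
```

Both calls in main resolve to the same settings. The difference shows up
when a fourth option is added: the overloaded style needs yet another
method signature, while the builder only gains one more setter.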
