Posted to dev@beam.apache.org by Ted Yu <yu...@gmail.com> on 2017/05/03 00:50:22 UTC

Re: Beam spark 2.x runner status

Spark 2.1.1 has been released.

Consider using the new release in this work.

Thanks

On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Cool for the PR merge, I will rebase my branch on it.
>
> Thanks !
> Regards
> JB
>
>
> On 03/29/2017 01:58 PM, Amit Sela wrote:
>
>> @Ted definitely makes sense.
>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon so any
>> deprecated Spark API issues should be resolved.
>>
>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yu...@gmail.com> wrote:
>>
>> This is what I did over HBASE-16179:
>>>
>>> -        f.call((asJavaIterator(it), conn)).iterator()
>>> +        // the return type is different in spark 1.x & 2.x, we handle both cases
>>> +        f.call(asJavaIterator(it), conn) match {
>>> +          // spark 1.x
>>> +          case iterable: Iterable[R] => iterable.iterator()
>>> +          // spark 2.x
>>> +          case iterator: Iterator[R] => iterator
>>> +        }
>>>        )
>>>
>>> FYI
>>>
>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <am...@gmail.com> wrote:
>>>
>>>> Just tried to replace dependencies and see what happens:
>>>>
>>>> Most required changes are about the runner using deprecated Spark APIs, and
>>>> after fixing them the only real issue is with the Java API for
>>>> Pair/FlatMapFunction that changed the return value to Iterator (in 1.6 it's
>>>> Iterable).
>>>>
>>>> So I'm not sure that a profile that simply sets the dependency on
>>>> 1.6.3/2.1.0 is feasible.
>>>>
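
For illustration, here is a minimal Java sketch of the same runtime bridging that the HBASE-16179 diff above does in Scala. It is a hypothetical helper (not Beam or HBase code): it normalizes whatever a FlatMapFunction produced, the Iterable of the Spark 1.x contract or the Iterator of the Spark 2.x contract, into an Iterator.

    import java.util.Iterator;

    // Hypothetical helper: accept a FlatMapFunction result from either Spark line
    // and hand back an Iterator, mirroring the Scala pattern match in the diff.
    public final class FlatMapOutputs {
      private FlatMapOutputs() {}

      @SuppressWarnings("unchecked")
      public static <T> Iterator<T> toIterator(Object flatMapResult) {
        if (flatMapResult instanceof Iterable) {
          return ((Iterable<T>) flatMapResult).iterator();  // Spark 1.x returns Iterable
        }
        if (flatMapResult instanceof Iterator) {
          return (Iterator<T>) flatMapResult;               // Spark 2.x returns Iterator
        }
        throw new IllegalArgumentException(
            "Unexpected FlatMapFunction result: " + flatMapResult.getClass());
      }
    }
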
>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <ko...@gmail.com>
>>>> wrote:
>>>>
>>>>> So, if everything is in place in Spark 2.X and we use provided
>>>>> dependencies for Spark in Beam, theoretically you can run the same code
>>>>> in 2.X without any need for a branch?
>>>>>
>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela <am...@gmail.com>:
>>>>>
>>>>>> If StreamingContext is valid and we don't have to use SparkSession, and
>>>>>> Accumulators are valid as well and we don't need AccumulatorsV2, I don't
>>>>>> see a reason this shouldn't work (which means there are still tons of
>>>>>> reasons this could break, but I can't think of them off the top of my
>>>>>> head right now).
>>>>>>
>>>>>> @JB simply add a profile for the Spark dependencies and run the tests -
>>>>>> you'll have a very definitive answer ;-) .
>>>>>> If this passes, try on a cluster running Spark 2 as well.
>>>>>>
>>>>>> Let me know if I can assist.
>>>>>>
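
As a rough illustration of the "same code on both" question, a single runner artifact could also probe at runtime which Spark Java API is on the classpath. This is only a sketch (a hypothetical class, not Beam code); a Maven profile would still decide which Spark dependency is present at build and test time:

    import java.util.Iterator;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // Hypothetical probe: the erased return type of FlatMapFunction.call is
    // Iterable on Spark 1.x and Iterator on Spark 2.x.
    public final class SparkApiProbe {
      private SparkApiProbe() {}

      public static boolean isSpark2x() {
        try {
          return Iterator.class.isAssignableFrom(
              FlatMapFunction.class.getMethod("call", Object.class).getReturnType());
        } catch (NoSuchMethodException e) {
          throw new IllegalStateException("Unexpected FlatMapFunction signature", e);
        }
      }
    }
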
>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <jb@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>>
>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>
>>>>>>> I'm a bit late on the PoC around that (I started a branch already).
>>>>>>> I will move forward over the weekend.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>
>>>>>>>> Amit, I suppose JB is talking about the RDD based version, so no need
>>>>>>>> to worry about SparkSession or different incompatible APIs.
>>>>>>>>
>>>>>>>> Remember the idea we are discussing is to have in master both the
>>>>>>>> spark 1 and spark 2 runners using the RDD based translation. At the
>>>>>>>> same time we can have a feature branch to evolve the DataSet based
>>>>>>>> translator (this one will replace the RDD based translator for spark 2
>>>>>>>> once it is mature).
>>>>>>>>
>>>>>>>> The advantages have been already discussed, as well as the possible
>>>>>>>> issues, so I think we have to see now if JB's idea is feasible and how
>>>>>>>> hard it would be to live with this while the DataSet version evolves.
>>>>>>>>
>>>>>>>> I think what we are trying to avoid is to have a long living branch
>>>>>>>> for a spark 2 runner based on RDD, because the maintenance burden
>>>>>>>> would be even worse. We would have to fight not only with the double
>>>>>>>> merge of fixes (in case the profile idea does not work), but also with
>>>>>>>> the continued evolution of Beam, and we would end up in the long
>>>>>>>> living branch mess that other runners have dealt with (e.g. the Apex
>>>>>>>> runner):
>>>>>>>>
>>>>>>>> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>> What do you think about this Amit ? Would you be ok to go with it if
>>>>>>>> JB's profile idea proves to help with the maintenance issues ?
>>>>>>>>
>>>>>>>> Ismaël
>>>>>>>>
>>>>>>>> On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> hbase-spark module doesn't use SparkSession. So situation there is
>>>>>>>>> simpler :-)
>>>>>>>>>
>>>>>>>>> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsela33@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I'm still wondering how we'll do this - it's not just different
>>>>>>>>>> implementations of the same Class, but completely different concepts
>>>>>>>>>> such as using SparkSession in Spark 2 instead of
>>>>>>>>>> SparkContext/StreamingContext in Spark 1.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have done some work over in HBASE-16179 where compatibility
>>>>>>>>>>> modules are created to isolate changes in the Spark 2.x API so that
>>>>>>>>>>> code in the hbase-spark module can be reused.
>>>>>>>>>>>
>>>>>>>>>>> FYI
>>>>>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbonofre@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>>>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: Beam spark 2.x runner status

Posted by Holden Karau <ho...@pigscanfly.ca>.
I'd love to take a look at the PR when it comes in (<3 BEAM + SPARK :)).

On Mon, Aug 21, 2017 at 11:33 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi
>
> I did a new runner supporting spark 2.1.x. I changed code for that.
>
> I'm still on vacation this week. I will send an update when back.
>
> Regards
> JB
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Beam spark 2.x runner status

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi

I did a new runner supporting spark 2.1.x. I changed code for that.

I'm still on vacation this week. I will send an update when back.

Regards
JB

On Aug 21, 2017, at 09:01, Pei HE <pe...@gmail.com> wrote:
>Any updates for upgrading to spark 2.x?
>
>I tried to replace the dependency and found a compile error from
>implementing a scala trait:
>org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not
>abstract
>and does not override abstract method
>org$apache$spark$Partition$$super$equals(java.lang.Object) in
>org.apache.spark.Partition
>
>(The spark side change was introduced in
>https://github.com/apache/spark/pull/12157.)
>
>Does anyone have ideas about this compile error?
>

Re: Beam spark 2.x runner status

Posted by Pei HE <pe...@gmail.com>.
Any updates for upgrading to spark 2.x?

I tried to replace the dependency and found a compile error from
implementing a scala trait:
org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract
and does not override abstract method
org$apache$spark$Partition$$super$equals(java.lang.Object) in
org.apache.spark.Partition

(The spark side change was introduced in
https://github.com/apache/spark/pull/12157.)

Does anyone have ideas about this compile error?
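
The error suggests the Partition trait now overrides equals() with a call to super.equals, which under the Scala 2.11 trait encoding surfaces to javac as the synthetic abstract method named in the message. A minimal, hypothetical Java sketch of a class shape that satisfies the compiler under that assumption (an illustration only, not necessarily how the Beam runner ended up resolving it):

    import org.apache.spark.Partition;

    // Hypothetical partition implementation compiled against Spark 2.x.
    public class SourcePartitionSketch implements Partition {
      private final int index;

      public SourcePartitionSketch(int index) { this.index = index; }

      @Override
      public int index() { return index; }

      // The member named in the compile error; '$' is a legal character in Java
      // identifiers, and delegating to Object.equals mirrors what the trait expects.
      public boolean org$apache$spark$Partition$$super$equals(Object other) {
        return super.equals(other);
      }

      @Override
      public int hashCode() { return index; }
    }
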


On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Ted,
>
> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>
> As discussed with Aviem, I should be able to create the pull request later
> today.
>
> Regards
> JB
>
>

Re: Beam spark 2.x runner status

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Ted,

My branch used Spark 2.1.0 and I just updated to 2.1.1.

As discussed with Aviem, I should be able to create the pull request later today.

Regards
JB

On 05/03/2017 02:50 AM, Ted Yu wrote:
> Spark 2.1.1 has been released.
>
> Consider using the new release in this work.
>
> Thanks
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com