Posted to user@spark.apache.org by madhu phatak <ph...@gmail.com> on 2015/03/16 09:31:55 UTC

MappedStream vs Transform API

Hi,
  The current implementation of the map function in Spark Streaming looks like this:

  def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
    new MappedDStream(this, context.sparkContext.clean(mapFunc))
  }

It creates an instance of MappedDStream, which is a subclass of DStream.
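
For context, a DStream subclass like this mainly has to declare its parent
streams, its slide duration, and how to compute the RDD for each batch. Here
is a minimal sketch, paraphrased from the Spark 1.x source, so treat the
exact details as approximate:

class MappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    mapFunc: T => U
  ) extends DStream[U](parent.ssc) {

  // The DStreams this one is computed from.
  override def dependencies: List[DStream[_]] = List(parent)

  // map does not change the batch interval of its parent.
  override def slideDuration: Duration = parent.slideDuration

  // For each batch time, fetch (or compute) the parent's RDD and apply
  // mapFunc to it; None means the parent produced no RDD for this batch.
  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.map[U](mapFunc))
  }
}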

The same function can also be implemented using the transform API:

  def map[U: ClassTag](mapFunc: T => U): DStream[U] =
    this.transform(rdd => rdd.map(mapFunc))
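
One detail worth noting: the explicit sparkContext.clean call from the
subclass version is not lost in this rewrite, because transform cleans the
closure it is given before using it. Roughly, again paraphrased from the
Spark 1.x source, so the exact shape may differ across versions:

def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = {
  // transform applies closure cleaning itself, so callers need not.
  val cleanedF = context.sparkContext.clean(transformFunc)
  transform((r: RDD[T], t: Time) => cleanedF(r))
}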


Both implementations look the same. If they are, is there any advantage to
having a subclass of DStream? Why can't we just use the transform API?


Regards,
Madhukara Phatak
http://datamantra.io/

Re: MappedStream vs Transform API

Posted by madhu phatak <ph...@gmail.com>.
Hi,

On Tue, Mar 17, 2015 at 2:31 PM, Tathagata Das <td...@databricks.com> wrote:

> That's not super essential, and hence hasn't been done till now. Even in
> core Spark there are MappedRDD, etc., even though all of them could be
> implemented with MapPartitionedRDD (maybe the name is wrong). So it's nice
> to maintain the consistency: MappedDStream creates MappedRDDs. :)
> That doesn't rule out doing it eventually, though. Maybe in the future, if
> we find that maintaining these different DStreams is becoming a
> maintenance burden (it isn't yet), we may collapse them to use transform.
> We did so in the Python API for exactly this reason.
>

  OK. When I was going through the source code, it was hard to work out what
the right extension points were, so I thought anyone else reading the code
might run into the same confusion. But if it's not super essential, then OK.


>
> If you are interested in contributing to Spark Streaming, I can point you
> to a number of issues where your contributions will be more valuable.
>

   That will be great.


>
> TD

Regards,
Madhukara Phatak
http://datamantra.io/

Re: MappedStream vs Transform API

Posted by Tathagata Das <td...@databricks.com>.
That's not super essential, and hence hasn't been done till now. Even in
core Spark there are MappedRDD, etc., even though all of them could be
implemented with MapPartitionedRDD (maybe the name is wrong). So it's nice
to maintain the consistency: MappedDStream creates MappedRDDs. :)
That doesn't rule out doing it eventually, though. Maybe in the future, if
we find that maintaining these different DStreams is becoming a maintenance
burden (it isn't yet), we may collapse them to use transform. We did so in
the Python API for exactly this reason.

If you are interested in contributing to Spark Streaming, I can point you
to a number of issues where your contributions will be more valuable.

TD


Re: MappedStream vs Transform API

Posted by madhu phatak <ph...@gmail.com>.
Hi,
 Thank you for the response.

 Can I submit a PR to use transform for all the functions like map, flatMap,
etc., so they are consistent with the other APIs?
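
For illustration, a hypothetical sketch of what such a change could look
like; this is not an actual patch from the thread, it just mirrors the map
rewrite above using the standard RDD filter and flatMap operations:

// Hypothetical: defining other DStream operations via transform,
// not the actual Spark implementation.
def filter(filterFunc: T => Boolean): DStream[T] =
  this.transform(rdd => rdd.filter(filterFunc))

def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U] =
  this.transform(rdd => rdd.flatMap(flatMapFunc))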

Regards,
Madhukara Phatak
http://datamantra.io/


Re: MappedStream vs Transform API

Posted by Tathagata Das <td...@databricks.com>.
It's mostly for legacy reasons. First we added MappedDStream, etc., and
later we realized we needed to expose something more generic for arbitrary
RDD-to-RDD transformations. It can be easily replaced. However, there is
slight value in having MappedDStream: it helps developers learn about
DStreams.

TD


Re: MappedStream vs Transform API

Posted by madhu phatak <ph...@gmail.com>.
Hi,
 Thanks for the response. I understand that part. But I am asking why the
internal implementation uses a subclass when it could use an existing API.
Unless there is a real difference, it feels like a code smell to me.


Regards,
Madhukara Phatak
http://datamantra.io/


RE: MappedStream vs Transform API

Posted by "Shao, Saisai" <sa...@intel.com>.
I think both ways are OK for writing a streaming job. `transform` is a more
general way to go from one DStream to another when there is no corresponding
DStream API (but there is a corresponding RDD API). Using map may be more
straightforward and easier to understand.
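
For example, a hypothetical snippet (the names are illustrative, not from
this thread): sortByKey exists on pair RDDs but has no direct DStream
counterpart, so transform is the way to reach it on every batch:

// Assume wordCounts: DStream[(String, Int)], e.g. built with map + reduceByKey.
// sortByKey is RDD-only; transform applies it to each batch's RDD.
val sortedCounts: DStream[(String, Int)] =
  wordCounts.transform(rdd => rdd.sortByKey(ascending = false))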

Thanks
Jerry
