Posted to user@spark.apache.org by Laeeq Ahmed <la...@yahoo.com.INVALID> on 2015/03/13 11:41:46 UTC

Using rdd methods with Dstream

Hi,
I normally use dstream.transform whenever I need a method that is available in the RDD API but not in the streaming API, e.g. dstream.transform(x => x.sortByKey(true)).
But other RDD methods return types other than RDD, e.g. dstream.transform(x => x.top(5)), where top returns an Array. In that second scenario, how can I return an RDD rather than an Array, so that I can perform further steps on the DStream?
Regards,
Laeeq
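
For context, DStream.transform expects a function of type RDD[T] => RDD[U], which is why sortByKey composes with it but top does not. A minimal sketch of the two cases, assuming a hypothetical DStream[(Long, Long)] named counts:

// sortByKey maps RDD[(K, V)] to RDD[(K, V)], so it fits transform's signature:
val sorted = counts.transform(rdd => rdd.sortByKey(ascending = true))

// top(5) returns Array[(Long, Long)], not an RDD, so this does not compile:
// val top5 = counts.transform(rdd => rdd.top(5))   // found Array[...], required RDD[?]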

Re: Using rdd methods with Dstream

Posted by Laeeq Ahmed <la...@yahoo.com.INVALID>.
Hi,
repartition is expensive. I am looking for an efficient way to do this.
Regards,
Laeeq
 

     On Friday, March 13, 2015 12:24 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
   

 Like this?

dstream.repartition(1).mapPartitions(it => it.take(5))



Thanks
Best Regards
On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed <la...@yahoo.com.invalid> wrote:

Hi,
I normally use dstream.transform whenever I need a method that is available in the RDD API but not in the streaming API, e.g. dstream.transform(x => x.sortByKey(true)).
But other RDD methods return types other than RDD, e.g. dstream.transform(x => x.top(5)), where top returns an Array. In that second scenario, how can I return an RDD rather than an Array, so that I can perform further steps on the DStream?
Regards,
Laeeq



   

Re: Using rdd methods with Dstream

Posted by Laeeq Ahmed <la...@yahoo.com.INVALID>.
Thanks TD, this is what I was looking for. rdd.context.makeRDD worked.
Laeeq 
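
As a minimal sketch, the fix that worked here, applied to the topCounts example from the earlier message (the swap is only there so that top orders by count rather than by value, and the top 10 fit comfortably in a single partition):

val topCounts = counts.transform(rdd =>
  rdd.context.makeRDD(rdd.map(_.swap).top(10), 1))

Since top only brings the K largest elements back to the driver, this stays cheap as long as K is small.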


     On Friday, March 13, 2015 11:08 PM, Tathagata Das <td...@databricks.com> wrote:
   

Is the number of top K elements you want to keep small? That is, is K small? In which case, you can
1. either do it in the driver on the array
DStream.foreachRDD ( rdd => {
   val topK = rdd.top(K)
   // use top K
})
2. Or, you can use the topK to create another RDD using sc.makeRDD
DStream.transform ( rdd => {
   val topK = rdd.top(K)
   rdd.context.makeRDD(topK, numPartitions)
})
TD
 
On Fri, Mar 13, 2015 at 5:58 AM, Laeeq Ahmed <la...@yahoo.com.invalid> wrote:

Hi,
Earlier my code was like the following, but it was slow due to the repartition. I want the top K of each window in a stream.

val counts = keyAndValues.map(x => math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
val topCounts = counts.repartition(1).map(_.swap).transform(rdd => rdd.sortByKey(false)).map(_.swap).mapPartitions(rdd => rdd.take(10))

so I thought to use dstream.transform(rdd => rdd.top(10)), but this returns an Array rather than an RDD. I have to perform further steps on the topCounts DStream.

[ERROR]  found   : Array[(Long, Long)]
[ERROR]  required: org.apache.spark.rdd.RDD[?]
[ERROR]  val topCounts = counts.transform(rdd => rdd.top(10))

Regards,
Laeeq

     On Friday, March 13, 2015 1:47 PM, Sean Owen <so...@cloudera.com> wrote:
   

 Hm, aren't you able to use the SparkContext here? DStream operations
happen on the driver. So you can parallelize() the result?

take() won't work as it's not the same as top()

On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> Like this?
>
> dstream.repartition(1).mapPartitions(it => it.take(5))
>
>
>
> Thanks
> Best Regards
>
> On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed <la...@yahoo.com.invalid>
> wrote:
>>
>> Hi,
>>
>> I normally use dstream.transform whenever I need to use methods which are
>> available in RDD API but not in streaming API. e.g. dstream.transform(x =>
>> x.sortByKey(true))
>>
>> But there are other RDD methods which return types other than RDD. e.g.
>> dstream.transform(x => x.top(5)) top here returns Array.
>>
>> In the second scenario, how can i return RDD rather than array, so that i
>> can perform further steps on dstream.
>>
>> Regards,
>> Laeeq
>
>




    



  

Re: Using rdd methods with Dstream

Posted by Tathagata Das <td...@databricks.com>.
Is the number of top K elements you want to keep small? That is, is K
small? In which case, you can
1.  either do it in the driver on the array
DStream.foreachRDD ( rdd => {
   val topK = rdd.top(K) ;
   // use top K
})

2. Or, you can use the topK to create another RDD using sc.makeRDD

DStream.transform ( rdd => {
   val topK = rdd.top(K) ;
   rdd.context.makeRDD(topK, numPartitions)
})

TD
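
To tie the pieces together, a self-contained sketch of option 2 applied to the top-K-per-window use case from this thread. It is illustrative only: the socket source, application name, batch interval and checkpoint path are assumptions standing in for the original keyAndValues stream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TopKPerWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopKPerWindow").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    // countByValueAndWindow uses an inverse reduce, so checkpointing must be enabled.
    ssc.checkpoint("/tmp/topk-checkpoint")

    // Hypothetical source: one numeric value per line on a local socket.
    val values = ssc.socketTextStream("localhost", 9999)

    // Count occurrences of each rounded value over 4-second tumbling windows.
    val counts = values.map(v => math.round(v.toDouble))
      .countByValueAndWindow(Seconds(4), Seconds(4))

    // Compute the top 10 by count on the driver, then wrap the array back
    // into a single-partition RDD so the result is still a DStream.
    val topCounts = counts.transform { rdd =>
      rdd.context.makeRDD(rdd.map(_.swap).top(10), 1)
    }

    topCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}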



On Fri, Mar 13, 2015 at 5:58 AM, Laeeq Ahmed <la...@yahoo.com.invalid>
wrote:

> Hi,
>
> Earlier my code was like follwing but slow due to repartition. I want top
> K of each window in a stream.
>
> val counts = keyAndValues.map(x =>
> math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
> val topCounts = counts.repartition(1).map(_.swap).transform(rdd =>
> rdd.sortByKey(false)).map(_.swap).mapPartitions(rdd => rdd.take(10))
>
> so I thought to use dstream.transform(rdd=>rdd.top()) but this return
> Array rather than rdd. I have to perform further steps on topCounts dstream.
>
> [ERROR]  found   : Array[(Long, Long)]
> [ERROR]  required: org.apache.spark.rdd.RDD[?]
> [ERROR] val topCounts = counts.transform(rdd => rdd.top(10))
>
>
> Regards,
> Laeeq
>
>
>   On Friday, March 13, 2015 1:47 PM, Sean Owen <so...@cloudera.com> wrote:
>
>
> Hm, aren't you able to use the SparkContext here? DStream operations
> happen on the driver. So you can parallelize() the result?
>
> take() won't work as it's not the same as top()
>
> On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
> > Like this?
> >
> > dstream.repartition(1).mapPartitions(it => it.take(5))
> >
> >
> >
> > Thanks
> > Best Regards
> >
> > On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed <
> laeeqspark@yahoo.com.invalid>
> > wrote:
> >>
> >> Hi,
> >>
> >> I normally use dstream.transform whenever I need to use methods which
> are
> >> available in RDD API but not in streaming API. e.g. dstream.transform(x
> =>
> >> x.sortByKey(true))
> >>
> >> But there are other RDD methods which return types other than RDD. e.g.
> >> dstream.transform(x => x.top(5)) top here returns Array.
> >>
> >> In the second scenario, how can i return RDD rather than array, so that
> i
> >> can perform further steps on dstream.
> >>
> >> Regards,
> >> Laeeq
>
> >
> >
>
>
>
>
>
>

Re: Using rdd methods with Dstream

Posted by Laeeq Ahmed <la...@yahoo.com.INVALID>.
Hi,
Earlier my code was like the following, but it was slow due to the repartition. I want the top K of each window in a stream.

val counts = keyAndValues.map(x => math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
val topCounts = counts.repartition(1).map(_.swap).transform(rdd => rdd.sortByKey(false)).map(_.swap).mapPartitions(rdd => rdd.take(10))

so I thought to use dstream.transform(rdd => rdd.top(10)), but this returns an Array rather than an RDD. I have to perform further steps on the topCounts DStream.

[ERROR]  found   : Array[(Long, Long)]
[ERROR]  required: org.apache.spark.rdd.RDD[?]
[ERROR]  val topCounts = counts.transform(rdd => rdd.top(10))

Regards,
Laeeq

     On Friday, March 13, 2015 1:47 PM, Sean Owen <so...@cloudera.com> wrote:
   

 Hm, aren't you able to use the SparkContext here? DStream operations
happen on the driver. So you can parallelize() the result?

take() won't work as it's not the same as top()

On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> Like this?
>
> dstream.repartition(1).mapPartitions(it => it.take(5))
>
>
>
> Thanks
> Best Regards
>
> On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed <la...@yahoo.com.invalid>
> wrote:
>>
>> Hi,
>>
>> I normally use dstream.transform whenever I need to use methods which are
>> available in RDD API but not in streaming API. e.g. dstream.transform(x =>
>> x.sortByKey(true))
>>
>> But there are other RDD methods which return types other than RDD. e.g.
>> dstream.transform(x => x.top(5)) top here returns Array.
>>
>> In the second scenario, how can i return RDD rather than array, so that i
>> can perform further steps on dstream.
>>
>> Regards,
>> Laeeq
>
>




   

Re: Using rdd methods with Dstream

Posted by Sean Owen <so...@cloudera.com>.
Hm, aren't you able to use the SparkContext here? DStream operations
happen on the driver. So you can parallelize() the result?

take() won't work as it's not the same as top()
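
As a small illustration of the distinction (hypothetical values, with sc standing in for an available SparkContext): take returns the first N elements in partition order, top returns the N largest, and parallelize turns the resulting array back into an RDD.

val rdd = sc.parallelize(Seq(3, 9, 1, 7, 5), 2)
rdd.take(2)                                // Array(3, 9) -- first two in partition order
rdd.top(2)                                 // Array(9, 7) -- the two largest
val topAgain = sc.parallelize(rdd.top(2))  // back to an RDD, as suggested above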

On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> Like this?
>
> dstream.repartition(1).mapPartitions(it => it.take(5))
>
>
>
> Thanks
> Best Regards
>
> On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed <la...@yahoo.com.invalid>
> wrote:
>>
>> Hi,
>>
>> I normally use dstream.transform whenever I need to use methods which are
>> available in RDD API but not in streaming API. e.g. dstream.transform(x =>
>> x.sortByKey(true))
>>
>> But there are other RDD methods which return types other than RDD. e.g.
>> dstream.transform(x => x.top(5)) top here returns Array.
>>
>> In the second scenario, how can i return RDD rather than array, so that i
>> can perform further steps on dstream.
>>
>> Regards,
>> Laeeq
>
>



Re: Using rdd methods with Dstream

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Like this?

dstream.repartition(1).mapPartitions(it => it.take(5))
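
If this route is taken, it would presumably need to be wrapped in transform so the result stays a DStream. A sketch, with the caveat raised elsewhere in the thread that take(5) keeps an arbitrary five elements, not the five largest:

val firstFive = dstream.transform(rdd =>
  rdd.repartition(1).mapPartitions(it => it.take(5)))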



Thanks
Best Regards

On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed <la...@yahoo.com.invalid>
wrote:

> Hi,
>
> I normally use dstream.transform whenever I need to use methods which are
> available in RDD API but not in streaming API. e.g. dstream.transform(x =>
> x.sortByKey(true))
>
> But there are other RDD methods which return types other than RDD. e.g.
> dstream.transform(x => x.top(5)) top here returns Array.
>
> In the second scenario, how can i return RDD rather than array, so that i
> can perform further steps on dstream.
>
> Regards,
> Laeeq
>