Posted to dev@spark.apache.org by guoxu1231 <gu...@gmail.com> on 2015/09/21 08:47:56 UTC
Join operation on DStreams
Hi Spark Experts,
I'm trying to use join(otherStream, [numTasks]) on DStreams, and it
must be called on two DStreams of (K, V) and (K, W) pairs.
With a plain RDD we could use keyBy(f) to build the (K, V) pairs,
but I could not find it on DStream.
My question is:
What is the expected way to build (K, V) pair in DStream?
Thanks
Shawn
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Join-operation-on-DStreams-tp14228.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: Join operation on DStreams
Posted by guoxu1231 <gu...@gmail.com>.
Thanks for the prompt reply.
May I ask why keyBy(f) is not supported on DStreams? Is there a
particular reason? Or could it be added in a future release, since
"stream.map(record => (keyFunction(record), record))" looks tedious?
I checked the Python source code; keyBy looks like a "shortcut" method,
and people may be more familiar with it.
def keyBy(self, f):
    """
    Creates tuples of the elements in this RDD by applying C{f}.

    >>> x = sc.parallelize(range(0,3)).keyBy(lambda x: x*x)
    >>> y = sc.parallelize(zip(range(0,5), range(0,5)))
    >>> [(x, list(map(list, y))) for x, y in sorted(x.cogroup(y).collect())]
    [(0, [[0], [0]]), (1, [[1], [1]]), (2, [[], [2]]), (3, [[], [3]]), (4, [[2], [4]])]
    """
    return self.map(lambda x: (f(x), x))
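For readers following the thread: keyBy is just a one-line map, so the same pairing can be written by hand on a DStream. A minimal pure-Python sketch of that pairing logic (no Spark required; the helper name key_by is hypothetical, mirroring the map(lambda x: (f(x), x)) body above):

```python
# Sketch of what RDD.keyBy does: pair each element with the key
# produced by f. Plain Python, no Spark; on a DStream the equivalent
# would be stream.map(lambda x: (f(x), x)).
def key_by(elements, f):
    return [(f(x), x) for x in elements]

pairs = key_by(range(3), lambda x: x * x)
# pairs == [(0, 0), (1, 1), (4, 2)]
```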
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Join-operation-on-DStreams-tp14228p14232.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Join operation on DStreams
Posted by Reynold Xin <rx...@databricks.com>.
stream.map(record => (keyFunction(record), record))
For future reference, this question should go to the user list, not dev
list.
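To illustrate what the suggested idiom feeds into: once both streams are keyed, DStream.join performs an inner join per batch, producing (K, (V, W)) pairs. A hedged sketch of those join semantics in plain Python (no Spark; the record values "order-a", "click-x", etc. are made-up illustration data):

```python
# Sketch of the (K, V) x (K, W) -> (K, (V, W)) inner-join semantics
# that a keyed-stream join applies within each batch. Plain Python;
# the keying step is map(record => (keyFunction(record), record)).
def inner_join(left_pairs, right_pairs):
    right = {}
    for k, w in right_pairs:
        right.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in left_pairs for w in right.get(k, [])]

orders = [("u1", "order-a"), ("u2", "order-b")]
clicks = [("u1", "click-x"), ("u3", "click-y")]
result = inner_join(orders, clicks)
# result == [("u1", ("order-a", "click-x"))]; unmatched keys are dropped
```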
On Sun, Sep 20, 2015 at 11:47 PM, guoxu1231 <gu...@gmail.com> wrote:
> Hi Spark Experts,
> I'm trying to use join(otherStream, [numTasks]) on DStreams...
> What is the expected way to build (K, V) pair in DStream?