Posted to user@spark.apache.org by ChengBo <Ch...@huawei.com> on 2015/10/16 21:10:59 UTC

Problem of RDD in calculation

Hi all,

I am new to Spark, and I have a question about working with RDDs.
I've converted my RDDs to DataFrames, so there are two DataFrames: DF1 and DF2.
DF1 contains: userID, time, dataUsage, duration
DF2 contains: userID

Each userID has multiple rows in DF1, and DF2 has distinct userIDs.
I would like to compute the average, max, and min of both dataUsage and duration for each userID in DF1,
and store the results in a new DataFrame.
How can I do that?
Thanks a lot.

Best
Frank

Re: Problem of RDD in calculation

Posted by Xiao Li <ga...@gmail.com>.
Hi, Frank,

After registering these DataFrames as temp tables (via the registerTempTable API),
you can do it in SQL. I believe this should be much easier.
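A minimal sketch of that route, assuming Frank's column names (userID, dataUsage, duration) and the sqlContext available in a Spark 1.x shell (registerTempTable was later replaced by createOrReplaceTempView in Spark 2.x):

```scala
// Register both DataFrames as temp tables, then aggregate in SQL.
// Column names follow the original question; df1/df2 are assumed to
// already exist as DataFrames.
df1.registerTempTable("df1")
df2.registerTempTable("df2")

val result = sqlContext.sql("""
  SELECT d1.userID,
         AVG(d1.dataUsage) AS avg_dataUsage,
         MAX(d1.dataUsage) AS max_dataUsage,
         MIN(d1.dataUsage) AS min_dataUsage,
         AVG(d1.duration)  AS avg_duration,
         MAX(d1.duration)  AS max_duration,
         MIN(d1.duration)  AS min_duration
  FROM df1 d1
  JOIN df2 d2 ON d1.userID = d2.userID
  GROUP BY d1.userID
""")
```

The join against df2 keeps only the userIDs Frank cares about, and `result` is itself a DataFrame, so it can be stored or processed further.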

Good luck,

Xiao Li


Re: Problem of RDD in calculation

Posted by Xiao Li <ga...@gmail.com>.
For most programmers, DataFrames are preferred thanks to their flexibility,
but the SQL syntax is a great option for users who feel more comfortable
with SQL. : )

2015-10-16 18:22 GMT-07:00 Ali Tajeldin EDU <al...@gmail.com>:


Re: Problem of RDD in calculation

Posted by Ali Tajeldin EDU <al...@gmail.com>.
Since DF2 only has the userID, I'm assuming you are using DF2 to filter for the desired userIDs.
You can just use the join() and groupBy() operations on DataFrame to do what you want.  For example:

scala> val df1=app.createDF("id:String; v:Integer", "X,1;X,2;Y,3;Y,4;Z,10")
df1: org.apache.spark.sql.DataFrame = [id: string, v: int]

scala> df1.show
+---+---+
| id|  v|
+---+---+
|  X|  1|
|  X|  2|
|  Y|  3|
|  Y|  4|
|  Z| 10|
+---+---+

scala> val df2=app.createDF("id:String", "X;Y")
df2: org.apache.spark.sql.DataFrame = [id: string]

scala> df2.show
+---+
| id|
+---+
|  X|
|  Y|
+---+

scala> import org.apache.spark.sql.functions.{avg, min}
import org.apache.spark.sql.functions.{avg, min}

scala> df1.join(df2, "id").groupBy("id").agg(avg("v") as "avg_v", min("v") as "min_v").show
+---+-----+-----+
| id|avg_v|min_v|
+---+-----+-----+
|  X|  1.5|    1|
|  Y|  3.5|    3|
+---+-----+-----+


Notes:
* The above uses the createDF method from SmvApp in the SMV package, but the rest of the code is standard Spark DataFrame ops.
* One advantage of doing this using DataFrame rather than SQL is that you can build the expressions programmatically (e.g. imagine doing this for 100 columns instead of 2).
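For readers without SMV, a sketch of the same example in plain Spark, including one way the "build the expressions programmatically" note might look (the metricCols list and the avg_/max_/min_ names here are made up for illustration):

```scala
import org.apache.spark.sql.functions.{avg, col, max, min}

// Plain-Spark equivalent of the createDF calls above; the spark-shell
// already has the implicits needed for toDF in scope.
val df1 = Seq(("X", 1), ("X", 2), ("Y", 3), ("Y", 4), ("Z", 10)).toDF("id", "v")
val df2 = Seq("X", "Y").toDF("id")

// Build the aggregate expressions programmatically: avg/max/min for
// every column in a list, instead of writing each one out by hand.
val metricCols = Seq("v") // e.g. Seq("dataUsage", "duration") in Frank's case
val aggs = metricCols.flatMap { c =>
  Seq(avg(col(c)) as s"avg_$c", max(col(c)) as s"max_$c", min(col(c)) as s"min_$c")
}

df1.join(df2, "id").groupBy("id").agg(aggs.head, aggs.tail: _*).show
```

agg() takes one Column plus varargs, hence the aggs.head / aggs.tail split; with 100 metric columns, only the metricCols list changes.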

---
Ali

