Posted to user@spark.apache.org by sparkuser2014 <Co...@crowdstrike.com> on 2014/12/12 06:23:51 UTC

GroupBy and nested Top on

I'm new to PySpark, so thank you for your patience in advance. My current
problem is the following:

I have an RDD composed of the fields A, B, and count =>

      # key each record by its (A, B) fields (here taken as x[0] and x[1]),
      # then sum the 1s to get a count per pair
      result1 = rdd.map(lambda x: ((x[0], x[1]), 1)).reduceByKey(lambda a, b: a + b)

Then I wanted to group the results based on 'A', so I did the following =>

      # regroup by A: each value becomes an iterable of (B, count) pairs
      result2 = result1.map(lambda x: (x[0][0], (x[0][1], x[1]))).groupByKey()

Now, my problem/challenge: with the new RDD <A, (B, count)>, I want to
sub-sort each group, take the top 50 (B, count) pairs in descending order of
count, and then print or save that top 50 for every element/instance of A.

i.e. final result =>   

A, B1, 40
A, B2, 30
A, B3, 20
A, B4, 10
A1, C1, 30
A1, C2, 20
A1, C3, 10
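
Here's my best guess at a full pipeline, in case it helps clarify what I'm
after (the sample data is made up, and I haven't verified that the heapq
approach is right at scale):

      import heapq
      from pyspark import SparkContext

      sc = SparkContext(appName="TopNPerKey")

      # hypothetical input: one (A, B) record per occurrence
      rdd = sc.parallelize([
          ("A", "B1"), ("A", "B1"), ("A", "B2"),
          ("A1", "C1"), ("A1", "C1"), ("A1", "C2"),
      ])

      # count occurrences of each (A, B) pair
      result1 = rdd.map(lambda x: ((x[0], x[1]), 1)).reduceByKey(lambda a, b: a + b)

      # regroup by A: each value is an iterable of (B, count) pairs
      result2 = result1.map(lambda x: (x[0][0], (x[0][1], x[1]))).groupByKey()

      # keep the 50 pairs with the highest count for each A
      top50 = result2.mapValues(lambda pairs: heapq.nlargest(50, pairs, key=lambda p: p[1]))

      # flatten back to (A, B, count) rows for printing or saving
      flat = top50.flatMap(lambda kv: [(kv[0], b, c) for b, c in kv[1]])
      for row in flat.collect():
          print(row)

I gather that groupByKey pulls all of a key's values onto one executor, so if
some A has a very large number of distinct Bs, aggregateByKey with a bounded
heap might be the safer route.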

Any guidance you can provide to help me solve this problem is much
appreciated! Thank you :-)
