Posted to user@spark.apache.org by askformore <as...@163.com> on 2015/07/30 04:13:09 UTC

help plz! how to use zipWithIndex to each subset of a RDD

I have some data like this:

RDD[(String, String)] = ((key-1, a), (key-1, b), (key-2, a), (key-2, c), (key-3, b), (key-4, d))

I want to group the data by key and, for each group, add an index field to each group member, so that in the end the data looks like this:

RDD[(String, Int, String)] = ((key-1, 1, a), (key-1, 2, b), (key-2, 1, a), (key-2, 2, c), (key-3, 1, b), (key-4, 1, d))

I tried groupByKey first, which gave me an RDD[(String, Iterable[String])], but I don't know how to apply the zipWithIndex function to each Iterable... thanks.
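
For context, this is roughly where I am stuck (a sketch, assuming sc is an existing SparkContext):

val rdd = sc.parallelize(Seq(
  ("key-1", "a"), ("key-1", "b"),
  ("key-2", "a"), ("key-2", "c"),
  ("key-3", "b"), ("key-4", "d")))

// groupByKey gives me an RDD[(String, Iterable[String])]
val grouped = rdd.groupByKey()

// grouped.zipWithIndex() would number the groups (one index per key),
// not the values inside each group, so it is not what I need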




Re: help plz! how to use zipWithIndex to each subset of a RDD

Posted by askformore <as...@163.com>.
Hi @rok, thanks, I got it.






Re: help plz! how to use zipWithIndex to each subset of a RDD

Posted by rok <ro...@gmail.com>.
zipWithIndex on the RDD gives you global indices across the whole RDD, which is not what you want. After groupByKey, you'll want to use flatMap with a function that iterates through each Iterable and returns a (String, Int, String) tuple for each element.
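
For example, a minimal sketch of that approach (untested; it assumes rdd is your original RDD[(String, String)] and that you want 1-based indices as in your example):

val result = rdd.groupByKey().flatMap { case (key, values) =>
  // zipWithIndex here runs on the Iterable of one group, so indices restart per key
  values.zipWithIndex.map { case (value, idx) => (key, idx + 1, value) }
}

Note that groupByKey does not guarantee the order of values within a group, so the index assignment is arbitrary unless you sort the values first.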


Re: help plz! how to use zipWithIndex to each subset of a RDD

Posted by Jeff Zhang <zj...@gmail.com>.
This may be what you want:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("test")
val sc = new SparkContext(conf)

val inputRdd = sc.parallelize(Array(("key_1", "a"), ("key_1", "b"),
  ("key_2", "c"), ("key_2", "d")))

// Group the values by key, then zip each group's values with a per-group index.
val result = inputRdd.groupByKey().flatMap { e =>
  val key = e._1
  val valuesWithIndex = e._2.zipWithIndex // index starts at 0 within each group
  valuesWithIndex.map(value => (key, value._2, value._1))
}

result.collect() foreach println


/// output

(key_2,0,c)
(key_2,1,d)
(key_1,0,a)
(key_1,1,b)
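
The indices start at 0 because Iterable's zipWithIndex is 0-based; if you want them to start at 1 as in your example, you could shift by one (a small, untested tweak to the map above):

valuesWithIndex.map(value => (key, value._2 + 1, value._1))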




-- 
Best Regards

Jeff Zhang

Re: help plz! how to use zipWithIndex to each subset of a RDD

Posted by ayan guha <gu...@gmail.com>.
Is there a relationship between the data and the index? I.e., do a, b, c map to 1, 2, 3?