Posted to user@spark.apache.org by Chengi Liu <ch...@gmail.com> on 2014/02/28 19:31:40 UTC

Use pyspark for the following.

My use case:

prim_id,secondary_id,value

There are millions of primary ids but only 5 secondary ids, and any
secondary id is optional.
For example, say the secondary ids are [alpha,beta,gamma,delta,kappa]:
1,alpha,20
1,beta,22
1,gamma,25
2,alpha,1
2,delta,15
3,kappa,90

What I want is to get the following output

1,20,22,25,0,0 # since kappa and delta are not present
2,1,0,0,15,0
3,0,0,0,0,90

So basically, flatten it out.
How do I do this in pyspark?
Thanks

Re: Use pyspark for the following.

Posted by Andrew Ash <an...@andrewash.com>.
Roughly how many rows share the most-common primary id?  If that's small,
you could group by primary id and assemble the resulting row from the group.
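
A minimal PySpark sketch of that group-by approach, assuming the rows
arrive as comma-separated text with a fixed column order and that a
SparkContext sc already exists (the input path and helper names here are
made up for illustration):

SECONDARY = ["alpha", "beta", "gamma", "delta", "kappa"]
IDX = {name: i for i, name in enumerate(SECONDARY)}

def parse(line):
    # "1,alpha,20" -> ("1", ("alpha", 20))
    prim, sec, val = line.split(",")
    return (prim, (sec, int(val)))

def assemble(pairs):
    # fill a 5-slot row, leaving 0 for any missing secondary id
    row = [0] * len(SECONDARY)
    for sec, val in pairs:
        row[IDX[sec]] = val
    return row

result = sc.textFile("input.csv").map(parse).groupByKey().mapValues(assemble)
# result contains e.g. ("1", [20, 22, 25, 0, 0])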

Is it possible to have two rows with the same primary and secondary id?
Like this:

1,alpha,20
1,alpha,25

If not, you could map these to expanded-out rows and reduce by key to get
the result.

1,alpha,20
1,beta,22
1,gamma,25

<map>

1,(20,0,0,0,0)
1,(0,22,0,0,0)
1,(0,0,25,0,0)

<reduce by key>

1,(20,22,25,0,0)
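
In PySpark that could look something like this sketch (the input path,
the column order, and the helper names are assumptions, and sc is an
existing SparkContext):

SECONDARY = ["alpha", "beta", "gamma", "delta", "kappa"]
IDX = {name: i for i, name in enumerate(SECONDARY)}

def expand(line):
    # "2,delta,15" -> ("2", [0, 0, 0, 15, 0])
    prim, sec, val = line.split(",")
    vec = [0] * len(SECONDARY)
    vec[IDX[sec]] = int(val)
    return (prim, vec)

def merge(a, b):
    # element-wise sum; correct only if each (primary, secondary) pair is unique
    return [x + y for x, y in zip(a, b)]

result = sc.textFile("input.csv").map(expand).reduceByKey(merge)
# e.g. ("1", [20, 22, 25, 0, 0])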


Andrew


