You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Debabrata Ghosh <ma...@gmail.com> on 2017/10/12 16:09:31 UTC
How to flatten a row in PySpark
Hi,
Greetings !
I am having data in the format of the following row:
ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730
I want to convert it into several rows in the format below:
ABZ|ABZ|AF|2|1|730
ABZ|ABZ|AF|3+1|730
.
.
.
ABZ|ABZ|AF|3|1|730
ABZ|ABZ|AF|3|2|730
ABZ|ABZ|AF|3|3|730
.
.
.
ABZ|ABZ|AF|Y|4|730
ABZ|ABZ|AF||Y|5|730
Basically, I want to consider the various combinations of the 4th and 5th
columns (where the values are delimited by commas) and accordingly generate
the above rows from a single row. Please can you suggest me for a good way
of acheiving this. Thanks in advance !
Regards,
Debu
Re: How to flatten a row in PySpark
Posted by Debabrata Ghosh <ma...@gmail.com>.
Thanks Ayan and NIcholas for your jetfast reply ! Appreciate it a lot.
Cheers,
Debu
On Fri, Oct 13, 2017 at 9:27 AM, ayan guha <gu...@gmail.com> wrote:
> Quick pyspark code:
>
> >>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
> >>> base = sc.parallelize([s.split("|")])
> >>> base.take(10)
> [['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5',
> '730']]
>
> >>> def pv(t):
> ... x = t[3].split(",")
> ... y = t[4].split(",")
> ... for k in product(x,y):
> ... yield (t[0],t[1],k[0],k[1],t[5])
> ...
> >>> res = base.flatMap(pv)
> >>> res.take(10)
> [('ABZ', 'ABZ', '2', '1', '730'), ('ABZ', 'ABZ', '2', '2', '730'), ('ABZ',
> 'ABZ', '2', '3', '730'), ('ABZ', 'ABZ', '2', '4', '730'), ('ABZ', 'ABZ',
> '2', '5', '730'), ('ABZ', 'ABZ', '3', '1', '730'), ('ABZ', 'ABZ', '3', '2',
> '730'), ('ABZ', 'ABZ', '3', '3', '730'), ('ABZ', 'ABZ', '3', '4', '730'),
> ('ABZ', 'ABZ', '3', '5', '730')]
>
>
>
> On Fri, Oct 13, 2017 at 6:03 AM, Nicholas Hakobian <nicholas.hakobian@
> rallyhealth.com> wrote:
>
>> Using explode on the 4th column, followed by an explode on the 5th column
>> would produce what you want (you might need to use split on the columns
>> first if they are not already an array).
>>
>> Nicholas Szandor Hakobian, Ph.D.
>> Staff Data Scientist
>> Rally Health
>> nicholas.hakobian@rallyhealth.com
>>
>>
>> On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <ma...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> Greetings !
>>>
>>> I am having data in the format of the following row:
>>>
>>> ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730
>>>
>>> I want to convert it into several rows in the format below:
>>>
>>> ABZ|ABZ|AF|2|1|730
>>> ABZ|ABZ|AF|3+1|730
>>> .
>>> .
>>> .
>>> ABZ|ABZ|AF|3|1|730
>>> ABZ|ABZ|AF|3|2|730
>>> ABZ|ABZ|AF|3|3|730
>>> .
>>> .
>>> .
>>> ABZ|ABZ|AF|Y|4|730
>>> ABZ|ABZ|AF||Y|5|730
>>>
>>> Basically, I want to consider the various combinations of the 4th and
>>> 5th columns (where the values are delimited by commas) and accordingly
>>> generate the above rows from a single row. Please can you suggest me for a
>>> good way of acheiving this. Thanks in advance !
>>>
>>> Regards,
>>>
>>> Debu
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
Re: How to flatten a row in PySpark
Posted by ayan guha <gu...@gmail.com>.
Quick pyspark code:
>>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
>>> base = sc.parallelize([s.split("|")])
>>> base.take(10)
[['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5',
'730']]
>>> def pv(t):
... x = t[3].split(",")
... y = t[4].split(",")
... for k in product(x,y):
... yield (t[0],t[1],k[0],k[1],t[5])
...
>>> res = base.flatMap(pv)
>>> res.take(10)
[('ABZ', 'ABZ', '2', '1', '730'), ('ABZ', 'ABZ', '2', '2', '730'), ('ABZ',
'ABZ', '2', '3', '730'), ('ABZ', 'ABZ', '2', '4', '730'), ('ABZ', 'ABZ',
'2', '5', '730'), ('ABZ', 'ABZ', '3', '1', '730'), ('ABZ', 'ABZ', '3', '2',
'730'), ('ABZ', 'ABZ', '3', '3', '730'), ('ABZ', 'ABZ', '3', '4', '730'),
('ABZ', 'ABZ', '3', '5', '730')]
On Fri, Oct 13, 2017 at 6:03 AM, Nicholas Hakobian <
nicholas.hakobian@rallyhealth.com> wrote:
> Using explode on the 4th column, followed by an explode on the 5th column
> would produce what you want (you might need to use split on the columns
> first if they are not already an array).
>
> Nicholas Szandor Hakobian, Ph.D.
> Staff Data Scientist
> Rally Health
> nicholas.hakobian@rallyhealth.com
>
>
> On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <ma...@gmail.com>
> wrote:
>
>> Hi,
>> Greetings !
>>
>> I am having data in the format of the following row:
>>
>> ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730
>>
>> I want to convert it into several rows in the format below:
>>
>> ABZ|ABZ|AF|2|1|730
>> ABZ|ABZ|AF|3+1|730
>> .
>> .
>> .
>> ABZ|ABZ|AF|3|1|730
>> ABZ|ABZ|AF|3|2|730
>> ABZ|ABZ|AF|3|3|730
>> .
>> .
>> .
>> ABZ|ABZ|AF|Y|4|730
>> ABZ|ABZ|AF||Y|5|730
>>
>> Basically, I want to consider the various combinations of the 4th and 5th
>> columns (where the values are delimited by commas) and accordingly generate
>> the above rows from a single row. Please can you suggest me for a good way
>> of acheiving this. Thanks in advance !
>>
>> Regards,
>>
>> Debu
>>
>
>
--
Best Regards,
Ayan Guha
Re: How to flatten a row in PySpark
Posted by Nicholas Hakobian <ni...@rallyhealth.com>.
Using explode on the 4th column, followed by an explode on the 5th column
would produce what you want (you might need to use split on the columns
first if they are not already an array).
Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com
On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <ma...@gmail.com>
wrote:
> Hi,
> Greetings !
>
> I am having data in the format of the following row:
>
> ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730
>
> I want to convert it into several rows in the format below:
>
> ABZ|ABZ|AF|2|1|730
> ABZ|ABZ|AF|3+1|730
> .
> .
> .
> ABZ|ABZ|AF|3|1|730
> ABZ|ABZ|AF|3|2|730
> ABZ|ABZ|AF|3|3|730
> .
> .
> .
> ABZ|ABZ|AF|Y|4|730
> ABZ|ABZ|AF||Y|5|730
>
> Basically, I want to consider the various combinations of the 4th and 5th
> columns (where the values are delimited by commas) and accordingly generate
> the above rows from a single row. Please can you suggest me for a good way
> of acheiving this. Thanks in advance !
>
> Regards,
>
> Debu
>