Posted to user@spark.apache.org by Debabrata Ghosh <ma...@gmail.com> on 2017/10/12 16:09:31 UTC

How to flatten a row in PySpark

Hi,
Greetings!

I have data in the format of the following row:

ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730

I want to convert it into several rows in the format below:

ABZ|ABZ|AF|2|1|730
ABZ|ABZ|AF|2|2|730
.
.
.
ABZ|ABZ|AF|3|1|730
ABZ|ABZ|AF|3|2|730
ABZ|ABZ|AF|3|3|730
.
.
.
ABZ|ABZ|AF|Y|4|730
ABZ|ABZ|AF|Y|5|730

Basically, I want to take every combination of the values in the 4th and 5th
columns (each of which is a comma-delimited list) and generate one output row
per combination from the single input row. Could you please suggest a good way
of achieving this? Thanks in advance!

Regards,

Debu

Re: How to flatten a row in PySpark

Posted by Debabrata Ghosh <ma...@gmail.com>.
Thanks, Ayan and Nicholas, for your lightning-fast replies! I appreciate it a lot.

Cheers,

Debu

On Fri, Oct 13, 2017 at 9:27 AM, ayan guha <gu...@gmail.com> wrote:

> Quick pyspark code:

Re: How to flatten a row in PySpark

Posted by ayan guha <gu...@gmail.com>.
Quick PySpark code:

>>> from itertools import product
>>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
>>> base = sc.parallelize([s.split("|")])
>>> base.take(10)
[['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5',
'730']]

>>> def pv(t):
...     # Emit one row per (4th-column value, 5th-column value) pair,
...     # keeping the other four columns unchanged.
...     x = t[3].split(",")
...     y = t[4].split(",")
...     for k in product(x, y):
...         yield (t[0], t[1], t[2], k[0], k[1], t[5])
...
>>> res = base.flatMap(pv)
>>> res.take(10)
[('ABZ', 'ABZ', 'AF', '2', '1', '730'), ('ABZ', 'ABZ', 'AF', '2', '2', '730'),
('ABZ', 'ABZ', 'AF', '2', '3', '730'), ('ABZ', 'ABZ', 'AF', '2', '4', '730'),
('ABZ', 'ABZ', 'AF', '2', '5', '730'), ('ABZ', 'ABZ', 'AF', '3', '1', '730'),
('ABZ', 'ABZ', 'AF', '3', '2', '730'), ('ABZ', 'ABZ', 'AF', '3', '3', '730'),
('ABZ', 'ABZ', 'AF', '3', '4', '730'), ('ABZ', 'ABZ', 'AF', '3', '5', '730')]
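
For checking the logic locally without a SparkContext, here is a minimal
plain-Python sketch of the same cross-product idea. It works on the raw
pipe-delimited line and returns pipe-delimited lines in the format requested in
the question. The function name flatten_row and its sep/inner_sep parameters
are made up for this illustration and are not part of any Spark API.

```python
from itertools import product


def flatten_row(line, sep="|", inner_sep=","):
    """Yield one delimited line per combination of the 4th and 5th fields.

    Assumes the line has exactly six fields, as in the example row.
    """
    t = line.split(sep)
    fourth = t[3].split(inner_sep)
    fifth = t[4].split(inner_sep)
    for a, b in product(fourth, fifth):
        # Keep columns 1-3 and 6 as they are; substitute one value each
        # for the 4th and 5th columns.
        yield sep.join([t[0], t[1], t[2], a, b, t[5]])


s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
rows = list(flatten_row(s))

print(len(rows))   # 90 (18 values in the 4th column x 5 in the 5th)
print(rows[0])     # ABZ|ABZ|AF|2|1|730
print(rows[-1])    # ABZ|ABZ|AF|Y|5|730
```

Because flatten_row is a generator over a single input line, it should also be
usable as the function passed to flatMap on an RDD of raw text lines, though
that has not been tried here.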



On Fri, Oct 13, 2017 at 6:03 AM, Nicholas Hakobian <
nicholas.hakobian@rallyhealth.com> wrote:

> Using explode on the 4th column, followed by an explode on the 5th column
> would produce what you want (you might need to use split on the columns
> first if they are not already an array).


-- 
Best Regards,
Ayan Guha

Re: How to flatten a row in PySpark

Posted by Nicholas Hakobian <ni...@rallyhealth.com>.
Using explode on the 4th column, followed by an explode on the 5th column
would produce what you want (you might need to use split on the columns
first if they are not already an array).

Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com
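
To make the split-then-explode suggestion concrete, the following plain-Python
sketch mimics what the two steps do to the example row. It is only an emulation
for illustration: split_column and explode_column are hypothetical helper names,
not Spark functions. In the DataFrame API the corresponding calls would be
split and explode from pyspark.sql.functions, applied first to the 4th column
and then to the 5th.

```python
def split_column(rows, idx, sep=","):
    """Mimic split(): replace the string at position idx with a list."""
    return [r[:idx] + [r[idx].split(sep)] + r[idx + 1:] for r in rows]


def explode_column(rows, idx):
    """Mimic explode(): emit one output row per element of the list at idx."""
    return [r[:idx] + [v] + r[idx + 1:] for r in rows for v in r[idx]]


line = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
rows = [line.split("|")]

# Explode the 4th column (index 3): 1 row becomes 18 rows.
rows = explode_column(split_column(rows, 3), 3)
after_first_explode = len(rows)

# Explode the 5th column (index 4): each of the 18 rows becomes 5 rows.
rows = explode_column(split_column(rows, 4), 4)

flattened = ["|".join(r) for r in rows]
print(after_first_explode, len(flattened))  # 18 90
print(flattened[0])                         # ABZ|ABZ|AF|2|1|730
```

Exploding one column after the other yields the same rows, in the same order,
as taking the cross product of the two value lists directly.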

