Posted to user@spark.apache.org by "Bode, Meikel, NMA-CFD" <Me...@Bertelsmann.de> on 2021/05/21 11:27:51 UTC
DF blank value fill
Hi all,
My df looks like follows:
Situation:
MainKey, SubKey, Val1, Val2, Val3, ...
1, 2, a, null, c
1, 2, null, null, c
1, 3, null, b, null
1, 3, a, null, c
Desired outcome:
1, 2, a, b, c
1, 2, a, b, c
1, 3, a, b, c
1, 3, a, b, c
How could I populate/synchronize empty cells of all records with the same combination of MainKey and SubKey with the respective value of other rows with the same key combination?
A given non-null value in a column is guaranteed to be unique within the df. If a column exists, there is at least one row with a non-null value.
I am using pyspark.
Thanks for any hint,
Best
Meikel
Re: DF blank value fill
Posted by ayan guha <gu...@gmail.com>.
Hi
You can do something like this:
SELECT MainKey, SubKey,
  CASE WHEN val1 IS NULL THEN newval1 ELSE val1 END AS val1,
  CASE WHEN val2 IS NULL THEN newval2 ELSE val2 END AS val2,
  CASE WHEN val3 IS NULL THEN newval3 ELSE val3 END AS val3
FROM (
  SELECT MainKey, SubKey, val1, val2, val3,
    FIRST_VALUE(val1) OVER (PARTITION BY MainKey, SubKey
      ORDER BY val1 NULLS LAST) AS newval1,
    FIRST_VALUE(val2) OVER (PARTITION BY MainKey, SubKey
      ORDER BY val2 NULLS LAST) AS newval2,
    FIRST_VALUE(val3) OVER (PARTITION BY MainKey, SubKey
      ORDER BY val3 NULLS LAST) AS newval3
  FROM table
) x
--
Best Regards,
Ayan Guha