Posted to user@spark.apache.org by Henrique Oliveira <he...@gmail.com> on 2020/08/02 00:07:09 UTC

[Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

I have a PySpark method that applies the explode function to every array
column of a DataFrame.

from pyspark.sql.functions import explode_outer
from pyspark.sql.types import ArrayType

def explode_column(df, column):
    # Replace the named column with explode_outer(column) in the projection,
    # keeping it in its original position.
    select_cols = list(df.columns)
    col_position = select_cols.index(column)
    select_cols[col_position] = explode_outer(column).alias(column)
    return df.select(select_cols)

def explode_all_arrays(df):
    # Keep scanning the schema until no ArrayType columns remain
    # (exploding one array can expose nested arrays).
    still_has_arrays = True
    exploded_df = df

    while still_has_arrays:
        still_has_arrays = False
        for f in exploded_df.schema.fields:
            if isinstance(f.dataType, ArrayType):
                print(f"Exploding: {f}")
                still_has_arrays = True
                exploded_df = explode_column(exploded_df, f.name)

    return exploded_df

With a small number of columns to explode it works perfectly, but on
large DataFrames (~200 columns with ~40 explosions), after the method
finishes, the resulting DataFrame can't be written as Parquet.

Even a small amount of data (400 KB) fails, not during the method
itself but at the write step.

Any tips? I tried writing the DataFrame both as a table and as a
Parquet file. Storing it as a temp view works, though.

Thank you,

henrique.

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

Posted by Henrique Oliveira <he...@gmail.com>.
Thank you for both tips; I will definitely try the pandas_udfs. As for
changing the select operation, it's not possible to have multiple explode
functions in the same select; sadly, they must be applied one at a time.
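One possible workaround, not tried here: while the DataFrame API allows only one generator (explode) per select(), Spark SQL can chain several LATERAL VIEW clauses in a single statement. A minimal sketch that only builds the SQL string (the table and column names are hypothetical):

```python
# Build one SQL statement with a chained LATERAL VIEW per array column.
# "events", "tags" and "scores" are hypothetical names for illustration.
array_cols = ["tags", "scores"]
lateral_views = " ".join(
    f"LATERAL VIEW OUTER explode({c}) {c}_t AS {c}_x" for c in array_cols
)
query = f"SELECT * FROM events {lateral_views}"
# The string would then be run with spark.sql(query).
```

If this works for your Spark version, it would replace the chain of select() calls with a single SQL node in the plan.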


Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

Posted by Patrick McCarthy <pm...@dstillery.com.INVALID>.
If you use pandas_udfs in 2.4 they should be quite performant (or at least
won't suffer serialization overhead), so it might be worth looking into.

I didn't run your code, but one consideration is that the while loop might
be making the DAG a lot bigger than it has to be. You might see whether
defining those columns with list comprehensions, forming a single select()
statement, makes for a smaller DAG.
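For the pandas route, here is a pandas-only sketch of what the body of a grouped-map pandas_udf could do; the column names are hypothetical and the Spark wiring (output schema declaration, groupBy().apply(...)) is omitted:

```python
import pandas as pd  # requires pandas >= 0.25 for DataFrame.explode

def explode_batch(pdf, array_cols):
    # Explode each list-valued column of the batch, one at a time.
    for c in array_cols:
        pdf = pdf.explode(c).reset_index(drop=True)
    return pdf

batch = pd.DataFrame({"id": [1], "tags": [["a", "b"]]})
result = explode_batch(batch, ["tags"])
# result has one row per array element, with "id" repeated for each.
```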


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

Posted by Henrique Oliveira <he...@gmail.com>.
Hi Patrick, thank you for your quick response.
That's exactly what I think. Actually, the result of this processing is an
intermediate table that will be used to generate other views.
Another approach I'm trying now is to move the "explosion" step into that
"view generation" step; this way I don't need to explode every column,
only those used by the final client.

P.S. I was avoiding UDFs for now because I'm still on Spark 2.4 and the
Python UDFs I tried performed very badly, but I will give it a try in
this case. It can't be worse.
Thanks again!


Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

Posted by Patrick McCarthy <pm...@dstillery.com.INVALID>.
This seems like a very expensive operation. Why do you want to write out
all the exploded values? If you just want all combinations of values, could
you instead do it at read-time with a UDF or something?
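To put numbers on "expensive": each exploded array column multiplies the row count by the average array length, so the explosions compound. A back-of-the-envelope sketch in plain Python, with purely hypothetical numbers:

```python
# Rough cost model: every explode multiplies rows by the mean array length.
rows_in = 100        # hypothetical input row count
avg_len = 3          # hypothetical mean elements per array
explosions = 40      # array columns exploded, as in the original post

rows_out = rows_in * avg_len ** explosions
# Even a tiny input can blow up to an astronomical number of output rows.
```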


Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

Posted by hesouol <he...@gmail.com>.
I forgot to add one piece of information. By "can't write" I mean it keeps
processing and nothing happens: the job runs for hours even with a very
small file, and I have to force it to stop.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org