Posted to user@spark.apache.org by Gourav Sengupta <go...@gmail.com> on 2018/10/01 16:22:19 UTC

Re: Pyspark Partitioning

Hi,

the simplest option is to create UDFs from these different functions and
then dispatch to them with a CASE statement (or similar) in SQL. This is
fairly low tech; if your conditions are based on record values that are even
more granular, why not use a single UDF and let the conditions be handled
inside it.
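
For illustration, a minimal sketch of that idea (the function bodies, the
view name and the sample data below are placeholders I made up, not code
from this thread):

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical per-group transformations, registered as SQL-callable UDFs.
def transform_group_1(points):
    return float(points) * 2.0

def transform_other(points):
    return float(points) + 10.0

spark.udf.register("transform_group_1", transform_group_1, DoubleType())
spark.udf.register("transform_other", transform_other, DoubleType())

df = spark.createDataFrame(
    [(1, "id1", 3.0), (2, "id2", 5.0)],
    ["Group_Id", "Id", "Points"],
)
df.createOrReplaceTempView("points_table")

# Dispatch to the right UDF per record with a CASE statement.
result = spark.sql("""
    SELECT Group_Id, Id,
           CASE WHEN Group_Id = 1 THEN transform_group_1(Points)
                ELSE transform_other(Points)
           END AS Transformed
    FROM points_table
""")
result.show()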

But I think that UDFs are not that performant unless you use Scala.

It will be interesting to see if there are other scalable options (which
are not RDD based) from the group.
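
As a sketch of one such option (assuming Spark 2.3+ with PyArrow available),
a grouped-map pandas UDF runs an arbitrary Python function once per
Group_Id; the output schema and the body here are only illustrative:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Output schema of the per-group function; adjust to the real columns.
@pandas_udf("Group_Id long, Id string, Points double", PandasUDFType.GROUPED_MAP)
def per_group(pdf):
    # pdf is a pandas DataFrame containing every row of one Group_Id.
    pdf["Points"] = pdf["Points"] * 2.0  # placeholder for the real per-group logic
    return pdf

result = df.groupby("Group_Id").apply(per_group)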

Regards,
Gourav Sengupta

On Sun, Sep 30, 2018 at 7:31 PM dimitris plakas <di...@gmail.com>
wrote:

> Hello everyone,
>
> I am trying to split a dataframe into partitions and I want to apply a
> custom function on every partition. More precisely, I have a dataframe like
> the one below
>
> Group_Id | Id  | Points
> 1        | id1 | Point1
> 2        | id2 | Point2
>
> I want to have a partition for every Group_Id and apply a function defined
> by me on every partition.
> I have tried partitionBy('Group_Id').mapPartitions() but I receive an
> error.
> Could you please advise me how to do it?
>
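
For reference, the direction the question describes can be expressed by
repartitioning on the column and then calling mapPartitions on the
underlying RDD. This is a rough, RDD-based sketch with placeholder logic,
not code from the thread:

from itertools import groupby

def process_partition(rows):
    # repartition("Group_Id") puts all rows of a group into the same partition,
    # but one partition may still hold several groups, so group them here.
    rows = sorted(rows, key=lambda r: r["Group_Id"])
    for gid, group_rows in groupby(rows, key=lambda r: r["Group_Id"]):
        for r in group_rows:
            yield (gid, r["Id"], r["Points"])  # placeholder per-group logic

result_rdd = df.repartition("Group_Id").rdd.mapPartitions(process_partition)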