Posted to user@spark.apache.org by Traku traku <tr...@gmail.com> on 2018/05/28 23:22:23 UTC

Pandas UDF for PySpark error. Big Dataset

Hi.

I'm trying to use the new Pandas UDF feature, but I can't use it with a big
dataset (about 5 million rows).

I tried increasing executor memory, driver memory, and the number of
partitions, but none of these solved the problem.

One of the executor tasks keeps growing its shuffle memory until it fails.

The error is generated by Arrow: unable to expand the buffer.

Any idea?
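[Editor's note: the poster's actual code is not shown in the thread. For context, a scalar pandas_udf of the kind likely involved might look like the sketch below; the function name, column names, and types are illustrative assumptions.]

```python
import pandas as pd

# Illustrative only -- not the original poster's code.
# A scalar Pandas UDF receives one pd.Series per Arrow record batch; a very
# large batch can exhaust Arrow's buffer ("unable to expand the buffer").

def plus_one(values: pd.Series) -> pd.Series:
    # The function Spark would invoke once per Arrow batch.
    return values + 1.0

# With PySpark (requires a running SparkSession; shown here as comments):
# from pyspark.sql.functions import pandas_udf
# plus_one_udf = pandas_udf(plus_one, "double")
# df = df.withColumn("y", plus_one_udf(df["x"]))
```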

Re: Pandas UDF for PySpark error. Big Dataset

Posted by Bryan Cutler <cu...@gmail.com>.
Can you share some of the code used, or at least the pandas_udf plus the
stack trace?  Also, does decreasing your dataset size fix the OOM?
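[Editor's note: one commonly suggested mitigation for Arrow buffer OOMs with Pandas UDFs in Spark 2.3+ is lowering the Arrow record batch size, so each transfer to the Python worker materializes a smaller buffer. The value 2000 below is an illustrative assumption, not a recommendation from this thread.]

```
# spark-defaults.conf (or pass with --conf on spark-submit)
# Default is 10000 records per batch; lowering it trades throughput for memory.
spark.sql.execution.arrow.maxRecordsPerBatch  2000
```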
