You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nasir Ali (Jira)" <ji...@apache.org> on 2019/11/06 00:39:00 UTC
[jira] [Comment Edited] (SPARK-28502) Error with struct conversion while using pandas_udf

    [ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967977#comment-16967977 ] 

Nasir Ali edited comment on SPARK-28502 at 11/6/19 12:38 AM:
-------------------------------------------------------------

[~bryanc] Sorry I had to remove my previous comment as I was in the middle of debugging this issue. I found the culprit package. My code works all fine with pyarrow==0.14.1. However, with the latest pyarrow release (0.15.1), my example code throws following exception. Please let me know if you need more information.

 
{code:java}
// code placeholder
19/11/05 18:23:17 ERROR Executor: Exception in task 13.0 in stage 5.0 (TID 13)
java.lang.IllegalArgumentException
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
 at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:547)
 at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
 at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
 at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:178)
 at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:169)
 at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:62)
 at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:89)
 at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)
 at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:437)
 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
19/11/05 18:23:17 ERROR TaskSetManager: Task 13 in stage 5.0 failed 1 times; aborting job
{code}


was (Author: nasirali):
Sorry I had to remove my previous comment as I was in the middle of debugging this issue. I found the culprit package. My code works all fine with pyarrow==0.14.1. However, with the latest pyarrow release (0.15.1), my example code throws following exception. Please let me know if you need more information.

 
{code:java}
// code placeholder
19/11/05 18:23:17 ERROR Executor: Exception in task 13.0 in stage 5.0 (TID 13)
java.lang.IllegalArgumentException
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
 at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:547)
 at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
 at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
 at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:178)
 at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:169)
 at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:62)
 at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:89)
 at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)
 at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:437)
 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
19/11/05 18:23:17 ERROR TaskSetManager: Task 13 in stage 5.0 failed 1 times; aborting job
{code}

> Error with struct conversion while using pandas_udf
> ---------------------------------------------------
>
>                 Key: SPARK-28502
>                 URL: https://issues.apache.org/jira/browse/SPARK-28502
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: OS: Ubuntu
> Python: 3.6
>            Reporter: Nasir Ali
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> What I am trying to do: Group data based on time intervals (e.g., 15 days window) and perform some operations on dataframe using (pandas) UDFs. I don't know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and error message I am getting.
>  
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
>                             (13.00, "2018-03-11T12:27:18+00:00"),
>                             (25.00, "2018-03-12T11:27:18+00:00"),
>                             (20.00, "2018-03-13T15:27:18+00:00"),
>                             (17.00, "2018-03-14T12:27:18+00:00"),
>                             (99.00, "2018-03-15T11:27:18+00:00"),
>                             (156.00, "2018-03-22T11:27:18+00:00"),
>                             (17.00, "2018-03-31T11:27:18+00:00"),
>                             (25.00, "2018-03-15T11:27:18+00:00"),
>                             (25.00, "2018-03-16T11:27:18+00:00")
>                             ],
>                            ["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
>     StructField("id", IntegerType()),
>     StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
>     # some computation
>     return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org