Posted to jira@arrow.apache.org by "Liangcai li (Jira)" <ji...@apache.org> on 2022/10/04 10:17:00 UTC

[jira] [Comment Edited] (ARROW-17912) [C++] IPC writer does not write an empty batch in the case of an empty table, which PySpark cannot handle

    [ https://issues.apache.org/jira/browse/ARROW-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612549#comment-17612549 ] 

Liangcai li edited comment on ARROW-17912 at 10/4/22 10:16 AM:
---------------------------------------------------------------

--- I think either the sender here should explicitly write out an empty batch

I don't fully get what this means. Could you share how to do it?

Do you mean the current C++ IPC supports writing out an empty table?
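
For reference, a minimal sketch of what "explicitly writing out an empty batch" might look like with the Arrow C++ stream writer. This assumes the stock arrow::ipc::MakeStreamWriter, arrow::MakeEmptyArray, and arrow::RecordBatch::Make APIs from recent Arrow C++ releases; WriteEmptyTable is just an illustrative name:

```
#include <arrow/api.h>
#include <arrow/array/util.h>  // arrow::MakeEmptyArray
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Write a stream containing a single zero-row batch, so the receiver sees
// one RecordBatch (carrying the schema) rather than zero batches.
arrow::Status WriteEmptyTable(const std::shared_ptr<arrow::Schema>& schema,
                              std::shared_ptr<arrow::io::OutputStream> sink) {
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(std::move(sink), schema));
  // Build one empty (length-0) array per field of the schema.
  std::vector<std::shared_ptr<arrow::Array>> columns;
  for (const auto& field : schema->fields()) {
    ARROW_ASSIGN_OR_RAISE(auto column, arrow::MakeEmptyArray(field->type()));
    columns.push_back(std::move(column));
  }
  auto batch = arrow::RecordBatch::Make(schema, /*num_rows=*/0, std::move(columns));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  return writer->Close();
}
```

With one zero-row batch in the stream, pa.Table.from_batches on the receiving side gets a batch that carries the schema, instead of an empty list.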

 

--- There is a slight difference between a stream with 0 batches and a stream of a single empty batch

Yeah, what I want is just to send out an empty table in such a way that the Python receiver can read it.
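
To make that difference concrete, here is a sketch of the receiving side, kept in C++ for symmetry (PySpark's pa.Table.from_batches enforces the same rule). ReadPossiblyEmptyStream is a hypothetical helper name; Open, ReadNext, and Table::FromRecordBatches are the standard Arrow C++ reader APIs:

```
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Read every batch from an IPC stream and assemble a Table. Passing the
// reader's schema keeps this working even when the stream holds 0 batches.
arrow::Result<std::shared_ptr<arrow::Table>> ReadPossiblyEmptyStream(
    std::shared_ptr<arrow::io::InputStream> input) {
  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::ipc::RecordBatchStreamReader::Open(std::move(input)));
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    batches.push_back(batch);
  }
  // Equivalent of pa.Table.from_batches(batches, schema=reader.schema):
  // without an explicit schema, an empty list of batches is an error.
  return arrow::Table::FromRecordBatches(reader->schema(), batches);
}
```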



> [C++] IPC writer does not write an empty batch in the case of an empty table, which PySpark cannot handle
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-17912
>                 URL: https://issues.apache.org/jira/browse/ARROW-17912
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Liangcai li
>            Priority: Major
>
> My current work is about the PySpark cogrouped pandas UDF. Two processes are involved: the JVM one (sender) and the Python one (receiver).
> [Spark uses the Arrow Java `ArrowStreamWriter`|https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/python/CoGroupedArrowPythonRunner.scala#L99] to serialize Arrow tables sent from the JVM process to the Python process, and ArrowStreamWriter handles empty tables correctly.
> [cuDF, however, uses the Arrow C++ RecordBatchWriter|https://github.com/rapidsai/cudf/blob/branch-22.10/java/src/main/native/src/TableJni.cpp#L254] to do the same serialization, which leads to the error below on the Python side, where [PySpark calls PyArrow *Table.from_batches*|https://github.com/apache/spark/blob/branch-3.3/python/pyspark/sql/pandas/serializers.py#L366] to deserialize the Arrow stream.
> ```
> E                     File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 297, in load_stream
> E                       [self.arrow_to_pandas(c) for c in pa.Table.from_batches(batch2).itercolumns()]
> E                     File "pyarrow/table.pxi", line 1609, in pyarrow.lib.Table.from_batches
> E                   ValueError: Must pass schema, or at least one RecordBatch
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)