Posted to issues@arrow.apache.org by "Andrew Redd (Jira)" <ji...@apache.org> on 2020/05/07 14:13:00 UTC

[jira] [Updated] (ARROW-8731) Error when using toPandas with PyArrow

     [ https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Redd updated ARROW-8731:
-------------------------------
    Description: 
I'm getting the following error when calling toPandas on a Spark DataFrame.
 * This is a blocker to our use of PyArrow on a project

 
{code:java}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-e2ed63d96b43> in <module>
----> 1 df.limit(100).toPandas()

/venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in toPandas(self)
   2119                         _check_dataframe_localize_timestamps
   2120                     import pyarrow
-> 2121                     batches = self._collectAsArrow()
   2122                     if len(batches) > 0:
   2123                         table = pyarrow.Table.from_batches(batches)

/venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in _collectAsArrow(self)
   2177         with SCCallSiteSync(self._sc) as css:
   2178             sock_info = self._jdf.collectAsArrowToPython()
-> 2179         return list(_load_from_socket(sock_info, ArrowStreamSerializer()))
   2180 
   2181     ##########################################################################################

/venv/lib/python3.6/site-packages/pyspark/rdd.py in _load_from_socket(sock_info, serializer)
    142 
    143 def _load_from_socket(sock_info, serializer):
--> 144     (sockfile, sock) = local_connect_and_auth(*sock_info)
    145     # The RDD materialization time is unpredicable, if we set a timeout for socket reading
    146     # operation, it will very possibly fail. See SPARK-18281.

TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
{code}
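The failure at the bottom of the traceback can be reproduced in isolation: star-unpacking a three-element sock_info into a function that accepts only two positional arguments. One plausible cause (an assumption, not confirmed in this report) is a version mismatch, where the cluster's Spark JVM returns a three-element sock_info while the pip-installed pyspark 2.4.0 client's local_connect_and_auth still expects two. A minimal sketch with placeholder values, using a hypothetical stand-in for the pyspark helper:

{code:python}
# Hypothetical stand-in for pyspark 2.4.0's local_connect_and_auth,
# which takes exactly two positional arguments (port, auth_secret).
def local_connect_and_auth(port, auth_secret):
    return ("sockfile", "sock")

# Assumed shape of sock_info from a newer Spark JVM; the values are
# placeholders, not real ports or secrets.
sock_info = (55555, "auth-secret", "jvm-server-handle")

try:
    # Star-unpacking three elements into a two-parameter function
    # raises the same TypeError shown in the traceback above.
    local_connect_and_auth(*sock_info)
except TypeError as err:
    print(err)
    # -> local_connect_and_auth() takes 2 positional arguments but 3 were given
{code}

If the version-mismatch theory holds, aligning the pip pyspark version with the Spark version running on the cluster, or temporarily disabling Arrow-based collection with spark.conf.set("spark.sql.execution.arrow.enabled", "false") before calling toPandas, are candidate workarounds; both are untested assumptions here, not confirmed fixes.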

  was:
I'm getting the following error when calling toPandas on a spark dataframe
 * This is a blocker to our use of pyarrow on a project

 
{code:java}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-e2ed63d96b43> in <module>
----> 1 s.load_table_to_df('csn_customer.tblcustomerpro').limit(100).toPandas()

/venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in toPandas(self)
   2119                         _check_dataframe_localize_timestamps
   2120                     import pyarrow
-> 2121                     batches = self._collectAsArrow()
   2122                     if len(batches) > 0:
   2123                         table = pyarrow.Table.from_batches(batches)

/venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in _collectAsArrow(self)
   2177         with SCCallSiteSync(self._sc) as css:
   2178             sock_info = self._jdf.collectAsArrowToPython()
-> 2179         return list(_load_from_socket(sock_info, ArrowStreamSerializer()))
   2180 
   2181     ##########################################################################################

/venv/lib/python3.6/site-packages/pyspark/rdd.py in _load_from_socket(sock_info, serializer)
    142 
    143 def _load_from_socket(sock_info, serializer):
--> 144     (sockfile, sock) = local_connect_and_auth(*sock_info)
    145     # The RDD materialization time is unpredicable, if we set a timeout for socket reading
    146     # operation, it will very possibly fail. See SPARK-18281.

TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
{code}


> Error when using toPandas with PyArrow
> --------------------------------------
>
>                 Key: ARROW-8731
>                 URL: https://issues.apache.org/jira/browse/ARROW-8731
>             Project: Apache Arrow
>          Issue Type: Bug
>         Environment: Python environment on both the driver and workers:
> - jupyter==1.0.0
> - pandas==1.0.3
> - pyarrow==0.14.0
> - pyspark==2.4.0
> - py4j==0.10.7
>            Reporter: Andrew Redd
>            Priority: Blocker



--
This message was sent by Atlassian Jira
(v8.3.4#803005)