Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/06/20 14:31:01 UTC
[jira] [Commented] (ARROW-5665) ArrowInvalid on converting Pandas Series with dtype float64

    [ https://issues.apache.org/jira/browse/ARROW-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868572#comment-16868572 ] 

Joris Van den Bossche commented on ARROW-5665:
----------------------------------------------

[~tnesztler] Can you try to provide a reproducible example?

Based on the error message, it seems you have a column in your DataFrame that has Series objects as values in the rows. That's not supported by pyarrow.
If that is intentional, and you want to save them as a nested List type, then you need to convert the column of Series objects to a column of arrays or lists.

> ArrowInvalid on converting Pandas Series with dtype float64
> -----------------------------------------------------------
>
>                 Key: ARROW-5665
>                 URL: https://issues.apache.org/jira/browse/ARROW-5665
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Thibaud Nesztler
>            Priority: Minor
>
> {code:java}
> ('Could not convert 0 70.699997\n0 73.000000\n0 0.000000\nName: fact_value, dtype: float64 with type Series: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column fact_value with type float64'){code}
> We are experiencing a lot of random errors (will run the same code and not get the error at all) when converting Pandas Dataframe to parquet files using pyarrow.
> We use this line of code for the conversion:
> {code:java}
> dataframe.to_parquet(filePath, compression="snappy", index=False){code}
> Note: `filePath` is an AWS S3 URI.
> {code:java}
> ArrowInvalid: ('Could not convert 0 70.699997\n0 73.000000\n0 0.000000\nName: fact_value, dtype: float64 with type Series: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column fact_value with type float64')
>  File "store_manager.py", line 25, in _write_files_and_partitions
>  dataframe.to_parquet(filePath, compression="snappy", index=False)
>  File "pandas/core/frame.py", line 2203, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 252, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 113, in write
>  table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
>  File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
>  names, arrays, metadata = dataframe_to_arrays(
>  File "pyarrow/pandas_compat.py", line 474, in dataframe_to_arrays
>  convert_types))
>  File "concurrent/futures/_base.py", line 586, in result_iterator
>  yield fs.pop().result()
>  File "concurrent/futures/_base.py", line 425, in result
>  return self.__get_result()
>  File "concurrent/futures/_base.py", line 384, in __get_result
>  raise self._exception
>  File "concurrent/futures/thread.py", line 57, in run
>  result = self.fn(*self.args, **self.kwargs)
>  File "pyarrow/pandas_compat.py", line 463, in convert_column
>  raise e
>  File "pyarrow/pandas_compat.py", line 457, in convert_column
>  return pa.array(col, type=ty, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>  return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
>  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>  check_status(ConvertPySequence(sequence, mask, options, &out))
>  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
>  raise ArrowInvalid(message){code}


