Posted to issues@arrow.apache.org by "Thibaud Nesztler (JIRA)" <ji...@apache.org> on 2019/06/20 14:39:00 UTC

[jira] [Comment Edited] (ARROW-5665) [Python] ArrowInvalid on converting Pandas Series with dtype float64

    [ https://issues.apache.org/jira/browse/ARROW-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868579#comment-16868579 ] 

Thibaud Nesztler edited comment on ARROW-5665 at 6/20/19 2:38 PM:
------------------------------------------------------------------

[~jorisvandenbossche] I hadn't understood that a Series object was being used as a cell value rather than a plain float64.
I have been looking at the info() output of the DataFrame, and it always reported float64 as the column dtype, not object.

This is not intentional: the entire column (fact_value in this example) should be a single list / Series of floats, with one float per row.

If I come across a reproducible example, I'll be happy to share it, but in the meantime, with your help, I will try to debug the issue.
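
As a starting point for that debugging, here is a rough sketch of a check that scans a column's cells and reports any that are not plain numbers. The helper name find_non_scalar_cells is hypothetical (it is not part of pandas or pyarrow), and the sketch assumes that iterating over the column yields one cell value at a time, which should hold for a pandas Series. The sample data is a stand-in list, not the real fact_value column.

```python
import numbers
from collections import Counter


def find_non_scalar_cells(values, limit=5):
    """Count cells that are neither None nor a real number.

    Hypothetical helper for illustration only. Returns a Counter of
    offending type names and up to `limit` (position, type name) examples.
    """
    type_counts = Counter()
    examples = []
    for position, value in enumerate(values):
        # Plain floats, ints and missing values are what a float column should hold.
        if value is None or isinstance(value, numbers.Real):
            continue
        type_name = type(value).__name__
        type_counts[type_name] += 1
        if len(examples) < limit:
            examples.append((position, type_name))
    return type_counts, examples


# Stand-in data: a nested list plays the role of a cell that holds a
# container (such as a Series) instead of a single float.
column = [70.699997, 73.0, [0.0, 1.0], 0.0]
counts, examples = find_non_scalar_cells(column)
print(dict(counts))
print(examples)
```

Under the iteration assumption above, the same call could be made with dataframe["fact_value"] just before to_parquet; an empty result would suggest the nested values are introduced somewhere else.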


> [Python] ArrowInvalid on converting Pandas Series with dtype float64
> --------------------------------------------------------------------
>
>                 Key: ARROW-5665
>                 URL: https://issues.apache.org/jira/browse/ARROW-5665
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Thibaud Nesztler
>            Priority: Minor
>
> {code:java}
> ('Could not convert 0 70.699997\n0 73.000000\n0 0.000000\nName: fact_value, dtype: float64 with type Series: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column fact_value with type float64'){code}
> We are seeing many intermittent errors (rerunning the same code often does not reproduce the error) when converting pandas DataFrames to Parquet files using pyarrow.
> We use this line of code for the conversion:
> {code:java}
> dataframe.to_parquet(filePath, compression="snappy", index=False){code}
> Note: `filePath` is an AWS S3 URI.
> {code:java}
> ArrowInvalid: ('Could not convert 0 70.699997\n0 73.000000\n0 0.000000\nName: fact_value, dtype: float64 with type Series: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column fact_value with type float64')
>  File "store_manager.py", line 25, in _write_files_and_partitions
>  dataframe.to_parquet(filePath, compression="snappy", index=False)
>  File "pandas/core/frame.py", line 2203, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 252, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 113, in write
>  table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
>  File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
>  names, arrays, metadata = dataframe_to_arrays(
>  File "pyarrow/pandas_compat.py", line 474, in dataframe_to_arrays
>  convert_types))
>  File "concurrent/futures/_base.py", line 586, in result_iterator
>  yield fs.pop().result()
>  File "concurrent/futures/_base.py", line 425, in result
>  return self.__get_result()
>  File "concurrent/futures/_base.py", line 384, in __get_result
>  raise self._exception
>  File "concurrent/futures/thread.py", line 57, in run
>  result = self.fn(*self.args, **self.kwargs)
>  File "pyarrow/pandas_compat.py", line 463, in convert_column
>  raise e
>  File "pyarrow/pandas_compat.py", line 457, in convert_column
>  return pa.array(col, type=ty, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>  return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
>  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>  check_status(ConvertPySequence(sequence, mask, options, &out))
>  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
>  raise ArrowInvalid(message){code}


