Posted to issues@arrow.apache.org by "Meehai (via GitHub)" <gi...@apache.org> on 2023/05/23 16:07:55 UTC
[GitHub] [arrow] Meehai opened a new issue, #35661: [C++] PyArrow's csv reader yields different results than the default pandas
Meehai opened a new issue, #35661:
URL: https://github.com/apache/arrow/issues/35661
### Describe the bug, including details regarding any error messages, version, and platform.
Platform: Ubuntu 20.04
Version: 11.0 and 12.0 (w/ pandas 1.4.1 and 2.0.1)
Hello, I've posted this issue on the pandas issue tracker as well, and they asked me to report it here too:
```
"""
user_id,value
1225717802.1679841607,33
"""
import pandas as pd
a = pd.read_csv("bug.csv", dtype={"user_id": str})
b = pd.read_csv("bug.csv", dtype={"user_id": str}, engine="pyarrow")
print(a.user_id.iloc[0]) # 1225717802.1679841607
print(b.user_id.iloc[0]) # 1225717802.1679842
assert a.user_id.dtype == b.user_id.dtype # <- both are strings
assert a.user_id.iloc[0] == b.user_id.iloc[0] # <- this fails
```
It seems that under the hood `pyarrow.csv.read_csv` handles a column explicitly requested as a string differently than expected: some automatic conversion happens before the explicit string conversion takes place. In this case the value is first interpreted as a float, loses digits because of limited floating-point precision, and only then is converted back to a string.
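The suspected round trip can be reproduced without pandas or pyarrow. The following is a minimal sketch in plain Python; it only assumes that the intermediate type is a 64-bit float, as the description above suggests, and it does not reproduce the reader's actual code path.

```python
raw = "1225717802.1679841607"  # the user_id exactly as written in bug.csv

as_float = float(raw)          # step 1: the text is parsed as a 64-bit float
round_tripped = str(as_float)  # step 2: the float is converted back to text

print(raw)
print(round_tripped)

# A 64-bit float needs at most 17 significant digits to be represented
# uniquely, so the 20-digit original cannot survive the round trip.
print(raw == round_tripped)    # False
```

Once the value has passed through a float, the missing digits cannot be recovered, so the string has to be requested before any numeric parsing takes place.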
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorisvandenbossche commented on issue #35661: [C++] PyArrow's csv reader yields different results than the default pandas
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35661:
URL: https://github.com/apache/arrow/issues/35661#issuecomment-1559745327
The behaviour you notice does indeed come from the value being parsed as a float and only afterwards cast to a string. However, if you use pyarrow's csv reader directly and pass the `column_types` option, the conversion is done properly:
```
>>> import pyarrow as pa
>>> from pyarrow import csv
>>> csv.read_csv("bug.csv")
pyarrow.Table
user_id: double
value: int64
----
user_id: [[1225717802.1679842]]
value: [[33]]
>>> csv.read_csv("bug.csv", convert_options=csv.ConvertOptions(column_types={"user_id": pa.string()}))
pyarrow.Table
user_id: string
value: int64
----
user_id: [["1225717802.1679841607"]]
value: [[33]]
```
So I assume this is actually a bug in pandas after all (in how pandas integrates with the pyarrow csv reader and translates its own arguments into the arguments it passes to pyarrow). I am therefore closing this issue and will re-open the one on the pandas side (https://github.com/pandas-dev/pandas/issues/53269).
[GitHub] [arrow] jorisvandenbossche closed issue #35661: [C++] PyArrow's csv reader yields different results than the default pandas
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed issue #35661: [C++] PyArrow's csv reader yields different results than the default pandas
URL: https://github.com/apache/arrow/issues/35661