Posted to issues@arrow.apache.org by "Meehai (via GitHub)" <gi...@apache.org> on 2023/05/23 16:07:55 UTC

[GitHub] [arrow] Meehai opened a new issue, #35661: [C++] PyArrow's csv reader yields different results than the default pandas

Meehai opened a new issue, #35661:
URL: https://github.com/apache/arrow/issues/35661

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Platform: Ubuntu 20.04
   Version: 11.0 and 12.0 (w/ pandas 1.4.1 and 2.0.1)
   
   Hello, I've posted this issue on the pandas board as well, and they've asked me to put it here too:
   
   ```
   """
   user_id,value
   1225717802.1679841607,33
   """
   
   import pandas as pd
   a = pd.read_csv("bug.csv", dtype={"user_id": str})
   b = pd.read_csv("bug.csv", dtype={"user_id": str}, engine="pyarrow")
   
   print(a.user_id.iloc[0]) # 1225717802.1679841607
   print(b.user_id.iloc[0]) # 1225717802.1679842
   
   assert a.user_id.dtype == b.user_id.dtype # <- both are strings
   assert a.user_id.iloc[0] == b.user_id.iloc[0] # <- this fails
   ```
   
   It seems that under the hood the pyarrow CSV reader handles columns that are explicitly requested as strings differently than expected: an automatic type conversion happens before the explicit string conversion. In this case the value is first parsed as a float, loses digits because of floating-point precision, and only then is converted back to a string.
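
The suspected float round trip can be reproduced without pandas or pyarrow. The following is a minimal sketch in plain Python, assuming only that the field is parsed into a 64-bit float before being converted to a string; it does not call into either library, and the variable names are illustrative.

```python
# Illustrative only: models "parse as float64, then cast to string".
raw = "1225717802.1679841607"   # the field exactly as written in bug.csv

as_float = float(raw)           # a float64 cannot hold all 20 digits
round_tripped = str(as_float)   # shortest string that maps back to the same float

print(raw)
print(round_tripped)

# The float itself survives the round trip, but the original text does not.
assert float(round_tripped) == as_float
assert round_tripped != raw
```

Any path that goes through a float first has already lost the trailing digits, so a later cast to string cannot recover them.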
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on issue #35661: [C++] PyArrow's csv reader yields different results than the default pandas

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #35661:
URL: https://github.com/apache/arrow/issues/35661#issuecomment-1559745327

   The behaviour you notice does indeed come from the value first being parsed as a float and only afterwards being cast to string. However, if you use pyarrow's csv reader directly with the `column_types` argument, the conversion is done properly:
   
   ```
   >>> import pyarrow as pa
   >>> from pyarrow import csv
   >>> csv.read_csv("bug.csv")
   pyarrow.Table
   user_id: double
   value: int64
   ----
   user_id: [[1225717802.1679842]]
   value: [[33]]
   
   >>> csv.read_csv("bug.csv", convert_options=csv.ConvertOptions(column_types={"user_id": pa.string()}))
   pyarrow.Table
   user_id: string
   value: int64
   ----
   user_id: [["1225717802.1679841607"]]
   value: [[33]]
   ```
   
   So I assume this is actually a bug in pandas after all (in how pandas integrates with the pyarrow csv reader, and how it translates its own arguments into the arguments passed to pyarrow). I am therefore closing this issue and will re-open the one on the pandas side (https://github.com/pandas-dev/pandas/issues/53269).



[GitHub] [arrow] jorisvandenbossche closed issue #35661: [C++] PyArrow's csv reader yields different results than the default pandas

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche closed issue #35661: [C++] PyArrow's csv reader yields different results than the default pandas
URL: https://github.com/apache/arrow/issues/35661

