Posted to jira@arrow.apache.org by "Nic Crane (Jira)" <ji...@apache.org> on 2021/05/04 08:28:00 UTC
[jira] [Updated] (ARROW-12025) [Python] pyarrow read_csv works incorrectly with multilines if skiprows is present
[ https://issues.apache.org/jira/browse/ARROW-12025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nic Crane updated ARROW-12025:
------------------------------
Summary: [Python] pyarrow read_csv works incorrectly with multilines if skiprows is present (was: pyarrow read_csv works incorrectly with multilines if skiprows is present)
> [Python] pyarrow read_csv works incorrectly with multilines if skiprows is present
> ----------------------------------------------------------------------------------
>
> Key: ARROW-12025
> URL: https://issues.apache.org/jira/browse/ARROW-12025
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Reporter: Alexander M
> Priority: Critical
>
> Reproducer:
> import os
> from pyarrow.csv import read_csv, ReadOptions
> import pyarrow
> print("pyarrow.__version__:", pyarrow.__version__)
> test_filename = "test.csv"
> # The first data row contains a quoted field that spans three physical lines.
> test_data = """col1,col2,col3,col4
> "This is a very long
> string with several
> newline characters",2,3,4
> """
> try:
>     with open(test_filename, "w") as f:
>         f.write(test_data)
>     ans_1 = read_csv(test_filename)  # works fine
>     print("ans_1: \n", ans_1)
>     # Raises ArrowInvalid (see the output below)
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>     print("ans_2: \n", ans_2)
> finally:
>     os.remove(test_filename)
>
> Output:
> pyarrow.__version__: 3.0.0
> ans_1:
> pyarrow.Table
> col1: string
> col2: int64
> col3: int64
> col4: int64
> Traceback (most recent call last):
>   File "pyarrow_bug.py", line 21, in <module>
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>   File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
>
> Note: python version: 3.8.8, platform: Ubuntu 20.04
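
One plausible reading of "Expected 1 columns, got 4" is that skip_rows=2 drops the first two physical lines of the file without honoring the quoted multi-line field. This is an assumption and is not confirmed in this report. The sketch below uses only the Python standard library to contrast the two ways of counting rows; the helper skip_physical_lines is hypothetical and is not part of pyarrow.

```python
import csv
import io

# Same data as in the reproducer above.
test_data = '''col1,col2,col3,col4
"This is a very long
string with several
newline characters",2,3,4
'''


def skip_physical_lines(text, n):
    """Hypothetical helper: drop the first n physical lines, ignoring quotes."""
    return "".join(text.splitlines(keepends=True)[n:])


# Assumed behaviour: skipping happens on physical lines, so parsing
# resumes in the middle of the quoted field.
remaining = skip_physical_lines(test_data, 2)

# Naive comma split of what is left, one physical line at a time.
field_counts = [len(line.split(",")) for line in remaining.splitlines()]
print("fields per remaining line:", field_counts)

# Quote-aware parsing for comparison: the multi-line field stays in one row.
logical_rows = list(csv.reader(io.StringIO(test_data)))
print("logical rows:", len(logical_rows))
print("fields per logical row:", [len(row) for row in logical_rows])
```

Under that assumption the first remaining line has one field and the next has four, which matches the column counts in the error message, whereas a quote-aware reader sees only two logical rows (the header and one data row), each with four fields.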