Posted to jira@arrow.apache.org by "Nic Crane (Jira)" <ji...@apache.org> on 2021/05/04 08:28:00 UTC

[jira] [Updated] (ARROW-12025) [Python] pyarrow read_csv works incorrectly with multilines if skiprows is present

     [ https://issues.apache.org/jira/browse/ARROW-12025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nic Crane updated ARROW-12025:
------------------------------
    Summary: [Python] pyarrow read_csv works incorrectly with multilines if skiprows is present  (was: pyarrow read_csv works incorrectly with multilines if skiprows is present)

> [Python] pyarrow read_csv works incorrectly with multilines if skiprows is present
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-12025
>                 URL: https://issues.apache.org/jira/browse/ARROW-12025
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Alexander M
>            Priority: Critical
>
> Reproducer:
> import os
>
> import pyarrow
> from pyarrow.csv import read_csv, ReadOptions
>
> print("pyarrow.__version__:", pyarrow.__version__)
> test_filename = "test.csv"
> # One header row followed by one record whose first field spans three lines.
> test_data = """col1,col2,col3,col4
> "This is a very long
> string with several
> newline characters",2,3,4
> """
> try:
>     with open(test_filename, "w") as f:
>         f.write(test_data)
>     ans_1 = read_csv(test_filename)  # works fine
>     print("ans_1: \n", ans_1)
>     # Fails: skip_rows=2 raises ArrowInvalid (see output below).
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>     print("ans_2: \n", ans_2)
> finally:
>     os.remove(test_filename)
>  
> Output:
> pyarrow.__version__: 3.0.0
> ans_1:
>  pyarrow.Table
> col1: string
> col2: int64
> col3: int64
> col4: int64
> Traceback (most recent call last):
>   File "pyarrow_bug.py", line 21, in <module>
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>   File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
>  
> Note: python version: 3.8.8, platform: Ubuntu 20.04
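
The error message is consistent with skip_rows counting physical lines rather than CSV records. That is an inference from the output above, not something confirmed against the pyarrow source. The sketch below uses only Python's standard csv module (not pyarrow) to show the difference on the same test data. The helper names skip_physical_lines and skip_logical_records are hypothetical and exist only for this illustration.

```python
import csv
import io

test_data = '''col1,col2,col3,col4
"This is a very long
string with several
newline characters",2,3,4
'''


def skip_physical_lines(text, n):
    """Drop the first n newline-terminated lines, ignoring CSV quoting."""
    return "".join(text.splitlines(keepends=True)[n:])


def skip_logical_records(text, n):
    """Drop the first n CSV records; quoted newlines stay inside their field."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return rows[n:]


# Skipping two physical lines cuts the quoted field in half, so the parser
# sees a one-column line followed by a four-column line.
remaining_text = skip_physical_lines(test_data, 2)
physical_rows = list(csv.reader(io.StringIO(remaining_text, newline="")))
print([len(row) for row in physical_rows])

# Skipping one logical record (the header) leaves the multi-line record whole.
logical_rows = skip_logical_records(test_data, 1)
print([len(row) for row in logical_rows])
```

Under this reading, the first remaining line ("string with several") would be taken as a one-column header, and the next line would then yield four columns, which matches "Expected 1 columns, got 4".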



--
This message was sent by Atlassian Jira
(v8.3.4#803005)