Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/07/01 08:32:00 UTC

[jira] [Commented] (SPARK-39654) parameters quotechar and escapechar need to be limited to a single character in the pyspark pandas read_csv function

    [ https://issues.apache.org/jira/browse/SPARK-39654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561358#comment-17561358 ] 

Apache Spark commented on SPARK-39654:
--------------------------------------

User 'bzhaoopenstack' has created a pull request for this issue:
https://github.com/apache/spark/pull/37044

> parameters quotechar and escapechar need to be limited to a single character in the pyspark pandas read_csv function
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39654
>                 URL: https://issues.apache.org/jira/browse/SPARK-39654
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>         Environment: pyspark pandas: master
> OS: Ubuntu 1804
> Python version: 3.8.14
> pandas version: 1.4.2
>            Reporter: bo zhao
>            Priority: Minor
>
> pyspark pandas and pandas behave differently when quotechar or escapechar is a single blank (space) string: pandas raises an error, while pyspark pandas silently mis-parses the file. We should keep the same behavior as pandas, even though the backend DataFrame reader accepts this input.
>  
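> A minimal sketch of the kind of single-character validation the title asks for (hypothetical helper; the actual change is in the linked pull request and may differ):
> {code:java}
> def _validate_single_char(name, value):
>     # Hypothetical helper: reject anything that is not exactly one
>     # character, mirroring the restriction pandas places on
>     # quotechar and escapechar. Error type/message are assumptions.
>     if value is not None and (not isinstance(value, str) or len(value) != 1):
>         raise ValueError('"%s" must be a 1-character string' % name)
>
> # read_csv would call this before passing the options on to Spark, e.g.:
> # _validate_single_char("quotechar", quotechar)
> # _validate_single_char("escapechar", escapechar)
> {code}
>  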
> Test case (test3.csv):
> {code:java}
> "column1","column2", "column3", "column4", "column5", "column6" "AM", 7, "1", "SD", "SD", "CR" "AM", 8, "1,2 ,3", "PR, SD,SD", "PR ; , SD,SD", "PR , ,, SD ,SD" "AM", 1, "2", "SD", "SD", "SD" {code}
>  
> For quotechar
> pandas:
> {code:java}
> >>> pd.read_csv('/home/spark/test3.csv', quotechar=' ')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
>     return func(*args, **kwargs)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
>     return _read(filepath_or_buffer, kwds)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
>     return parser.read(nrows)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
>     index, columns, col_dict = self._engine.read(nrows)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
>     chunks = self._reader.read_low_memory(nrows)
>   File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
>   File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
>   File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
>   File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
> pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 2
> >>> 
>  {code}
> pyspark:
> {code:java}
> >>> sp.read_csv('/home/spark/test3.csv', quotechar=' ')
> /home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
>   warnings.warn(message, PandasAPIOnSparkAdviceWarning)
>   "column1" "column2"  "column3", "column4"  "column5", "column6"
> 0      "AM"    7, "1"            "SD", "SD"                  "CR"
> 1      "AM"     8, "1                    2                     3"
> 2      "AM"    1, "2"            "SD", "SD"                  "SD"
>  {code}
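>  
> The backend Spark CSV reader itself accepts a blank quote character, which is why pyspark pandas parses the file instead of raising like pandas does. For illustration, a plain PySpark sketch using the standard "quote" option of the CSV data source (output elided; it should show the same merged columns as above):
> {code:java}
> # In a pyspark shell, where `spark` is the active SparkSession:
> df = spark.read.option("header", True).option("quote", " ").csv("/home/spark/test3.csv")
> df.show()
> {code}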
>  
> For escapechar
> pandas:
> {code:java}
> >>> pd.read_csv('/home/spark/test3.csv', escapechar=' ')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
>     return func(*args, **kwargs)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
>     return _read(filepath_or_buffer, kwds)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
>     return parser.read(nrows)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
>     index, columns, col_dict = self._engine.read(nrows)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
>     chunks = self._reader.read_low_memory(nrows)
>   File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
>   File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
>   File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
>   File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
> pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11{code}
> pyspark:
> {code:java}
> >>> sp.read_csv('/home/spark/test3.csv', escapechar=' ')
> /home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
>   warnings.warn(message, PandasAPIOnSparkAdviceWarning)
>   column1  column2  "column3"  "column4"  "column5"  "column6"
> 0      AM      7.0        "1"       "SD"       "SD"       "CR"
> 1      AM      8.0         "1         2          3"        "PR
> 2      AM      1.0        "2"       "SD"       "SD"       "SD"
>  {code}
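>  
> A hypothetical regression test for the requested behavior (test name, alias, and expected error type are assumptions, not taken from the pull request):
> {code:java}
> import pytest
> import pyspark.pandas as ps
>
> def test_read_csv_rejects_blank_quotechar():
>     # After the fix, a blank quotechar/escapechar should fail fast
>     # (as pandas does) instead of silently mis-parsing the file.
>     with pytest.raises(Exception):
>         ps.read_csv("/home/spark/test3.csv", quotechar=" ")
> {code}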



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org