Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/07/01 08:32:00 UTC
[jira] [Commented] (SPARK-39654) parameters quotechar and escapechar need to be limited to a single char in the pyspark pandas read_csv function
[ https://issues.apache.org/jira/browse/SPARK-39654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561358#comment-17561358 ]
Apache Spark commented on SPARK-39654:
--------------------------------------
User 'bzhaoopenstack' has created a pull request for this issue:
https://github.com/apache/spark/pull/37044
> parameters quotechar and escapechar need to be limited to a single char in the pyspark pandas read_csv function
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-39654
> URL: https://issues.apache.org/jira/browse/SPARK-39654
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0
> Environment: pyspark pandas: master
> OS: Ubuntu 1804
> Python version: 3.8.14
> pandas version: 1.4.2
> Reporter: bo zhao
> Priority: Minor
>
> PySpark pandas behaves differently from pandas when quotechar or escapechar is a single blank string. We should keep the same behavior as pandas, even though the backing DataFrame accepts this input.
>
> test case(test3.csv):
> {code:java}
> "column1","column2", "column3", "column4", "column5", "column6"
> "AM", 7, "1", "SD", "SD", "CR"
> "AM", 8, "1,2 ,3", "PR, SD,SD", "PR ; , SD,SD", "PR , ,, SD ,SD"
> "AM", 1, "2", "SD", "SD", "SD"
> {code}
>
> For quotechar
> pandas:
> {code:java}
> >>> pd.read_csv('/home/spark/test3.csv', quotechar=' ')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
> return func(*args, **kwargs)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
> return _read(filepath_or_buffer, kwds)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
> return parser.read(nrows)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
> index, columns, col_dict = self._engine.read(nrows)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
> chunks = self._reader.read_low_memory(nrows)
> File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
> File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
> File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
> File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
> pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 2
> >>>
> {code}
> pyspark:
> {code:java}
> >>> sp.read_csv('/home/spark/test3.csv', quotechar=' ')
> /home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
> warnings.warn(message, PandasAPIOnSparkAdviceWarning)
> "column1" "column2" "column3", "column4" "column5", "column6"
> 0 "AM" 7, "1" "SD", "SD" "CR"
> 1 "AM" 8, "1 2 3"
> 2 "AM" 1, "2" "SD", "SD" "SD"
> {code}
>
> For escapechar
> pandas:
> {code:java}
> >>> pd.read_csv('/home/spark/test3.csv', escapechar=' ')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
> return func(*args, **kwargs)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
> return _read(filepath_or_buffer, kwds)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
> return parser.read(nrows)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
> index, columns, col_dict = self._engine.read(nrows)
> File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
> chunks = self._reader.read_low_memory(nrows)
> File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
> File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
> File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
> File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
> pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11
> {code}
> pyspark:
> {code:java}
> >>> sp.read_csv('/home/spark/test3.csv', escapechar=' ')
> /home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
> warnings.warn(message, PandasAPIOnSparkAdviceWarning)
> column1 column2 "column3" "column4" "column5" "column6"
> 0 AM 7.0 "1" "SD" "SD" "CR"
> 1 AM 8.0 "1 2 3" "PR
> 2 AM 1.0 "2" "SD" "SD" "SD"
> {code}
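> One plausible direction for aligning the two behaviors (a hypothetical sketch only; the actual change is in the linked pull request, which I have not reproduced here) is to validate quotechar and escapechar eagerly in the pandas-on-Spark read_csv wrapper, the way pandas' parser rejects malformed parameters up front, instead of passing them through to the Spark reader and silently producing a differently-parsed DataFrame. The function and message below are illustrative names, not Spark's real internals:
> {code:java}
def _validate_single_char(name, value):
    """Raise TypeError unless value is None or a 1-character string.

    Hypothetical helper mirroring the eager parameter check that
    pandas performs before parsing begins.
    """
    if value is not None and (not isinstance(value, str) or len(value) != 1):
        raise TypeError('"%s" must be a 1-character string' % name)
    return value


def read_csv(path, quotechar='"', escapechar=None, **kwargs):
    # Reject bad parameters before touching the backend reader, so
    # pyspark pandas fails fast like pandas instead of returning a
    # DataFrame parsed with a silently-ignored option.
    _validate_single_char("quotechar", quotechar)
    _validate_single_char("escapechar", escapechar)
    # ... delegate to the Spark-backed CSV reader here ...
> {code}
> Note that a pure length check alone would still accept a single blank space (as in the examples above); matching pandas exactly would also require surfacing the parse failure that pandas' C tokenizer raises for that input.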
--
This message was sent by Atlassian Jira
(v8.20.10#820010)