Posted to issues@spark.apache.org by "Christopher Auston (Jira)" <ji...@apache.org> on 2022/02/25 16:03:00 UTC
[jira] [Updated] (SPARK-38331) csv parser exception when quote and escape are both double-quote and a value is just "," and column pruning enabled
[ https://issues.apache.org/jira/browse/SPARK-38331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christopher Auston updated SPARK-38331:
---------------------------------------
Description:
Workaround: disable column pruning.
Example pyspark code (from Databricks):
{noformat}
import pyspark
print(pyspark.version.__version__)

# enable column pruning (reset default value)
spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'true')

# write a two-line CSV whose third column contains only a quoted comma
dbutils.fs.put(file='/tmp/example.csv', contents='''"col1","b4_comma","comma","col4"
"","",",","x"
''', overwrite=True)

df = spark.read.csv(
    path='/tmp/example.csv'
    ,inferSchema=True
    ,header=True
    ,escape='"'
    ,multiLine=True
    ,unescapedQuoteHandling='RAISE_ERROR'
    ,mode='FAILFAST'
)

# selecting columns without b4_comma fails while pruning is enabled
ex = None
try:
    df.select(df.col1, df.comma).take(1)
except Exception as e:
    ex = e

if ex:
    print('[pruning] Exception is raised if b4_comma is NOT selected')

# selecting b4_comma together with comma succeeds
df.select(df.b4_comma, df.comma).take(1)
print('[pruning] No exception if b4_comma is selected')

# count() also fails while pruning is enabled
ex = None
try:
    df.count()
except Exception as e:
    ex = e

if ex:
    print('[pruning] Exception raised by count')

print('\ndisabling pruning\n')

# disable column pruning
spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'false')
df.select(df.col1, df.comma).take(1)
print('[no prune] No exception if b4_comma is NOT selected')
{noformat}
Output:
{noformat}
3.1.2
Wrote 47 bytes.
[pruning] Exception is raised if b4_comma is NOT selected
[pruning] No exception if b4_comma is selected
[pruning] Exception raised by count

disabling pruning

[no prune] No exception if b4_comma is NOT selected
{noformat}
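As a sanity check that does not depend on Spark, the sketch below parses the same two lines with Python's standard csv module. It only illustrates that the sample is well-formed CSV when the quote character is a double quote and an embedded quote is written as a doubled quote; it does not exercise Spark's parser or the column pruning code path. The names CSV_TEXT and parse_rows are made up for this illustration.

```python
import csv
import io

# The same two lines that the example above writes to /tmp/example.csv.
CSV_TEXT = '"col1","b4_comma","comma","col4"\n"","",",","x"\n'


def parse_rows(text):
    """Parse CSV text using '"' as the quote character.

    doublequote=True means a literal quote inside a quoted field is written
    as two quotes, which mirrors setting quote and escape to '"' in the report.
    strict=True makes the reader raise csv.Error on malformed input.
    """
    reader = csv.reader(
        io.StringIO(text), quotechar='"', doublequote=True, strict=True
    )
    return list(reader)


if __name__ == "__main__":
    header, row = parse_rows(CSV_TEXT)
    # Size of the sample in bytes, for comparison with "Wrote 47 bytes."
    print(len(CSV_TEXT.encode("utf-8")))
    # Map each header name to the parsed value of the single data row.
    print(dict(zip(header, row)))
```

With this parser the data row comes back as four fields, and the value of the comma column is the single character ",". This suggests the input itself is valid and that the exception comes from how the row is handled when columns are pruned.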
> csv parser exception when quote and escape are both double-quote and a value is just "," and column pruning enabled
> -------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-38331
> URL: https://issues.apache.org/jira/browse/SPARK-38331
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.1.2, 3.2.1
> Reporter: Christopher Auston
> Priority: Minor