Posted to issues@spark.apache.org by "Christopher Auston (Jira)" <ji...@apache.org> on 2022/02/25 16:03:00 UTC
[jira] [Updated] (SPARK-38331) csv parser exception when quote and escape are both double-quote and a value is just "," and column pruning enabled

     [ https://issues.apache.org/jira/browse/SPARK-38331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Auston updated SPARK-38331:
---------------------------------------
    Description: 
Workaround: disable column pruning.

Example pyspark code (from Databricks):
{noformat}
import pyspark
print(pyspark.version.__version__)

# enable column pruning (reset default value)
spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'true')

dbutils.fs.put(file='/tmp/example.csv', contents='''"col1","b4_comma","comma","col4"
"","",",","x"
''', overwrite=True)

df = spark.read.csv(
    path='/tmp/example.csv'
    ,inferSchema=True
    ,header=True
    ,escape='"'
    ,multiLine=True
    ,unescapedQuoteHandling='RAISE_ERROR'
    ,mode='FAILFAST'
    )
ex = None
try:
    df.select(df.col1,df.comma).take(1)
except Exception as e:
    ex = e
    
if ex:
    print('[pruning] Exception is raised if b4_comma is NOT selected')
    
df.select(df.b4_comma, df.comma).take(1)
print('[pruning] No exception if b4_comma is selected')

ex = None
try:
    df.count()
except Exception as e:
    ex = e
    
if ex:
    print('[pruning] Exception raised by count')
print('\ndisabling pruning\n')
    
    
# disable column pruning
spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'false')
df.select(df.col1,df.comma).take(1)
print('[no prune] No exception if b4_comma is NOT selected')
{noformat}
 

Output:
{noformat}
3.1.2
Wrote 47 bytes.
[pruning] Exception is raised if b4_comma is NOT selected
[pruning] No exception if b4_comma is selected
[pruning] Exception raised by count

disabling pruning

[no prune] No exception if b4_comma is NOT selected
{noformat}
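
For reference, the sample file appears to be well-formed CSV when quote and escape are both the double-quote character. The sketch below is not Spark code: it uses Python's standard csv module (an assumption made purely for illustration, since Spark uses a different parser internally) to show the four fields a parser would be expected to return for the data row, and to confirm the 47-byte size reported in the output.

```python
import csv
import io

# Same contents the report writes to /tmp/example.csv.
contents = '''"col1","b4_comma","comma","col4"
"","",",","x"
'''

# quotechar='"' with doublequote=True approximates quote == escape == '"'.
reader = csv.reader(io.StringIO(contents), quotechar='"', doublequote=True)
header, row = list(reader)

print(len(contents.encode('utf-8')))  # size in bytes
print(header)
print(row)
```

The third field of the data row is a single comma inside quotes, which is the value the issue title refers to.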

  was:
Workaround: disable column pruning.

Example pyspark code (from Databricks):
{noformat}
import pyspark
print(pyspark.version.__version__)# enable column pruning (reset default value)
spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'true')dbutils.fs.put(file='/tmp/example.csv', contents='''"col1","b4_comma","comma","col4"
"","",",","x"
''', overwrite=True)df = spark.read.csv(
    path='/tmp/example.csv'
    ,inferSchema=True
    ,header=True
    ,escape='"'
    ,multiLine=True
    ,unescapedQuoteHandling='RAISE_ERROR'
    ,mode='FAILFAST'
    )
ex = None
try:
    df.select(df.col1,df.comma).take(1)
except Exception as e:
    ex = e
    
if ex:
    print('[pruning] Exception is raised if b4_comma is NOT selected')
    
df.select(df.b4_comma, df.comma).take(1)
print('[pruning] No exception if b4_comma is selected')ex = None
try:
    df.count()
except Exception as e:
    ex = e
    
if ex:
    print('[pruning] Exception raised by count')
print('\ndisabling pruning\n')
    
    
# disable column pruning
spark.conf.set('spark.sql.csv.parser.columnPruning.enabled', 'false')
df.select(df.col1,df.comma).take(1)
print('[no prune] No exception if b4_comma is NOT selected') {noformat}
 

Output:
{noformat}
3.1.2
Wrote 47 bytes.
[pruning] Exception is raised if b4_comma is NOT selected
[pruning] No exception if b4_comma is selected
[pruning] Exception raised by count

disabling pruning

[no prune] No exception if b4_comma is NOT selected {noformat}


> csv parser exception when quote and escape are both double-quote and a value is just "," and column pruning enabled
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38331
>                 URL: https://issues.apache.org/jira/browse/SPARK-38331
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.1.2, 3.2.1
>            Reporter: Christopher Auston
>            Priority: Minor


