Posted to dev@spark.apache.org by Marnix van den Broek <ma...@bundlesandbatches.io> on 2022/02/10 16:27:46 UTC

Help needed to locate the csv parser (for Spark bug reporting/fixing)

hi all,

Yesterday I filed a CSV parsing bug [1] against Spark that leads to incorrect
data whenever the input contains sequences similar to the one in the
report.

I wanted to take a look at the parsing logic to see if I could spot the
error, update the issue with more information, and possibly contribute a PR
with a fix, but I got completely lost navigating the dependencies in the
Spark repository. Can someone point me in the right direction?

I am looking for the CSV parser itself, which I assume is a dependency. Where does it live?

The next question might need too much knowledge of Spark internals for me to
know where to look or to understand what I'd be looking at, but I would also
like to find out whether and why the CSV parsing is implemented differently
when columns are projected, as opposed to processing the full dataframe. The
issue only occurs when projecting columns, and this inconsistency is a worry
in itself.
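As an aside, the invariant at stake here can be sketched with Python's
stdlib csv module (this is only an analogy with made-up data, not Spark's
code path): projecting columns after a full parse and selecting them during
iteration must always agree, whatever execution strategy an engine picks.

```python
import csv
import io

data = 'id,name,score\n1,"Doe, Jane",9.5\n2,"Kim, Lee",8.0\n'

# Approach A: parse every field of every row, then project columns 0 and 2.
projected_after = [[row[0], row[2]] for row in csv.reader(io.StringIO(data))]

# Approach B: select the columns while iterating (still a full, correct parse).
projected_during = [[row[i] for i in (0, 2)]
                    for row in csv.reader(io.StringIO(data))]

# The two must agree; a difference would be exactly this class of bug.
assert projected_after == projected_during
print(projected_after)  # [['id', 'score'], ['1', '9.5'], ['2', '8.0']]
```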

Many thanks,

Marnix

1. https://issues.apache.org/jira/browse/SPARK-38167

Re: Help needed to locate the csv parser (for Spark bug reporting/fixing)

Posted by Marnix van den Broek <ma...@bundlesandbatches.io>.
Thanks, Sean!

It was actually on the Catalyst side of things, but I found where column
pruning pushdown is delegated to univocity, see [1].

I've tried setting the spark configuration
*spark.sql.csv.parser.columnPruning.enabled* to *False* and this prevents
the bug from happening. I am unfamiliar with Java / Scala so I might be
misreading things, but to me everything points to a bug in univocity,
specifically in how the *selectIndexes* parser setting impacts the parsing
of the example in the bug report.
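For anyone landing on this thread with the same symptom, the workaround is a
one-line session setting (sketched here for PySpark; it assumes an existing
SparkSession named spark, and note that it disables an optimization, so
expect some cost on wide files):

```python
# Config fragment: disable CSV column pruning on an existing SparkSession,
# trading performance for correctness until the parser issue is resolved.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
```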

This means that to fix this bug, univocity must be fixed and Spark then
needs to depend on the fixed version, correct? Unless someone thinks this
analysis is off, I'll add this information to the Spark issue and file a bug
report with univocity.

1.
https://github.com/apache/spark/blob/6a59fba248359fb2614837fe8781dc63ac8fdc4c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L79
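For what it's worth, the failure mode of a skip-fields parser can be
illustrated with a deliberately naive sketch in plain Python (an analogy for
the bug class only, not univocity's actual selectIndexes implementation): if
the pruning pass counts delimiters without fully honoring quoting, a quoted
comma shifts every later column index.

```python
import csv
import io

line = 'a,"b,c",d'

# A correct full parse, followed by projection of columns 0 and 2:
full = next(csv.reader(io.StringIO(line)))  # ['a', 'b,c', 'd']
correct = [full[0], full[2]]                # ['a', 'd']

# A naive pruning parser that locates fields by splitting on commas
# before quote handling -- the kind of shortcut that corrupts data:
def naive_prune(text, keep):
    fields = text.split(",")  # wrong: ignores the quoted comma in "b,c"
    return [fields[i] for i in keep]

buggy = naive_prune(line, [0, 2])           # ['a', 'c"']

print(correct, buggy)  # ['a', 'd'] ['a', 'c"']
```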

On Thu, Feb 10, 2022 at 5:39 PM Sean Owen <sr...@gmail.com> wrote:

> It starts in org.apache.spark.sql.execution.datasources.csv.CSVDataSource.
> Yes univocity is used for much of the parsing.
> I am not sure of the cause of the bug but it does look like one indeed. In
> one case the parser is asked to read all fields, in the other, to skip one.
> The pushdown helps efficiency but something is going wrong.

Re: Help needed to locate the csv parser (for Spark bug reporting/fixing)

Posted by Sean Owen <sr...@gmail.com>.
It starts in org.apache.spark.sql.execution.datasources.csv.CSVDataSource.
Yes univocity is used for much of the parsing.
I am not sure of the cause of the bug but it does look like one indeed. In
one case the parser is asked to read all fields, in the other, to skip one.
The pushdown helps efficiency but something is going wrong.
