Posted to issues@spark.apache.org by "Kuba Tyszko (JIRA)" <ji...@apache.org> on 2016/12/16 21:53:58 UTC

[jira] [Comment Edited] (SPARK-18906) CSV parser should return null for empty (or with "") numeric columns.

    [ https://issues.apache.org/jira/browse/SPARK-18906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755597#comment-15755597 ] 

Kuba Tyszko edited comment on SPARK-18906 at 12/16/16 9:53 PM:
---------------------------------------------------------------

Well, in CSV a null can be either an empty field or, in this case, a dedicated value (NA), but some data providers also use an empty string ("") to indicate an empty value.

I've looked at JIRA and there were a few requests to allow multiple nullValue settings - but that seems to be a challenging task.

The patch I'm proposing here enables handling of such "empty integers" in a predictable way.

I understand this may look unclean, but unfortunately some reputable data providers do this, and there is nothing we can do to stop them.
In fact, Excel, for example, can be set to always quote columns when exporting to CSV. That can be limited to text columns only, but I don't think we can assume that users won't put numbers in a text column.
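
To show how an "always quote" export ends up with "" in numeric columns, here is a small sketch using Python's standard csv module (not Excel itself, so treat it as an analogy). The column names and rows are taken from the example in the issue description; everything else is an assumption for illustration.

```python
import csv
import io

# Write a few rows with every field quoted, similar to an
# "always quote columns" export setting.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["char", "int1", "int2"])
writer.writerow(["a", 1, 2])
writer.writerow(["NA", None, None])  # missing numbers are written as ""

exported = buf.getvalue()
print(exported)
# "char","int1","int2"
# "a","1","2"
# "NA","",""
```

The last line is exactly the kind of row that a reader expecting integers has to cope with.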

We're dealing with a completely untyped data source, so it's better to be robust.



> CSV parser should return null for empty (or with "") numeric columns.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-18906
>                 URL: https://issues.apache.org/jira/browse/SPARK-18906
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Kuba Tyszko
>            Priority: Minor
>
> Spark allows the user to set a nullValue, a specific value that is translated to null; for example, the string "NA" could be one.
> Data sources that use such a nullValue but also have other columns that may contain empty values may not be parsed correctly.
> The change resolves that by assuming that:
> when a column is inferred as numeric,
> its field will be set to null when parsing fails, for example upon seeing an empty value or an empty string.
> Example:
> +------+------+------+
> | char | int1 | int2 |
> +------+------+------+
> | a    | 1    | 2    |
> +------+------+------+
> | a    |      | 0    |
> +------+------+------+
> | NA   | ""   | ""   |
> +------+------+------+
> This example illustrates that column "char" may contain an empty value indicated as "NA", that column int1 has a "true null" (an empty field) in the second data row, and that in the last row both int1 and int2 have an empty string ("") as their value.
> In such a situation parsing will fail.
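
The rule described above can be sketched in plain Python. This is only an illustration of the proposed behavior, not Spark's implementation: NULL_VALUE, to_int_or_null and to_str_or_null are hypothetical names, and the comma-separated input is an assumed rendering of the example table.

```python
import csv
import io

NULL_VALUE = "NA"  # assumed nullValue setting, matching the example


def to_int_or_null(raw):
    """Return an int, or None for the null marker, an empty field, or any unparseable text."""
    text = raw.strip()
    if text == NULL_VALUE or text == "":
        return None
    try:
        return int(text)
    except ValueError:
        # Proposed behavior: a failed numeric parse becomes null instead of an error.
        return None


def to_str_or_null(raw):
    """Return the string unchanged, or None when it equals the null marker."""
    return None if raw == NULL_VALUE else raw


# The example table as CSV text: a blank int1 in row 2, quoted empties in row 3.
data = 'char,int1,int2\na,1,2\na,  ,0\nNA,"",""\n'

rows = [
    (to_str_or_null(r["char"]), to_int_or_null(r["int1"]), to_int_or_null(r["int2"]))
    for r in csv.DictReader(io.StringIO(data))
]
print(rows)
# [('a', 1, 2), ('a', None, 0), (None, None, None)]
```

Note that Python's csv reader hands back the same empty string for a blank field and for "", so this sketch cannot show the distinction Spark's parser makes between the two; it only shows the intended end result.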


