Posted to issues@spark.apache.org by "natalya (JIRA)" <ji...@apache.org> on 2015/05/21 18:30:17 UTC

[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

    [ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554609#comment-14554609 ] 

natalya commented on SPARK-6189:
--------------------------------

Figuring out what is wrong is not the difficulty.  The current error message, while confusing and humorous, provides sufficient information to track down the issue.

However, if Spark simply returns an error, it will remain incompatible with certain data sets, for example those whose field names are URLs, server names, IP addresses, or e-mail addresses.  All of these necessarily contain a period, and a small subset also contain underscores.  Both solutions would prohibit direct handling of this type of data in field names, which seems like a significant restriction, and even more so once you factor in the additional restriction on compatibility with R and SQL.

Wouldn't it be better to fix the problem and allow periods?
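
To illustrate the underscore concern, here is a minimal sketch. It assumes that one of the workarounds under discussion is rewriting periods to underscores; the helper name find_rename_collisions and the sample field names are hypothetical and are not part of Spark or Pandas. It shows how two distinct field names can collapse into one after such a rewrite.

```python
from collections import defaultdict


def find_rename_collisions(names, old=".", new="_"):
    """Group original names that become identical once `old` is replaced by `new`.

    Hypothetical helper for illustration only; not a Spark or Pandas API.
    """
    grouped = defaultdict(list)
    for name in names:
        grouped[name.replace(old, new)].append(name)
    # Keep only the rewritten names that more than one original maps to.
    return {clean: originals for clean, originals in grouped.items() if len(originals) > 1}


# Made-up field names of the kind mentioned above (server names, IP addresses).
columns = ["api.example.com", "api_example.com", "10.0.0.1", "GNP"]
collisions = find_rename_collisions(columns)
print(collisions)
```

Any non-empty result means the rewrite is lossy for that data set: the original names can no longer be told apart afterwards.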

> Pandas to DataFrame conversion should check field names for periods
> -------------------------------------------------------------------
>
>                 Key: SPARK-6189
>                 URL: https://issues.apache.org/jira/browse/SPARK-6189
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Issue I ran into:  I imported an R dataset in CSV format into a Pandas DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R dataset had a column with a period in it (column "GNP.deflator" in the "longley" dataset).  When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP.
> Also, since "GNP" is another field's name, it gives an error which could be obscure to users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and warn/fail gracefully.
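
The "check during loading and warn/fail gracefully" option could look roughly like the sketch below. This is not Spark's implementation: the function check_field_names and its on_period parameter are hypothetical, and the check is written against a plain list of names (for a Pandas DataFrame one would pass its column labels).

```python
import warnings


def check_field_names(names, on_period="warn"):
    """Return the field names that contain a period, warning or failing if any exist.

    Hypothetical pre-conversion check for illustration; not a Spark API.
    on_period is "warn" (emit a warning) or "fail" (raise ValueError).
    """
    offending = [str(name) for name in names if "." in str(name)]
    if not offending:
        return offending
    message = (
        "Field names contain periods, which the DataFrame DSL may interpret "
        "as nested field access: " + ", ".join(offending)
    )
    if on_period == "fail":
        raise ValueError(message)
    warnings.warn(message)
    return offending


# Column names from the "longley" example in the issue description.
longley_columns = ["GNP.deflator", "GNP", "Unemployed"]
try:
    check_field_names(longley_columns, on_period="fail")
except ValueError as err:
    print(err)
```

A check like this only replaces the obscure GetField error with a clearer one; it does not make such field names usable.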


