Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/12/15 11:35:46 UTC
[jira] [Resolved] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

     [ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-12325.
-------------------------------
    Resolution: Invalid

[~Narine] I'm going to push back on this, since it's inappropriate to open a "Critical Bug" on JIRA just to get attention for your question. This is, at best, suggesting a very minor change to an error message.

Please first read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Then consider whether you clearly explained your problem and proposed change -- I think it's a lot simpler than this.

When you're ready to open a pull request with a change to the message, *then* make a "Minor Improvement" / "Documentation" JIRA which your PR references.

> Inappropriate error messages in DataFrame StatFunctions 
> --------------------------------------------------------
>
>                 Key: SPARK-12325
>                 URL: https://issues.apache.org/jira/browse/SPARK-12325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Narine Kokhlikyan
>            Priority: Critical
>
> Hi there,
> I mentioned this issue earlier in one of my pull requests for the SQL component, but I have never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call the DataFrame correlation method, and the error message says that the covariance calculation is not supported.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.
>     at scala.Predef$.require(Predef.scala:233)
>     at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations for both correlation and covariance. This may be convenient
> from a certain perspective; however, it is harder to understand and extend, especially if you want to add another algorithm,
> e.g. Spearman correlation or something else.
> There are many possible solutions here, starting from:
> 1. just fixing the message
> 2. fixing the message and renaming CovarianceCounter and the corresponding methods
> 3. creating a CorrelationCounter and splitting the computations for correlation and covariance
> and many more ...
> Since I'm not getting any response, and according to GitHub all five of you have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you please explain this behavior of the stat functions, or communicate more about it?
> In case you are planning to remove it or do something else, we'd truly appreciate it if you communicated that.
> In fact, I would like to open a pull request for this, but since my pull requests in the SQL/ML components are just sitting there without any response, I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine
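
For illustration only, here is a minimal Scala sketch of the two points raised in the quoted report. It is not Spark's code: CoMomentCounter, StatChecks and requireNumeric are hypothetical names invented for this sketch. The counter shows why a single one-pass accumulator can serve both covariance and Pearson correlation, since both are derived from the same running co-moment. The helper shows one possible way to have the error message name the function the user actually called.

```scala
// Hypothetical sketch; these names are illustrative and are not Spark's internal API.

// One-pass (Welford-style) accumulator for a pair of numeric columns.
final class CoMomentCounter {
  private var n = 0L
  private var meanX = 0.0
  private var meanY = 0.0
  private var m2X = 0.0 // running sum of squared deviations of x
  private var m2Y = 0.0 // running sum of squared deviations of y
  private var cXY = 0.0 // running sum of co-deviations of x and y

  def add(x: Double, y: Double): this.type = {
    n += 1
    val dx = x - meanX // deviation from the mean before this update
    val dy = y - meanY
    meanX += dx / n
    meanY += dy / n
    m2X += dx * (x - meanX)
    m2Y += dy * (y - meanY)
    cXY += dx * (y - meanY)
    this
  }

  // Sample covariance; NaN when fewer than two rows have been added.
  def sampleCovariance: Double = cXY / (n - 1)

  // Pearson correlation; NaN when either column has zero variance.
  def pearson: Double = cXY / math.sqrt(m2X * m2Y)
}

object StatChecks {
  // Assumption: the caller passes the name of the statistic it is computing,
  // so a corr() call would report "Correlation" rather than "Covariance".
  def requireNumeric(functionName: String, column: String,
                     typeName: String, isNumeric: Boolean): Unit =
    require(isNumeric,
      s"$functionName calculation for column '$column' with dataType $typeName is not supported.")
}
```

Under these assumptions, a correlation entry point would call StatChecks.requireNumeric("Correlation", ...) before feeding rows to the counter, and a covariance entry point would pass "Covariance". A rank-based statistic such as Spearman correlation would need a different accumulator, because it depends on ranks rather than on running moments.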


