Posted to issues@spark.apache.org by "Narine Kokhlikyan (JIRA)" <ji...@apache.org> on 2015/12/14 21:54:46 UTC
[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

     [ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-12325:
--------------------------------------
    Description: 
Hi there,

I mentioned this issue earlier in one of my pull requests for the SQL component, but I never received feedback on any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list the facts again:

1. I call the DataFrame correlation method, and the error message complains about covariance.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
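
For illustration, the type check could take the name of the statistic being computed instead of hard-coding "Covariance". The following is only a minimal sketch, not Spark's actual code: the ColumnType hierarchy, StatChecks and requireNumeric are hypothetical names made up for this example, and the message wording is an assumption.

```scala
// Hypothetical stand-ins for Spark SQL data types; these are NOT Spark's classes.
sealed trait ColumnType
case object StringType extends ColumnType
case object DoubleType extends ColumnType
case object IntegerType extends ColumnType

object StatChecks {
  // Hypothetical helper: the caller passes the name of the statistic it computes,
  // so the failure message matches the method the user actually invoked.
  def requireNumeric(functionName: String, column: String, dataType: ColumnType): Unit = {
    val isNumeric = dataType match {
      case DoubleType | IntegerType => true
      case StringType => false
    }
    // require throws IllegalArgumentException("requirement failed: " + message).
    require(isNumeric,
      s"$functionName calculation for column '$column' with dataType $dataType is not supported.")
  }
}
```

With such a helper, a correlation call on a string column would fail with a message that mentions "Correlation", while a covariance call would still mention "Covariance".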


2. The bigger issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations for both correlation and covariance. This might be convenient
from a certain perspective; however, it is harder to understand and extend, especially if you want to add another algorithm,
e.g. Spearman correlation.

There are many possible solutions here, starting from:
1. just fixing the message
2. fixing the message and renaming CovarianceCounter and the corresponding methods
3. creating a CorrelationCounter and splitting the computations for correlation and covariance

and many more.
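
To make options 2 and 3 more concrete, here is a rough sketch of one possible separation: a neutrally named accumulator collects the running moments in a single pass, and each statistic is a small object of its own. This is an assumption about how such a split could look, not the existing Spark implementation; MomentCounter, PairStatistic, SampleCovariance and PearsonCorrelation are hypothetical names.

```scala
// Hypothetical accumulator: keeps running means and co-moments for two columns,
// updated one row at a time (a Welford-style single-pass update).
final class MomentCounter {
  var count: Long = 0L
  var xAvg: Double = 0.0
  var yAvg: Double = 0.0
  var ck: Double = 0.0   // running sum of (x - xAvg) * (y - yAvg)
  var mkX: Double = 0.0  // running sum of (x - xAvg)^2
  var mkY: Double = 0.0  // running sum of (y - yAvg)^2

  def add(x: Double, y: Double): this.type = {
    count += 1
    val deltaX = x - xAvg          // distance from the old mean of x
    val deltaY = y - yAvg          // distance from the old mean of y
    xAvg += deltaX / count
    yAvg += deltaY / count
    ck += deltaX * (y - yAvg)      // old-mean delta times new-mean delta
    mkX += deltaX * (x - xAvg)
    mkY += deltaY * (y - yAvg)
    this
  }
}

// Each statistic carries its own name (usable in error messages) and its own formula.
trait PairStatistic {
  def name: String
  def result(m: MomentCounter): Double
}

object SampleCovariance extends PairStatistic {
  val name = "Covariance"
  // Sample covariance; not meaningful for fewer than two rows.
  def result(m: MomentCounter): Double = m.ck / (m.count - 1)
}

object PearsonCorrelation extends PairStatistic {
  val name = "Correlation"
  // Pearson correlation; NaN when either column has zero variance.
  def result(m: MomentCounter): Double = m.ck / math.sqrt(m.mkX * m.mkY)
}
```

In this layout the shared pass over the data stays in one place, while a further moment-based statistic would only need another PairStatistic. A rank-based method such as Spearman correlation would presumably need its own accumulator, since it does not work from these moments directly.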

Since I'm not getting any response and, according to GitHub, all five of you have been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Could any of you please explain this behavior or communicate more about it?
If you are planning to remove it or do something else, we'd truly appreciate it if you let us know.

In fact, I would like to open a pull request for this, but since my pull requests in the SQL/ML components are sitting there without any response, I'll wait for your reply first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine



> Inappropriate error messages in DataFrame StatFunctions 
> --------------------------------------------------------
>
>                 Key: SPARK-12325
>                 URL: https://issues.apache.org/jira/browse/SPARK-12325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Critical


