You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by sachingoel0101 <gi...@git.apache.org> on 2015/08/18 15:34:32 UTC

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

GitHub user sachingoel0101 opened a pull request:

    https://github.com/apache/flink/pull/1032

    [FLINK-2379][ml]Add column wise statistics for vector Data Sets

    This is because of splitting the work in #861 in two parts.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachingoel0101/flink statistics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1032.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1032
    
----
commit ce433130b909daf3b11c814fa6897af7753404f9
Author: Sachin Goel <sa...@gmail.com>
Date:   2015-08-18T13:27:53Z

    [FLINK-2379]Adds column statistics for vector Data Sets

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #1032: [FLINK-2379][ml]Add column wise statistics for vector Dat...

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/flink/pull/1032
  
    
    [![Coverage Status](https://coveralls.io/builds/11626696/badge)](https://coveralls.io/builds/11626696)
    
    Changes Unknown when pulling **9343b20c108ae3963d8ad98e003e96614cec93b2 on sachingoel0101:statistics** into ** on apache:master**.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

Posted by chiwanpark <gi...@git.apache.org>.
Github user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/1032#issuecomment-162518625
  
    Hi @sachingoel0101, I just reviewed quickly. I have some questions and comments for this pull request.
    
    First, why `FieldStats` covers statistics value for continuous and discrete values? There is no common value between statistics value for continuous and that for discrete. How about split them into `ContinuousFieldStats` and `DiscreteFieldStats`?
    
    Second, many scaladocs are written in javadoc style.
    
    Third, we need to generalize type of the input dataset to `DataSet[T]` whose generic parameter is `T <: Vector` because `DataSet[Vector]` cannot cover `DataSet[DenseVector]` and `DataSet[SparseVector]`.
    
    If there is something which I misunderstood, please feel free to leave a comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

Posted by sachingoel0101 <gi...@git.apache.org>.
Github user sachingoel0101 commented on the pull request:

    https://github.com/apache/flink/pull/1032#issuecomment-136978688
  
    @tillrohrmann , can you review this?
    This will be a starting point for a package to provide more statistical methods, such as hypothesis testing, correlation, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

Posted by chiwanpark <gi...@git.apache.org>.
Github user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/1032#issuecomment-162519456
  
    Additionally, if you will modify your pull request, please do not rebase the commit because I'm tracking your branch (`sachingoel0101/flink.git/statistics`). We can rebase and squash the commits just before merging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

Posted by sachingoel0101 <gi...@git.apache.org>.
Github user sachingoel0101 commented on the pull request:

    https://github.com/apache/flink/pull/1032#issuecomment-162550968
  
    Hi @chiwanpark, thanks for picking this up. :)
    
    Since a `Vector` might contain discrete fields as well as continuous fields, we need to have a `FieldStats` object which can cover both types. To prevent the need of casting from `FieldStats` to `ContinuousFieldStats` and `DiscreteFieldStats` in case there is an abstract class `FieldStats`, I supported them both in a single class. 
    What do you think would be the best solution here?
    
    As for your second point regarding `T <: Vector`, will fix it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

Posted by chiwanpark <gi...@git.apache.org>.
Github user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/1032#issuecomment-162821875
  
    How about base class with declaration of the both statistics values and child class with the implementation of that values?
    
    ```scala
    abstract class FieldStats {
      // statistics values for discrete fields
      def entropy: Double = throw NonImplementedError("entropy cannot be accessed for continuous fields")
      def gini: Double = throw NonImplementedError("gini cannot be accessed for continuous fields")
      def categoryCounts: Map[Double, Int] = throw NonImplementedError("categoryCounts cannot be accessed for continuous fields")
    
      // statistics values for continuous fields
      def max: Double = throw NonImplementedError("max cannot be accessed for discrete fields")
      def min: Double = throw NonImplementedError("min cannot be accessed for discrete fields")
      def mean: Double = throw NonImplementedError("mean cannot be accessed for discrete fields")
      def variance: Double = throw NonImplementedError("variance cannot be accessed for discrete fields")
    }
    
    class DiscreteFieldStats(
      private val counts: Map[Double, Int]
    ) extends FieldStats {
      override lazy val entropy = // calculation of entropy
      override lazy val gini = // calculation of gini
      override lazy val categoryCounts = // calculation of categoryCounts
      override def toString = // implementation of toString
    }
    
    class ContinuousFieldStats(
      override val max: Double,
      override val min: Double,
      override val mean: Double,
      override val variance: Double
    ) extends FieldStats {
      override def toString = // implementation of toString
    }
    ```
    
    If a user calls some non-implemented methods in the derived classes, `scala.NotImplementedError` will be raised. Additionally, we can calculate all values (including gini and entropy) once in constructor with this approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

Posted by chiwanpark <gi...@git.apache.org>.
Github user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/1032#issuecomment-162506577
  
    I would like to shepherd this pull request. Sorry for long delay.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2379][ml]Add column wise statistics for...

Posted by sachingoel0101 <gi...@git.apache.org>.
Github user sachingoel0101 commented on the pull request:

    https://github.com/apache/flink/pull/1032#issuecomment-156147916
  
    Ping.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---