You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/01/18 09:30:28 UTC

[GitHub] spark pull request #20313: [SPARK-22974][ML] Attach attributes to output col...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/20313

    [SPARK-22974][ML] Attach attributes to output column of CountVectorModel

    ## What changes were proposed in this pull request?
    
    The output column from `CountVectorModel` lacks attribute. So a later transformer like `Interaction` can raise error because no attribute available.
    
    ## How was this patch tested?
    
    Added test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-22974

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20313
    
----
commit aeae308055cd16d888895ef9ff86df882ec1aa20
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-01-18T09:25:54Z

    Attach attributes to output column of CountVectorModel.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86332/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    LGTM. I think to have transformer framework working properly, it's required to have attributes in `CountVector`. Being said that, we should deal with the issue of allocating big attributes for sparse cases in as a separate task.
    
    Merged into master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20313: [SPARK-22974][ML] Attach attributes to output col...

Posted by PowerToThePeople111 <gi...@git.apache.org>.

Github user PowerToThePeople111 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20313#discussion_r208977114
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -264,7 +265,9 @@ class CountVectorizerModel(
     
           Vectors.sparse(dictBr.value.size, effectiveCounts)
         }
    -    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
    +    val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]
    --- End diff --
    
    I do not think, that the information is totally useless: if you want to know which feature-vector-index (created by a CountVectorizer) corresponds to which LR coefficient for example is very helpful. It should in general be possible to actually easily get this information given an arbitrary vector which was created by properly implemented feature-generation-transformer.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    cc @dbtsai too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    ping @jkbradley @MLnick again


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    cc @MLnick @WeichenXu123 @jkbradley 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    **[Test build #86332 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86332/testReport)** for PR 20313 at commit [`aeae308`](https://github.com/apache/spark/commit/aeae308055cd16d888895ef9ff86df882ec1aa20).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20313: [SPARK-22974][ML] Attach attributes to output col...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20313#discussion_r194636521
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -264,7 +265,9 @@ class CountVectorizerModel(
     
           Vectors.sparse(dictBr.value.size, effectiveCounts)
         }
    -    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
    +    val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]
    --- End diff --
    
    Sorry for replying late. Though I agree that this attributes don't provide much info, I'm wondering if we can let it lazily generated. At this point, I think we don't know if following transformer will need it or not?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20313: [SPARK-22974][ML] Attach attributes to output column of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20313
  
    **[Test build #86332 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86332/testReport)** for PR 20313 at commit [`aeae308`](https://github.com/apache/spark/commit/aeae308055cd16d888895ef9ff86df882ec1aa20).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20313: [SPARK-22974][ML] Attach attributes to output col...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20313


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20313: [SPARK-22974][ML] Attach attributes to output col...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20313#discussion_r178517391
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -264,7 +265,9 @@ class CountVectorizerModel(
     
           Vectors.sparse(dictBr.value.size, effectiveCounts)
         }
    -    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
    +    val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]
    --- End diff --
    
    The attributes append no useful statistics but only allocate a large array. I think it should be generated lazily, e.g., when it needed in following transformer then we generate it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org