You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by facaiy <gi...@git.apache.org> on 2017/07/04 05:46:09 UTC

[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

GitHub user facaiy opened a pull request:

    https://github.com/apache/spark/pull/18523

    [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type

    ## What changes were proposed in this pull request?
    add the column name in the exception which is raised by unsupported data type.
    
    ## How was this patch tested?
    [ ] pass all tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/vectorassembler_add_col

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18523
    
----
commit 95dbf6c7b287d0010af9de377ff6b93dec760808
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-07-04T05:42:07Z

    ENH: report the name of missing column

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125876862
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    -    inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +    val incorrectColumns = inputColNames.flatMap { name =>
    --- End diff --
    
    Make sense, thanks for clarifying.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125849590
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    -    inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +    val incorrectColumns = inputColNames.flatMap { name =>
    +      schema(name).dataType match {
    +        case _: NumericType | BooleanType => None
    +        case t if t.isInstanceOf[VectorUDT] => None
    +        case other => Some(s"Data type $other of column $name is not supported.")
    +      }
    +    }
    +    if (!incorrectColumns.isEmpty) {
    --- End diff --
    
    Trivial, but you can use `.nonEmpty` here instead


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125452186
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    +    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
         inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +      case (_, _: NumericType | BooleanType) =>
    +      case (_, _: VectorUDT) =>
    +      case (name, other) =>
    +        throw new IllegalArgumentException(s"Data type $other of column $name is not supported.")
    --- End diff --
    
    How about just
    
    ```
    inputColNames.foreach { name =>
      schema(name).dataType match {
        case _: NumericType | BooleanType | VectorUDT =>
        case other =>
          throw new IllegalArgumentException(s"Data type $other in column $name is not supported.")
      }
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125652869
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    -    inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +    val incorrectColumns = inputColNames.flatMap { name =>
    --- End diff --
    
    One minor question: Does it necessary to change from ```map``` to ```flatMap```?Otherwise, LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125860650
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    -    inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +    val incorrectColumns = inputColNames.flatMap { name =>
    +      schema(name).dataType match {
    +        case _: NumericType | BooleanType => None
    +        case t if t.isInstanceOf[VectorUDT] => None
    +        case other => Some(s"Data type $other of column $name is not supported.")
    +      }
    +    }
    +    if (!incorrectColumns.isEmpty) {
    --- End diff --
    
    Yes, modified.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    Merged into master, thanks for all!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125584572
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    +    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
         inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +      case (_, _: NumericType | BooleanType) =>
    +      case (_, t) if t.isInstanceOf[VectorUDT] =>
    --- End diff --
    
    Hey, it will throw no found exception if `t: VectorUDT` is used. Do you have any idea?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125849537
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    -    inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +    val incorrectColumns = inputColNames.flatMap { name =>
    --- End diff --
    
    @yanboliang I think @facaiy means that the match is producing `None` and `Some` here and that's a reasonably conventional way to produce 0 or 1 outputs for each input.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125804032
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    -    inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +    val incorrectColumns = inputColNames.flatMap { name =>
    --- End diff --
    
    I see, but the type of ```inputColNames``` is not ```Option[T]```, so here we can still use ```map```. This doesn't matter, LGTM pending Jenkins.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18523


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    **[Test build #3836 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3836/testReport)** for PR 18523 at commit [`9ac85ed`](https://github.com/apache/spark/commit/9ac85ed3d773548472b069b6825cddc166a232d6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    Thanks @facaiy for the changes. I wonder if the code could `collect` all the columns with incorrect type in one go (rather than reporting issues column by column until a user fixed all).
    
    Something along the lines:
    
    ```
    val incorrectColumns = inputColNames
      .map { name => (name, schema(name).dataType) }
      .collect {
        case (_, _: NumericType | BooleanType) =>
        case t if t.isInstanceOf[VectorUDT] =>
        case other => other }
    // report issue with incorrectColumns
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    **[Test build #3836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3836/testReport)** for PR 18523 at commit [`9ac85ed`](https://github.com/apache/spark/commit/9ac85ed3d773548472b069b6825cddc166a232d6).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    @SparkQA test again, please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    **[Test build #3837 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3837/testReport)** for PR 18523 at commit [`942e533`](https://github.com/apache/spark/commit/942e533af18af5cf556f094565b3d2450bc8838d).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125763918
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    -    inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +    val incorrectColumns = inputColNames.flatMap { name =>
    --- End diff --
    
    Hi, yanbo. Use `flatMap` is aimed at unpacking value from Option. Is it OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    Good idea!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    I don't know how to write an unit test for the pr? Is it necessary?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125397518
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    +    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
         inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +      case (_, _: NumericType | BooleanType) =>
    +      case (_, t) if t.isInstanceOf[VectorUDT] =>
    --- End diff --
    
    I wonder why the line does not do `(_, t: VectorUDT)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    @SparkQA Jenkins, run tests again, please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125398010
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    +    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
         inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +      case (_, _: NumericType | BooleanType) =>
    +      case (_, t) if t.isInstanceOf[VectorUDT] =>
    --- End diff --
    
    Good suggestion. fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18523
  
    **[Test build #3837 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3837/testReport)** for PR 18523 at commit [`942e533`](https://github.com/apache/spark/commit/942e533af18af5cf556f094565b3d2450bc8838d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18523#discussion_r125539040
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       override def transformSchema(schema: StructType): StructType = {
         val inputColNames = $(inputCols)
         val outputColName = $(outputCol)
    -    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
    +    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
         inputDataTypes.foreach {
    -      case _: NumericType | BooleanType =>
    -      case t if t.isInstanceOf[VectorUDT] =>
    -      case other =>
    -        throw new IllegalArgumentException(s"Data type $other is not supported.")
    +      case (_, _: NumericType | BooleanType) =>
    +      case (_, _: VectorUDT) =>
    +      case (name, other) =>
    +        throw new IllegalArgumentException(s"Data type $other of column $name is not supported.")
    --- End diff --
    
    Fine, modified.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org