You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/10/09 04:38:53 UTC

[PR] [SPARK-45466][ML] `VectorAssembler` should validate the vector values [spark]

zhengruifeng opened a new pull request, #43288:
URL: https://github.com/apache/spark/pull/43288

   ### What changes were proposed in this pull request?
   Make `VectorAssembler` validate the vector values
   
   
   ### Why are the changes needed?
   `VectorAssembler` should validate the vector values
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, in `error` mode, if a vector contains `NaN`, it fails
   
   
   ### How was this patch tested?
   added UTs
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45466][ML] `VectorAssembler` should validate the vector values [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43288:
URL: https://github.com/apache/spark/pull/43288#issuecomment-1754259142

   > Does this change behavior or just refactor code? I'm actually surprised that NaN is treated as an error but looks like that is existing behavior?
   
   It is a behavior change.
   Before this PR, NaN/null doubles and null vectors are treated as errors, but NaN values inside a vector are ignored.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45466][ML] `VectorAssembler` should validate the vector elements [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43288:
URL: https://github.com/apache/spark/pull/43288#issuecomment-1756833728

   ok. I'm fine to close it since there is no strong requirement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45466][ML] `VectorAssembler` should validate the vector elements [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43288:
URL: https://github.com/apache/spark/pull/43288#issuecomment-1755231209

   > Is that really an error? It's the only way to represent a missing value
   
   https://github.com/apache/spark/blob/d679dabdd1b5ad04b8c7deb1c06ce886a154a928/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L72-L81
   
   
   according to the doc, the `NULL and NaN values` are invalid data.
   And the default `handleInvalid` is `error`, which treats invalid values as errors.
   
   If users want to keep the missing value, users can set `handleInvalid=keep`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45466][ML] `VectorAssembler` should validate the vector elements [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #43288:
URL: https://github.com/apache/spark/pull/43288#issuecomment-1755440862

   I see. Would users want to filter/catch NaN here and not upstream? maybe. I'm reluctant to change behavior even in 4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45466][ML] `VectorAssembler` should validate the vector elements [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #43288: [SPARK-45466][ML] `VectorAssembler` should validate the vector elements
URL: https://github.com/apache/spark/pull/43288


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45466][ML] `VectorAssembler` should validate the vector elements [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #43288:
URL: https://github.com/apache/spark/pull/43288#issuecomment-1754292488

   Is that really an error? It's the only way to represent a missing value


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org