Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/10/16 02:50:34 UTC

[PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

zhengruifeng opened a new pull request, #43380:
URL: https://github.com/apache/spark/pull/43380

   ### What changes were proposed in this pull request?
   Validate Vectors with built-in function
   
   
   ### Why are the changes needed?
   With built-in functions instead of UDFs, the validation logic might be optimized further by the SQL engine.
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1767424792

   Thanks. I happened to find a few other similar places; please hold on and let me add them all.



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1766284743

   Ok seems fine if you like



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #43380: [SPARK-45547][ML] Validate Vectors with built-in function
URL: https://github.com/apache/spark/pull/43380



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1774455364

   @srowen would you mind taking another look? I think all vector-validation-related UDFs are covered in this PR.
   
   When we check the values, performance is similar;
   when we check the size, the new implementation is much faster since it can read the `size` field in `VectorUDT` directly.
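   
   A minimal sketch of what such a size check could look like with built-in functions, meant for a `spark-shell` session. It assumes that the struct behind `VectorUDT` has the fields `type`, `size`, `indices`, and `values`, that `size` is populated only for sparse vectors, and that `unwrap_udt` is callable from the shell as in the benchmark comment in this thread. The helper name `vectorSize` and the sample data are hypothetical and not taken from the PR.
   
   ```scala
   import org.apache.spark.ml.linalg.Vectors
   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions._

   // Hypothetical helper: the length of a vector column as a SQL expression.
   // Assumption: `size` is null for dense vectors, so we fall back to the
   // length of the `values` array in that case.
   def vectorSize(vec: Column): Column = {
     val struct = unwrap_udt(vec)
     coalesce(struct.getField("size"), size(struct.getField("values")))
   }

   // Small sample with one dense and one sparse vector, both of length 3.
   val sample = Seq(
     (0, Vectors.dense(1.0, 2.0, 3.0)),
     (1, Vectors.sparse(3, Array(0), Array(1.0)))
   ).toDF("i", "vec")

   // Aggregate the per-row check into a single boolean.
   val allSized = sample
     .select(bool_and(vectorSize(col("vec")) === 3))
     .head()
     .getBoolean(0)
   ```
   
   Under these assumptions, the sparse case never has to touch the `values` array, which is presumably where the speedup comes from.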



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1765976728

   ```
   scala> import org.apache.spark.ml.linalg._
        |
        | val df = Seq.range(0, 1000000).map(i => (i,Vectors.dense(Array.fill(256)(1.0)))).toDF("i", "vec")
        |
   import org.apache.spark.ml.linalg._
   val df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
   
   scala> df.count()
   23/10/17 16:28:31 WARN TaskSetManager: Stage 0 contains a task of very large size (1391 KiB). The maximum recommended task size is 1000 KiB.
   val res0: Long = 1000000
   
   scala> val validateUDF = udf { vector: Vector =>
        |     vector match {
        |       case dv: DenseVector =>
        |         dv.values.forall(v => !v.isNaN && !v.isInfinity)
        |       case sv: SparseVector =>
        |         sv.values.forall(v => !v.isNaN && !v.isInfinity)
        |     }
        |   }
   val validateUDF: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$4177/0x000000b00198fa00@17640714,BooleanType,List(Some(class[value[0]: vector])),Some(class[value[0]: boolean]),None,false,true)
   
   scala> val validatedCol = forall(unwrap_udt(col("vec")).getField("values"), v => not(v.isNaN) && abs(v) =!= expr("double('inf')"))
   val validatedCol: org.apache.spark.sql.Column = forall(unwrap_udt(vec)[values], lambdafunction(and(`!`(isNaN(x_0)), `!`(`=`(abs(x_0), double(inf)))), x_0))
   
   scala> val start = System.currentTimeMillis; df.select(bool_and(validateUDF(col("vec")))).head(); System.currentTimeMillis - start
   23/10/17 16:28:47 WARN TaskSetManager: Stage 3 contains a task of very large size (176683 KiB). The maximum recommended task size is 1000 KiB.
   val start: Long = 1697531323779
   val res1: Long = 4562
   
   scala>
   
   scala> val start = System.currentTimeMillis; df.select(bool_and(validatedCol)).head(); System.currentTimeMillis - start
   23/10/17 16:28:52 WARN TaskSetManager: Stage 6 contains a task of very large size (176683 KiB). The maximum recommended task size is 1000 KiB.
   val start: Long = 1697531329903
   val res2: Long = 4637
   ```
   
   I did a quick test, and there seems to be no significant difference (4562 ms with the UDF vs. 4637 ms with the built-in expression). Also note that this validation is only performed once.
   
   Using built-in functions can help simplify the code and potentially benefit from SQL optimization.
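   
   The numbers above come from a single run of each variant. As a rough sketch, a small helper could repeat each measurement and report the median to reduce noise; `medianMillis` is a hypothetical name and not part of Spark.
   
   ```scala
   // Hypothetical helper: evaluate `body` `runs` times and return the
   // median wall-clock time in milliseconds.
   def medianMillis[T](runs: Int)(body: => T): Long = {
     require(runs > 0, "runs must be positive")
     val times = Seq.fill(runs) {
       val start = System.nanoTime()
       body // the by-name block is evaluated once per run
       (System.nanoTime() - start) / 1000000L
     }.sorted
     times(times.length / 2)
   }
   ```
   
   In the session above it could be used as `medianMillis(5) { df.select(bool_and(validateUDF(col("vec")))).head() }` and `medianMillis(5) { df.select(bool_and(validatedCol)).head() }`.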



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #43380: [SPARK-45547][ML] Validate Vectors with built-in function
URL: https://github.com/apache/spark/pull/43380



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1774280453

   On second thought, let's leave it alone.



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1776239897

   merged to master



Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1765482685

   Is this actually faster? I would be surprised.

