Posted to reviews@spark.apache.org by "panbingkun (via GitHub)" <gi...@apache.org> on 2023/05/15 07:29:39 UTC

[GitHub] [spark] panbingkun opened a new pull request, #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

panbingkun opened a new pull request, #41169:
URL: https://github.com/apache/spark/pull/41169

   ### What changes were proposed in this pull request?
   This PR aims to add a max distance argument to the levenshtein() function.
   
   ### Why are the changes needed?
   Currently, Spark's levenshtein(str1, str2) function can be very inefficient for long strings. Many other databases that support this built-in function also accept a third argument specifying a maximum distance, after which the algorithm may stop early.
   
   For example, with a signature like
   
   levenshtein(str1, str2[, max_distance])
   
   the function stops computing the distance once the maximum value is reached (see the sketch below).
   See PostgreSQL for an example of a 3-argument [levenshtein](https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7).
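   
   As a quick, hedged illustration (a minimal sketch, assuming a local SparkSession and a Spark build that already includes this change; the expected results follow the examples documented in this PR):
   
   ```scala
   // Minimal sketch, not production code: demonstrates the proposed 3-argument form.
   import org.apache.spark.sql.SparkSession
   
   object LevenshteinThresholdDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
       // Without a threshold, the full distance is computed.
       spark.sql("SELECT levenshtein('kitten', 'sitting')").show()    // 3
       // With max_distance = 2, the distance (3) exceeds the limit, so the
       // function returns -1 and the computation can stop early.
       spark.sql("SELECT levenshtein('kitten', 'sitting', 2)").show() // -1
       spark.stop()
     }
   }
   ```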
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   ### How was this patch tested?
   Add new UT & Pass GA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1241146403


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2133,31 +2133,147 @@ case class OctetLength(child: Expression)
 /**
  * A function that return the Levenshtein distance between the two given strings.
  */
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2[, threshold]) - Returns the Levenshtein distance between the two given strings. If threshold is set and distance more than it, return -1.""",
   examples = """
     Examples:
       > SELECT _FUNC_('kitten', 'sitting');
        3
+      > SELECT _FUNC_('kitten', 'sitting', 2);
+       -1
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+// scalastyle:on line.size.limit
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)

Review Comment:
   I have strengthened the check for constants and ranges.





[GitHub] [spark] MaxGekk commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1559850268

   +1, LGTM. Merging to master.
   Thank you, @panbingkun.




[GitHub] [spark] cloud-fan commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1211911239


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2133,31 +2133,147 @@ case class OctetLength(child: Expression)
 /**
  * A function that return the Levenshtein distance between the two given strings.
  */
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2[, threshold]) - Returns the Levenshtein distance between the two given strings. If threshold is set and distance more than it, return -1.""",
   examples = """
     Examples:
       > SELECT _FUNC_('kitten', 'sitting');
        3
+      > SELECT _FUNC_('kitten', 'sitting', 2);
+       -1
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+// scalastyle:on line.size.limit
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)

Review Comment:
   I'm wondering if we should require the threshold parameter to be a constant, so that we can do more checks, like requiring it to be positive and not null.
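   
   For illustration, a hedged sketch (illustrative names, not this PR's code) of the kind of constant check being suggested: require a foldable threshold and validate it eagerly.
   
   ```scala
   // Hypothetical helper, not this PR's code; `thresholdExpr` is an illustrative name.
   import org.apache.spark.sql.catalyst.expressions.Expression
   
   object ThresholdChecks {
     def validateThreshold(thresholdExpr: Expression): Unit = {
       require(thresholdExpr.foldable, "threshold must be a constant expression")
       val value = thresholdExpr.eval() // safe to evaluate eagerly because it is foldable
       require(value != null, "threshold must not be null")
       require(value.asInstanceOf[Int] > 0, "threshold must be positive")
     }
   }
   ```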





[GitHub] [spark] LuciferYang commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1193717481


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -37,7 +37,6 @@
 
 import static org.apache.spark.unsafe.Platform.*;
 
-

Review Comment:
   please remove this change





[GitHub] [spark] panbingkun commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1195837984


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2142,22 +2142,118 @@ case class OctetLength(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression
+  with ImplicitCastInputTypes
+  with NullIntolerant{
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  def this(left: Expression, right: Expression, threshold: Expression) =
+    this(left, right, Option(threshold))
+
+  def this(left: Expression, right: Expression) = this(left, right, None)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.length > 3 || children.length < 2) {
+      throw QueryCompilationErrors.wrongNumArgsError(
+        toSQLId(prettyName), Seq(2, 3), children.length)
+    } else {
+      super.checkInputDataTypes()
+    }
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = if (threshold.isDefined) {
+    Seq(StringType, StringType, IntegerType)
+  } else {
+    Seq(StringType, StringType)
+  }
+
+  override def children: Seq[Expression] = if (threshold.isDefined) {
+    Seq(left, right, threshold.get)
+  } else {
+    Seq(left, right)
+  }
 
   override def dataType: DataType = IntegerType
-  protected override def nullSafeEval(leftValue: Any, rightValue: Any): Any =
-    leftValue.asInstanceOf[UTF8String].levenshteinDistance(rightValue.asInstanceOf[UTF8String])
 
-  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (left, right) =>
-      s"${ev.value} = $left.levenshteinDistance($right);")
+  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
+    if (threshold.isDefined) {
+      copy(left = newChildren(0), right = newChildren(1), threshold = Some(newChildren(2)))
+    } else {
+      copy(left = newChildren(0), right = newChildren(1))
+    }
+
+  override def nullable: Boolean = children.exists(_.nullable)
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def eval(input: InternalRow): Any = {
+    val leftEval = left.eval(input)
+    if (leftEval == null) return null
+    val rightEval = right.eval(input)
+    if (rightEval == null) return null
+
+    val thresholdEval = threshold.map(_.eval(input))
+    if (thresholdEval.isDefined) {
+      leftEval.asInstanceOf[UTF8String].levenshteinDistance(
+        rightEval.asInstanceOf[UTF8String], thresholdEval.get.asInstanceOf[Int])
+    } else {
+      leftEval.asInstanceOf[UTF8String].levenshteinDistance(rightEval.asInstanceOf[UTF8String])
+    }
   }
 
-  override protected def withNewChildrenInternal(
-    newLeft: Expression, newRight: Expression): Levenshtein = copy(left = newLeft, right = newRight)
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {

Review Comment:
   Done





[GitHub] [spark] panbingkun commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1196188986


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2134,30 +2134,145 @@ case class OctetLength(child: Expression)
  * A function that return the Levenshtein distance between the two given strings.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.
+      If threshold is set and distance more than it, return -1.""",
   examples = """
     Examples:
       > SELECT _FUNC_('kitten', 'sitting');
        3
+      > SELECT _FUNC_('kitten', 'sitting', 2);
+       -1
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression

Review Comment:
   1. If we extend TernaryExpression, the threshold expression will no longer be optional, similar to the implementation of `Substring`. When no threshold is passed in, we would set a default value of `Integer.MAX_VALUE` and keep only one levenshteinDistance algorithm in UTF8String - `levenshteinDistance(UTF8String other, int threshold)`.
   However, it seems that `org.apache.commons.text.similarity.LevenshteinDistance` did not do this; it retains two methods - `limitedCompare` & `unlimitedCompare`.
   https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java
   
   Are we really going to do this?
   
   2. Additionally, I am referring to the implementation of [`ArrayJoin`](https://github.com/apache/spark/blob/d44e073f0cdaf16028a4854e79db200a4e39a6fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1824-L1854) in Spark.
   
   3. The UT refers to https://github.com/apache/commons-text/blob/master/src/test/java/org/apache/commons/text/similarity/LevenshteinDistanceTest.java#L76-L138





[GitHub] [spark] LuciferYang commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1194845716


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2142,22 +2142,118 @@ case class OctetLength(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression
+  with ImplicitCastInputTypes
+  with NullIntolerant{
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  def this(left: Expression, right: Expression, threshold: Expression) =
+    this(left, right, Option(threshold))
+
+  def this(left: Expression, right: Expression) = this(left, right, None)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.length > 3 || children.length < 2) {
+      throw QueryCompilationErrors.wrongNumArgsError(
+        toSQLId(prettyName), Seq(2, 3), children.length)
+    } else {
+      super.checkInputDataTypes()
+    }
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = if (threshold.isDefined) {

Review Comment:
   A `case` match may be more robust? Otherwise, if `threshold` is `null`, an NPE will be thrown here.
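   
   A hedged, standalone sketch of that point (plain Scala, not the PR's code): pattern matching tolerates a `null` Option, whereas calling `.isDefined` on it throws.
   
   ```scala
   // Illustrative only: why `opt match { ... }` is safer than `opt.isDefined`
   // when `opt` itself might be null.
   object OptionNullDemo {
     def describe(opt: Option[Int]): String = opt match {
       case Some(v) => s"threshold = $v"
       case _       => "no threshold" // matches both None and null
     }
   
     def main(args: Array[String]): Unit = {
       println(describe(Some(3))) // threshold = 3
       println(describe(None))    // no threshold
       println(describe(null))    // no threshold -- no NPE
       // By contrast, this would throw a NullPointerException:
       // val broken: Option[Int] = null; broken.isDefined
     }
   }
   ```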
   



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2142,22 +2142,118 @@ case class OctetLength(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression
+  with ImplicitCastInputTypes
+  with NullIntolerant{
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  def this(left: Expression, right: Expression, threshold: Expression) =
+    this(left, right, Option(threshold))
+
+  def this(left: Expression, right: Expression) = this(left, right, None)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.length > 3 || children.length < 2) {
+      throw QueryCompilationErrors.wrongNumArgsError(
+        toSQLId(prettyName), Seq(2, 3), children.length)
+    } else {
+      super.checkInputDataTypes()
+    }
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = if (threshold.isDefined) {
+    Seq(StringType, StringType, IntegerType)
+  } else {
+    Seq(StringType, StringType)
+  }
+
+  override def children: Seq[Expression] = if (threshold.isDefined) {

Review Comment:
   ditto





[GitHub] [spark] panbingkun commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1547342279

   cc @MaxGekk 




[GitHub] [spark] LuciferYang commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1547697865

   @panbingkun Running `ProtoToParsedPlanTestSuite` with this PR, `function_levenshtein` failed as follows:
   ```
   [info] - function_levenshtein *** FAILED *** (4 milliseconds)
   [info]   Expected and actual plans do not match:
   [info]   
   [info]   === Expected Plan ===
   [info]   Project [levenshtein(g#0, bob) AS levenshtein(g, bob)#0]
   [info]   +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
   [info]   
   [info]   
   [info]   === Actual Plan ===
   [info]   Project [levenshtein(g#0, bob, None) AS levenshtein(g, bob)#0]
   [info]   +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] (ProtoToParsedPlanTestSuite.scala:177)
   ```
   
   We should re-generate the golden files.




[GitHub] [spark] LuciferYang commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1194847178


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2142,22 +2142,118 @@ case class OctetLength(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression
+  with ImplicitCastInputTypes
+  with NullIntolerant{
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  def this(left: Expression, right: Expression, threshold: Expression) =
+    this(left, right, Option(threshold))
+
+  def this(left: Expression, right: Expression) = this(left, right, None)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.length > 3 || children.length < 2) {
+      throw QueryCompilationErrors.wrongNumArgsError(
+        toSQLId(prettyName), Seq(2, 3), children.length)
+    } else {
+      super.checkInputDataTypes()
+    }
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = if (threshold.isDefined) {
+    Seq(StringType, StringType, IntegerType)
+  } else {
+    Seq(StringType, StringType)
+  }
+
+  override def children: Seq[Expression] = if (threshold.isDefined) {
+    Seq(left, right, threshold.get)
+  } else {
+    Seq(left, right)
+  }
 
   override def dataType: DataType = IntegerType
-  protected override def nullSafeEval(leftValue: Any, rightValue: Any): Any =
-    leftValue.asInstanceOf[UTF8String].levenshteinDistance(rightValue.asInstanceOf[UTF8String])
 
-  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (left, right) =>
-      s"${ev.value} = $left.levenshteinDistance($right);")
+  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
+    if (threshold.isDefined) {

Review Comment:
   ditto





[GitHub] [spark] panbingkun commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1196113962


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2134,30 +2134,145 @@ case class OctetLength(child: Expression)
  * A function that return the Levenshtein distance between the two given strings.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.
+      If threshold is set and distance more than it, return -1.""",

Review Comment:
   ok





[GitHub] [spark] panbingkun commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1606053243

   > Hi @panbingkun , do you have time to address [#41169 (comment)](https://github.com/apache/spark/pull/41169#discussion_r1211911239) . If not I propose to revert it first and resubmit later. I don't think we have well-defined behavior for now: what if the `max_distance` parameter is null or negative? Should -1 mean no limitation so that we can unify the code?
   
   Since `max_distance = Int.MaxValue` means no limitation, I think it is possible to unify the code later on.
   The current design follows the `org.apache.commons.text.similarity.LevenshteinDistance` class, which also retains two methods: `limitedCompare` & `unlimitedCompare`.
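   
   As a quick sanity check of that equivalence, a small hedged sketch against the commons-text class referenced above (assumes `org.apache.commons.text` is on the classpath):
   
   ```scala
   // Illustrative only: with a threshold of Int.MaxValue the limited and
   // unlimited code paths agree, which is what would let a single
   // threshold-taking method cover both cases.
   import org.apache.commons.text.similarity.LevenshteinDistance
   
   object UnifyThresholdDemo {
     def main(args: Array[String]): Unit = {
       val unlimited = new LevenshteinDistance()             // unlimitedCompare path
       val limited   = new LevenshteinDistance(Int.MaxValue) // limitedCompare path
       println(unlimited.apply("kitten", "sitting")) // 3
       println(limited.apply("kitten", "sitting"))   // 3
     }
   }
   ```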




[GitHub] [spark] beliefer commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1595576650

   It seems not difficult to fix.




[GitHub] [spark] panbingkun commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1193837633


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -37,7 +37,6 @@
 
 import static org.apache.spark.unsafe.Platform.*;
 
-

Review Comment:
   Done





[GitHub] [spark] chenhao-db commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "chenhao-db (via GitHub)" <gi...@apache.org>.
chenhao-db commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1548233629

   I am wondering whether it is better to follow PostgreSQL's semantics:
   
   > If the actual distance is less than or equal to max_d, then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater than max_d.
   
   or to follow `org.apache.commons.text.similarity.LevenshteinDistance.limitedCompare`'s semantics to return -1 when the distance is greater than the threshold (the current code).
   
   I think the former is probably better: the optimizer can safely convert `levenshtein(s1, s2) < c` into `levenshtein(s1, s2, c) < c`, which I believe is quite a common use case of `levenshtein`. A toy illustration of the difference follows below.
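   
   A toy, hedged sketch (plain Scala, not Spark code) of why the PostgreSQL-style convention composes with such a predicate while the -1 sentinel needs special handling:
   
   ```scala
   // `actual` stands in for the true Levenshtein distance; the two helpers model
   // the two return conventions under a maximum distance `maxD`.
   object ThresholdSemanticsDemo {
     def limitedPg(actual: Int, maxD: Int): Int   = if (actual > maxD) maxD + 1 else actual
     def limitedNeg1(actual: Int, maxD: Int): Int = if (actual > maxD) -1 else actual
   
     def main(args: Array[String]): Unit = {
       val c = 2
       val actual = 3 // e.g. the distance between 'kitten' and 'sitting'
       println(actual < c)                 // false: the answer the full function gives
       println(limitedPg(actual, c) < c)   // false: rewriting to the limited form is safe
       println(limitedNeg1(actual, c) < c) // true:  the -1 sentinel breaks the naive rewrite
     }
   }
   ```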




[GitHub] [spark] MaxGekk commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1594637895

   @panbingkun Please let me know if you don't have time to fix the corner cases mentioned by @cloud-fan. I am going to fix them soon.




[GitHub] [spark] panbingkun commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1547530123

   > Maybe we should keep connect client synchronized with this change, or at least add an `exclude` entry in `CheckConnectJvmClientCompatibility`
   
   I will add an `exclude` entry in `CheckConnectJvmClientCompatibility`.
   I will implement it in `connect` later.
   




[GitHub] [spark] panbingkun commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1196197681


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala:
##########
@@ -516,6 +516,95 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     // scalastyle:on
   }
 
+  test("Levenshtein distance threshold") {

Review Comment:
   referring to https://github.com/apache/commons-text/blob/master/src/test/java/org/apache/commons/text/similarity/LevenshteinDistanceTest.java#L76-L138





[GitHub] [spark] LuciferYang commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1547471468

   Maybe we should keep the connect client synchronized with this change, or at least add an `exclude` entry in `CheckConnectJvmClientCompatibility` in this PR?




[GitHub] [spark] LuciferYang commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1195051956


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2142,22 +2142,118 @@ case class OctetLength(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression
+  with ImplicitCastInputTypes
+  with NullIntolerant{
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  def this(left: Expression, right: Expression, threshold: Expression) =
+    this(left, right, Option(threshold))
+
+  def this(left: Expression, right: Expression) = this(left, right, None)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.length > 3 || children.length < 2) {
+      throw QueryCompilationErrors.wrongNumArgsError(
+        toSQLId(prettyName), Seq(2, 3), children.length)
+    } else {
+      super.checkInputDataTypes()
+    }
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = if (threshold.isDefined) {
+    Seq(StringType, StringType, IntegerType)
+  } else {
+    Seq(StringType, StringType)
+  }
+
+  override def children: Seq[Expression] = if (threshold.isDefined) {
+    Seq(left, right, threshold.get)
+  } else {
+    Seq(left, right)
+  }
 
   override def dataType: DataType = IntegerType
-  protected override def nullSafeEval(leftValue: Any, rightValue: Any): Any =
-    leftValue.asInstanceOf[UTF8String].levenshteinDistance(rightValue.asInstanceOf[UTF8String])
 
-  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (left, right) =>
-      s"${ev.value} = $left.levenshteinDistance($right);")
+  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
+    if (threshold.isDefined) {
+      copy(left = newChildren(0), right = newChildren(1), threshold = Some(newChildren(2)))
+    } else {
+      copy(left = newChildren(0), right = newChildren(1))
+    }
+
+  override def nullable: Boolean = children.exists(_.nullable)
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def eval(input: InternalRow): Any = {
+    val leftEval = left.eval(input)
+    if (leftEval == null) return null
+    val rightEval = right.eval(input)
+    if (rightEval == null) return null
+
+    val thresholdEval = threshold.map(_.eval(input))
+    if (thresholdEval.isDefined) {
+      leftEval.asInstanceOf[UTF8String].levenshteinDistance(
+        rightEval.asInstanceOf[UTF8String], thresholdEval.get.asInstanceOf[Int])
+    } else {
+      leftEval.asInstanceOf[UTF8String].levenshteinDistance(rightEval.asInstanceOf[UTF8String])
+    }
   }
 
-  override protected def withNewChildrenInternal(
-    newLeft: Expression, newRight: Expression): Levenshtein = copy(left = newLeft, right = newRight)
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {

Review Comment:
   There are many branches here to check `if (threshold.isDefined)`. 
   
   If we split it into
   ```scala
   if (threshold.isDefined) { // or threshold match ...
     doGenCodeWithThreshold(...)
   } else {
     doGenCodeWithoutThreshold(...)
   }
   ```
   will it look clearer, and would we only need to check `if (threshold.isDefined)` once?
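   
   A hedged, standalone sketch of that idea (plain Scala, not the PR's codegen; the helper names echo the snippet above): resolve the optional threshold once with a pattern match and route to one of two helpers.
   
   ```scala
   // Illustrative only: dispatch on the Option a single time instead of
   // re-checking `isDefined` in every branch.
   object DispatchOnceDemo {
     def withThreshold(s1: String, s2: String, t: Int): String = s"limited($s1, $s2, $t)"
     def withoutThreshold(s1: String, s2: String): String      = s"unlimited($s1, $s2)"
   
     def codeFor(s1: String, s2: String, threshold: Option[Int]): String =
       threshold match {
         case Some(t) => withThreshold(s1, s2, t)
         case None    => withoutThreshold(s1, s2)
       }
   
     def main(args: Array[String]): Unit = {
       println(codeFor("left", "right", Some(2))) // limited(left, right, 2)
       println(codeFor("left", "right", None))    // unlimited(left, right)
     }
   }
   ```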





[GitHub] [spark] LuciferYang commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1550573293

   also cc @wangyum @beliefer FYI




[GitHub] [spark] MaxGekk closed pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function
URL: https://github.com/apache/spark/pull/41169




[GitHub] [spark] panbingkun commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1193849840


##########
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala:
##########
@@ -180,6 +180,7 @@ object CheckConnectJvmClientCompatibility {
       ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.udaf"),
       ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.typedlit"),
       ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.typedLit"),
+      ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.levenshtein"),

Review Comment:
   Let's exclude it first.
   I will implement it for connect.





[GitHub] [spark] MaxGekk commented on a diff in pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1196017647


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2134,30 +2134,145 @@ case class OctetLength(child: Expression)
  * A function that return the Levenshtein distance between the two given strings.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.
+      If threshold is set and distance more than it, return -1.""",

Review Comment:
   Please, add the parameter that you mentioned: `_FUNC_(str1, str2[, threshold])`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2134,30 +2134,145 @@ case class OctetLength(child: Expression)
  * A function that return the Levenshtein distance between the two given strings.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.
+      If threshold is set and distance more than it, return -1.""",
   examples = """
     Examples:
       > SELECT _FUNC_('kitten', 'sitting');
        3
+      > SELECT _FUNC_('kitten', 'sitting', 2);
+       -1
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression

Review Comment:
   How about extending `TernaryExpression`? See for instance:
   https://github.com/apache/spark/blob/e37c74b3ef40aebd33ca9338210300a1ed1164f8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L1974-L1979
   
   Or are there reasons not to do that?





[GitHub] [spark] panbingkun commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1605916874

   I am very sorry, I only saw this now. Let me take a look at this first.




[GitHub] [spark] cloud-fan commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1573122266

   Hi @panbingkun, do you have time to address https://github.com/apache/spark/pull/41169#discussion_r1211911239 ? If not, I propose to revert it first and resubmit later. I don't think we have well-defined behavior for now: what if the `max_distance` parameter is null or negative? Should -1 mean no limitation so that we can unify the code?

