Posted to reviews@spark.apache.org by "chenhao-db (via GitHub)" <gi...@apache.org> on 2023/05/15 17:06:48 UTC

[GitHub] [spark] chenhao-db commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

chenhao-db commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1548233629

   I am wondering whether it is better to follow PostgreSQL's semantics:
   
   > If the actual distance is less than or equal to max_d, then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater than max_d.
   
   or to follow the semantics of `org.apache.commons.text.similarity.LevenshteinDistance.limitedCompare`, which returns -1 when the distance is greater than the threshold (as the current code does).
   
   I think the former is probably better: the optimizer can safely rewrite `levenshtein(s1, s2) < c` as `levenshtein(s1, s2, c) < c`, which I believe is a common use case for `levenshtein`. With the -1 convention that rewrite would not be safe, because -1 is itself less than any positive `c`.
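
To illustrate the difference between the two conventions, here is a minimal Python sketch. It is not the Spark implementation: the helper names `lev_postgres_style` and `lev_minus_one_style` are hypothetical, both compute the full distance first rather than terminating early, and `max_d` is assumed to be non-negative. The PostgreSQL quote only promises "some value greater than max_d"; returning `max_d + 1` is one arbitrary choice that satisfies it.

```python
def levenshtein(s: str, t: str) -> int:
    """Plain dynamic-programming Levenshtein distance, no threshold."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lev_postgres_style(s: str, t: str, max_d: int) -> int:
    """Hypothetical helper: exact distance if <= max_d, else a value > max_d."""
    d = levenshtein(s, t)
    return d if d <= max_d else max_d + 1


def lev_minus_one_style(s: str, t: str, max_d: int) -> int:
    """Hypothetical helper: exact distance if <= max_d, else -1."""
    d = levenshtein(s, t)
    return d if d <= max_d else -1


s1, s2, c = "kitten", "sitting", 2   # the true distance is 3

original = levenshtein(s1, s2) < c                     # False
rewritten_pg = lev_postgres_style(s1, s2, c) < c       # False, matches
rewritten_neg = lev_minus_one_style(s1, s2, c) < c     # True, does not match
```

Under the first convention the rewritten predicate agrees with the original for every non-negative `c`; under the second, the sentinel -1 passes the `< c` check even though the real distance exceeds the threshold, so the rewrite would need an extra guard.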


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
