
Posted to issues@spark.apache.org by "Dinesh Dharme (Jira)" <ji...@apache.org> on 2023/08/27 13:03:00 UTC

[jira] [Created] (SPARK-44979) Cache results of simple udfs on executors if same arguments are passed.

Dinesh Dharme created SPARK-44979:
-------------------------------------

             Summary: Cache results of simple udfs on executors if same arguments are passed.
                 Key: SPARK-44979
                 URL: https://issues.apache.org/jira/browse/SPARK-44979
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.4.1
            Reporter: Dinesh Dharme


Consider two dataframes:

{{keyword_given = [
    ["green pstr",],
    ["greenpstr",],
    ["wlmrt", ],
    ["walmart",],
    ["walmart super",]
    ]}}

{{variations = [
            ("type green pstr", "ABC", 100),
            ("type green pstr","PQR",200),
            ("type green pstr", "NZSD", 2999),
            ("wlmrt payment","walmart",200),
            ("wlmrt solutions", "walmart", 200),
            ("nppssdwlmrt", "walmart", 2000)
             ]}}

Suppose the task is fuzzy substring matching between each keyword and variation[0] using the built-in levenshtein function. This can already be optimized in user code: extract the unique values, run the fuzzy matching on the uniques only, and join the results back to the original table.
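
The dedupe-then-join workaround can be sketched in plain Python, outside Spark. This is only an illustration under assumptions: the hand-written `levenshtein` helper stands in for Spark's built-in function, plain edit distance stands in for whatever fuzzy substring match is actually wanted, and the names `unique_texts`, `distances`, and `joined` are hypothetical.

```python
# Data from the example above, with keywords flattened to plain strings.
keyword_given = ["green pstr", "greenpstr", "wlmrt", "walmart", "walmart super"]
variations = [
    ("type green pstr", "ABC", 100),
    ("type green pstr", "PQR", 200),
    ("type green pstr", "NZSD", 2999),
    ("wlmrt payment", "walmart", 200),
    ("wlmrt solutions", "walmart", 200),
    ("nppssdwlmrt", "walmart", 2000),
]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (stand-in for the built-in)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Step 1: extract the unique values of variation[0].
unique_texts = sorted({row[0] for row in variations})

# Step 2: run the expensive function once per unique (keyword, text) pair.
distances = {
    (k, t): levenshtein(k, t) for k in keyword_given for t in unique_texts
}

# Step 3: join the results back onto every original row.
joined = [
    (k, text, label, amount, distances[(k, text)])
    for k in keyword_given
    for (text, label, amount) in variations
]
```

With this data the function runs on 5 x 4 unique pairs rather than 5 x 6 row pairs, and the saving grows with the amount of repetition in variation[0].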

As an optimization, though, Spark itself could cache the results of UDF calls computed so far and do a lookup on each executor separately whenever the same arguments are passed again.
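
The caching idea amounts to memoizing a deterministic function on its arguments. A minimal single-process sketch using Python's `functools.lru_cache` is shown here; it is not a Spark feature or an existing Spark config, and the names `cached_levenshtein`, `calls`, and `rows` are hypothetical. How such a cache would behave inside an executor (lifetime, size limits, sharing across tasks) is exactly what the proposal leaves open.

```python
from functools import lru_cache

calls = 0  # counts how often the real computation runs


@lru_cache(maxsize=10_000)  # bounded, so the cache cannot grow without limit
def cached_levenshtein(a, b):
    """Edit distance, memoized on the (a, b) argument pair."""
    global calls
    calls += 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Repeated arguments, as produced by the duplicated "type green pstr" rows.
rows = [("green pstr", "type green pstr")] * 3 + [
    ("wlmrt", "wlmrt payment"),
    ("wlmrt", "wlmrt solutions"),
    ("wlmrt", "nppssdwlmrt"),
]

results = [cached_levenshtein(k, t) for k, t in rows]
info = cached_levenshtein.cache_info()  # hits and misses recorded by lru_cache
```

Memoization like this is only safe for deterministic, side-effect-free functions, which is presumably what "simple udfs" in the summary refers to.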

Just a thought. Not sure if it makes any sense. This behaviour could be behind a config.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org