You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/02/17 09:24:00 UTC

[jira] [Assigned] (SPARK-38238) Contains Join for Spark SQL

     [ https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38238:
------------------------------------

    Assignee:     (was: Apache Spark)

> Contains Join for Spark SQL
> ---------------------------
>
>                 Key: SPARK-38238
>                 URL: https://issues.apache.org/jira/browse/SPARK-38238
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way String Matching{*}, and many algorithm trying to improve this matching. One of the famous algorithm called [*Aho–Corasick algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a trie tree and broadcast it. So then each row from the left table only need to match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org