Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/02/17 09:24:00 UTC
[jira] [Assigned] (SPARK-38238) Contains Join for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-38238:
------------------------------------
Assignee: (was: Apache Spark)
> Contains Join for Spark SQL
> ---------------------------
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Wan Kun
> Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match against the left table, the query may run for a very long time.
> This kind of join is known as *Multi-Pattern String Matching* or {*}Multi-Way String Matching{*}, and many algorithms have been developed to speed it up. One of the best known is the [*Aho–Corasick algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm].
> The basic idea for optimizing this query is to build a trie from all the patterns and broadcast it. Each row from the left table then only needs to be matched against the trie once.
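As a rough illustration of the trie-based idea (not Spark code; the function names and data layout here are our own), a minimal Aho–Corasick automaton can be sketched in Python. `build_automaton` builds the trie with failure links once (this is the structure that would be broadcast), and `match` finds every pattern contained in a text in a single pass:

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: a trie plus BFS-computed failure links."""
    # Parallel arrays indexed by node id: child transitions, failure link, matches.
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(p)  # this node marks the end of pattern p
    # Breadth-first traversal to fill in failure links for depth >= 2;
    # the root's direct children already fail back to the root (0).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            # Follow failure links until a node with a ch-transition is found.
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A node also matches everything its failure target matches.
            out[child] |= out[fail[child]]
    return goto, fail, out

def match(text, automaton):
    """Return the set of patterns that occur anywhere in text, in one pass."""
    goto, fail, out = automaton
    node, found = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found |= out[node]
    return found
```

In the join scenario, the automaton would play the role of the broadcast side: each `a.text` row calls `match` once instead of being compared against every pattern row.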
> The query cost will go from *O(M * N * m * n)* to *O(M * m * max(n))*, where:
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table
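To make the asymptotic claim concrete, here is a back-of-the-envelope comparison with hypothetical sizes (the numbers are illustrative only, not from the ticket):

```python
# Hypothetical sizes, chosen only to illustrate the complexity difference.
M = 10**7   # rows in the fact table
N = 10**4   # rows in the patterns table
m = 100     # average text length in the fact table
n = 10      # pattern length (here uniform, so max(n) == n)

nested_loop = M * N * m * n   # O(M * N * m * n) character comparisons
trie_scan   = M * m * n       # O(M * m * max(n)) after broadcasting the trie

print(nested_loop // trie_scan)  # → 10000, i.e. the factor N disappears
```

The broadcast nested loop join pays the factor N (one pass per pattern per row), which is exactly what the shared trie removes.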
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org