You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@asterixdb.apache.org by dl...@apache.org on 2020/08/25 16:27:20 UTC

[asterixdb] branch master updated: [NO ISSUE] Added Documentation for Interval Joins

This is an automated email from the ASF dual-hosted git repository.

dlych pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/asterixdb.git


The following commit(s) were added to refs/heads/master by this push:
     new dfdb656  [NO ISSUE] Added Documentation for Interval Joins
dfdb656 is described below

commit dfdb6569a183e24ce6dd08f5a15b64c41b907433
Author: Caleb Herbel <ca...@gmail.com>
AuthorDate: Thu Aug 20 10:01:34 2020 -0600

    [NO ISSUE] Added Documentation for Interval Joins
    
    - user model changes: no
    - storage format changes: no
    - interface changes: no
    
    Details:
    -Added documentation for interval joins
    -Added documentation on how to pick an optimized interval join
    
    Change-Id: Ibb4acb9c072b24723815112da66db0827f1916ac
    Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/7604
    Integration-Tests: Jenkins <je...@fulliautomatix.ics.uci.edu>
    Tested-by: Jenkins <je...@fulliautomatix.ics.uci.edu>
    Reviewed-by: Preston Carman <pr...@gmail.com>
    Reviewed-by: Dmitry Lychagin <dm...@couchbase.com>
---
 asterixdb/asterix-doc/pom.xml                      |  2 +-
 .../asterix-doc/src/main/markdown/sqlpp/0_toc.md   |  1 +
 .../markdown/sqlpp/appendix_2_interval_joins.md    | 89 ++++++++++++++++++++++
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/asterixdb/asterix-doc/pom.xml b/asterixdb/asterix-doc/pom.xml
index ea23f5e..f7d914e 100644
--- a/asterixdb/asterix-doc/pom.xml
+++ b/asterixdb/asterix-doc/pom.xml
@@ -52,7 +52,7 @@
             <configuration>
               <target>
                 <concat destfile="${project.build.directory}/generated-site/markdown/sqlpp/manual.md">
-                  <filelist dir="${project.basedir}/src/main/markdown/sqlpp" files="0_toc.md,1_intro.md,2_expr_title.md,2_expr.md,3_query_title.md,3_declare_dataverse.md,3_declare_function.md,3_query.md,4_error_title.md,4_error.md,5_ddl_head.md,5_ddl_dataset_index.md,5_ddl_function_removal.md,5_ddl_dml.md,appendix_1_title.md,appendix_1_keywords.md,appendix_2_title.md,appendix_2_parameters.md,appendix_2_parallel_sort.md,appendix_2_index_only.md,appendix_3_title.md,appendix_3_resolution.md" />
+                  <filelist dir="${project.basedir}/src/main/markdown/sqlpp" files="0_toc.md,1_intro.md,2_expr_title.md,2_expr.md,3_query_title.md,3_declare_dataverse.md,3_declare_function.md,3_query.md,4_error_title.md,4_error.md,5_ddl_head.md,5_ddl_dataset_index.md,5_ddl_function_removal.md,5_ddl_dml.md,appendix_1_title.md,appendix_1_keywords.md,appendix_2_title.md,appendix_2_parameters.md,appendix_2_parallel_sort.md,appendix_2_index_only.md,appendix_2_interval_joins.md,appendix_3_titl [...]
                 </concat>
                 <concat destfile="${project.build.directory}/generated-site/markdown/sqlpp/builtins.md">
                   <filelist dir="${project.basedir}/src/main/markdown/builtins" files="0_toc.md,0_toc_sqlpp.md,0_toc_common.md,1_numeric_common.md,1_numeric_delta.md,2_string_common.md,2_string_delta.md,3_binary.md,4_spatial.md,5_similarity.md,6_tokenizing.md,7_temporal.md,7_allens.md,8_record.md,9_aggregate_sql.md,10_comparison.md,11_type.md,13_conditional.md,12_misc.md,15_bitwise.md,14_window.md" />
diff --git a/asterixdb/asterix-doc/src/main/markdown/sqlpp/0_toc.md b/asterixdb/asterix-doc/src/main/markdown/sqlpp/0_toc.md
index e65ae9f..f4e2cef 100644
--- a/asterixdb/asterix-doc/src/main/markdown/sqlpp/0_toc.md
+++ b/asterixdb/asterix-doc/src/main/markdown/sqlpp/0_toc.md
@@ -99,5 +99,6 @@
 * [Appendix 1. Reserved Keywords](#Reserved_keywords)
 * [Appendix 2. Performance Tuning](#Performance_tuning)
       * [Parallelism Parameter](#Parallelism_parameter)
+      * [Interval Joins](#Interval_joins)
       * [Memory Parameters](#Memory_parameters)
 * [Appendix 3. Variable Bindings and Name Resolution](#Variable_bindings_and_name_resolution)
diff --git a/asterixdb/asterix-doc/src/main/markdown/sqlpp/appendix_2_interval_joins.md b/asterixdb/asterix-doc/src/main/markdown/sqlpp/appendix_2_interval_joins.md
new file mode 100644
index 0000000..45f13f3
--- /dev/null
+++ b/asterixdb/asterix-doc/src/main/markdown/sqlpp/appendix_2_interval_joins.md
@@ -0,0 +1,89 @@
+<!--
+ ! Licensed to the Apache Software Foundation (ASF) under one
+ ! or more contributor license agreements.  See the NOTICE file
+ ! distributed with this work for additional information
+ ! regarding copyright ownership.  The ASF licenses this file
+ ! to you under the Apache License, Version 2.0 (the
+ ! "License"); you may not use this file except in compliance
+ ! with the License.  You may obtain a copy of the License at
+ !
+ !   http://www.apache.org/licenses/LICENSE-2.0
+ !
+ ! Unless required by applicable law or agreed to in writing,
+ ! software distributed under the License is distributed on an
+ ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ! KIND, either express or implied.  See the License for the
+ ! specific language governing permissions and limitations
+ ! under the License.
+ !-->
+
+This system allows for the 13 types of Allen's interval-join relations.
+The default, when using these joins, is either Nested Loop, or Hybrid Hash Join.
+The optimal algorithm will be automatically selected based on the query.
+Hybrid Hash Join will be selected in cases where the relation includes an equality;
+these cases are: `interval_starts()`, `interval_started_by()`, `interval_ends()`, `interval_ended_by()`,
+`interval_meets()`, and `interval_met_by()`.
+Otherwise, the system will default to nested loop join.
+To use interval merge join you must include a range hint.
+Adding a range hint allows for the system to pick interval merge join.
+
+## <a id="Interval_joins">Types of Interval Joins</a>
+The 13 interval functions are `interval_after()`, `interval_before()`, `interval_covers()`, `interval_covered_by()`,
+`interval_ends()`, `interval_ended_by()`, `interval_meets()`, `interval_met_by()`, `interval_overlaps()`,
+`interval_overlapping()`, `interval_overlapped_by()`, `interval_starts()`, and `interval_started_by()`.
+
+##### How to use an interval join
+
+    select f.name as staff, d.name as student
+    from Staff as f, Students as d
+    where interval_after(f.employment, d.attendance)
+
+In this scenario, `interval_after()` can be replaced with any of the 13 join functions.
+Here is what each of the functions represent if A represents the first interval parameter,
+and B represents the second set interval parameter:
+
+| Function | Condition |
+|-------------------------|-------------------------|
+| Before(A, B) and After(B, A) | A.end < B.start |
+| Covers(A, B) and Covered_by(B, A) | A.start <= B.start and A.end >= B.end |
+| Ends(A, B) and Ended_by(B, A) | A.end = B.end and A.start >= B.start |
+| Meets(A, B) and Met_by(B, A) | A.end = B.start |
+| Overlaps(A, B) and Overlapped_by(B, A) | A.start < B.start and B.start > A.end and A.end > B.start |
+| Overlapping(A, B)| (A.start >= B.start and B.start < A.end) or (B.end <= A.end and B.end < A.start)|
+| Starts(A, B) and Started_by(B, A) | A.start = B.start and A.end <= B.end |
+
+## <a id="Range_hint">Using a Range Hint</a>
+
+To use an efficient interval join the data must be partitioned with the details in a range hint.
+Interval joins with a range hint currently work for intervals types of date, datetime, or time;
+the range hint type must match the interval type.
+Adding a range hint directly before the interval join function will cause the system to pick interval
+merge join for these interval functions: `interval_after()`, `interval_before()`, `interval_covers()`,
+`interval_covered_by()`, `interval_overlaps()`, `interval_overlapping()`, `interval_overlapped_by()`.
+The other relations will ignore the range hint and pick Hybrid Hash Join as described earlier.
+
+Here is an example of how interval joins work with a range hint for all the supported data types.
+Suppose that we have two sets of data, a data set of staff members with an interval for length of
+employment and an id.
+The other dataset represents students, which may include an interval for attendance and an id.
+Each partition receives data based on the split points;
+The split points in the range hint must be strategically set by the
+user so that the data divides evenly among partitions.
+For example, if your query contains 1 split point, and the system is using two partitions,
+the data before the split point will be sent to the first partition,
+and the data after the split point will be sent to the second partition.
+This continues to work respectively based on the number of split points and number of partitions.
+Ideally, the number of split points should equal the number of partitions - 1.
+
+##### Range Hint Example
+
+    /*+ range [<Expression>, ..., ] */
+
+##### Range Hint Example with Date
+
+    select f.name as staff, d.name as student
+    from Staff as f, Students as d
+    where
+    /*+ range [date("2003-06-30"), date("2005-12-31"), date("2008-06-30")] */
+    interval_after(f.employment, d.attendance)
+    order by f.name, d.name;