You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Li Jin (JIRA)" <ji...@apache.org> on 2018/01/03 19:31:01 UTC
[jira] [Created] (SPARK-22947) SPIP: as-of join in Spark SQL

Li Jin created SPARK-22947:
------------------------------

             Summary: SPIP: as-of join in Spark SQL
                 Key: SPARK-22947
                 URL: https://issues.apache.org/jira/browse/SPARK-22947
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Li Jin


h2. Background and Motivation
Time series analysis is one of the most common analysis on financial data. In time series analysis, as-of join is a very common operation. Supporting as-of join in Spark SQL will allow many use cases of using Spark SQL for time series analysis.

As-of join is “join on time” with inexact time matching criteria. Various library has implemented asof join or similar functionality:
Kdb: https://code.kx.com/wiki/Reference/aj
Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
R: This functionality is called “Last Observation Carried Forward”
https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
Flint: https://github.com/twosigma/flint#temporal-join-functions

This proposal advocates introducing new API in Spark SQL to support as-of join.

h2. Target Personas
Data scientists, data engineers

h2. Goals
* New API in Spark SQL that allows as-of join
* As-of join of multiple table (>2) should be performant, because it’s very common that users need to join multiple data sources together for further analysis.
* Define Distribution, Partitioning and shuffle strategy for ordered time series data

h2. Non-Goals
These are out of scope for the existing SPIP, should be considered in future SPIP as improvement to Spark’s time series analysis ability:
* Utilize partition information from data source, i.e, begin/end of each partition to reduce sorting/shuffling
* Define API for user to implement asof join time spec in business calendar (i.e. lookback one business day, this is very common in financial data analysis because of market calendars)
* Support broadcast join

h2. Proposed API Changes
See attachment




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org