Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2022/04/10 04:05:00 UTC

[jira] [Created] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

zhengruifeng created SPARK-38844:
------------------------------------

             Summary: impl Series.interpolate and DataFrame.interpolate
                 Key: SPARK-38844
                 URL: https://issues.apache.org/jira/browse/SPARK-38844
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: zhengruifeng


h2. Goal:

[pandas's interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html] supports many methods. _linear_ is applied by default; other methods ( _pad_ _ffill_ _backfill_ _bfill_ ) can also be implemented in pandas API on Spark.

The remaining ones ( including _quadratic_ _cubic_ _spline_ ) cannot be implemented easily, since pandas uses scipy internally for them and the window frame they would need is complex.

Since the methods ( _pad_ _ffill_ _backfill_ _bfill_ ) are already implemented in pandas API on Spark via {_}fillna{_}, this work currently focuses on implementing the missing *linear interpolation*.

h2. Impl:

To implement linear interpolation, two extra window functions are added: one ( _null_index_ ) computes the position of each missing value within its run of consecutive missing values, and the other ({_}last_not_null{_}) keeps the last non-missing value. Each is evaluated in both the forward and the backward direction:
||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
|1|nan|1|nan|1|1|-|-|
|2|1|0|1|0|1| | |
|3|nan|1|1|3|5|2.0|2.0|
|4|nan|2|1|2|5|3.0|-|
|5|nan|3|1|1|5|4.0|-|
|6|5|0|5|0|5| | |
|7|6|0|6|0|6| | |
|8|nan|1|6|2|nan|6.0|6.0|
|9|nan|2|6|1|nan|6.0|-|
 * for the NaNs at indices (3,4,5), which have non-missing values on both sides, we always compute the filled value via

({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ + _last_not_null_forward_

 * for the NaN at index 1, skip it: no non-missing value precedes it, and the default *limit_direction* is _forward_

 * for the NaN at index 8, no non-missing value follows it ( _last_not_null_backward_ is NaN ), so fill it like _ffill_ with the value _last_not_null_forward_

 * If _limit_ is set, then NaNs with _null_index_forward_ greater than _limit_ will not be interpolated.
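
The rules above can be checked against the table with a minimal sketch in plain Python. This is not the Spark implementation: it uses ordinary lists instead of window functions, and the function names ( forward_helpers, interpolate_linear ) are hypothetical names chosen for illustration. It assumes the default *limit_direction* = _forward_ only.

```python
import math

NAN = float("nan")


def forward_helpers(values):
    """Scan left to right and return (null_index, last_not_null).

    null_index is the 1-based position of a NaN inside its run of
    consecutive NaNs (0 for non-missing values); last_not_null is the
    most recent non-missing value seen so far (NaN if none yet).
    """
    null_index, last_not_null = [], []
    run, last = 0, NAN
    for v in values:
        if math.isnan(v):
            run += 1
        else:
            run, last = 0, v
        null_index.append(run)
        last_not_null.append(last)
    return null_index, last_not_null


def interpolate_linear(values, limit=None):
    """Linear interpolation following the rules in this ticket (forward only)."""
    idx_f, last_f = forward_helpers(values)
    # The backward helpers are the same scan applied to the reversed input.
    idx_b_rev, last_b_rev = forward_helpers(values[::-1])
    idx_b, last_b = idx_b_rev[::-1], last_b_rev[::-1]

    out = []
    for v, nf, lf, nb, lb in zip(values, idx_f, last_f, idx_b, last_b):
        if not math.isnan(v):
            out.append(v)        # non-missing values are kept as-is
        elif math.isnan(lf):
            out.append(NAN)      # leading gap: skipped when direction is forward
        elif limit is not None and nf > limit:
            out.append(NAN)      # beyond the limit: not interpolated
        elif math.isnan(lb):
            out.append(lf)       # trailing gap: behaves like ffill
        else:
            out.append((lb - lf) / (nf + nb) * nf + lf)
    return out


values = [NAN, 1.0, NAN, NAN, NAN, 5.0, 6.0, NAN, NAN]
print(interpolate_linear(values))
# [nan, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 6.0]
print(interpolate_linear(values, limit=1))
# [nan, 1.0, 2.0, nan, nan, 5.0, 6.0, 6.0, nan]
```

The two printed lists correspond to the _filled_ and _filled (limit=1)_ columns of the table, with the original values kept for the non-missing rows.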

h2. Plan

1, impl the basic _linear interpolate_ with param _limit_

2, add param _limit_direction_

3, add param _limit_area_


