Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/04/10 04:19:00 UTC
[jira] [Commented] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

    [ https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520063#comment-17520063 ] 

Apache Spark commented on SPARK-38844:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/36127

> impl Series.interpolate and DataFrame.interpolate
> -------------------------------------------------
>
>                 Key: SPARK-38844
>                 URL: https://issues.apache.org/jira/browse/SPARK-38844
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> h2. Goal:
> [pandas's interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html] supports many methods. _linear_ is applied by default, and the fill-based methods ( _pad_ _ffill_ _backfill_ _bfill_ ) can also be implemented in pandas API on Spark.
> The remaining ones ( including _quadratic_ _cubic_ _spline_ ) cannot be implemented easily, since scipy is used internally and the window frame used is complex.
> Since the methods ( _pad_ _ffill_ _backfill_ _bfill_ ) are already implemented in pandas API on Spark via {_}fillna{_}, this work currently focuses on implementing the missing *linear interpolation*.
> h2. Impl:
> To implement the linear interpolation, two extra window functions are added: one ( _null_index_ ) computes the position of each missing value within its run of consecutive missing values (0 for non-missing values), and the other ({_}last_not_null{_}) keeps the last non-missing value. Both are evaluated in the forward and the backward direction:
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
>  * for the NaNs at indices (3,4,5), we always compute the filled value via
> ({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ + _last_not_null_forward_
> e.g. at index 3: (5 - 1) / (1 + 3) * 1 + 1 = 2.0
>  * for the NaN at index (1), skip it due to the default *limit_direction* = _forward_
>  * for the NaNs at indices (8,9), fill them like _ffill_ with the value _last_not_null_forward_
>  * If _limit_ is set, then NaNs with _null_index_forward_ greater than _limit_ will not be interpolated.
> h2. Plan
> 1, implement the basic _linear interpolate_ with param _limit_
> 2, add param _limit_direction_
> 3, add param _limit_area_
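
The fill rules from the Impl section can be checked against the table with a small pure-Python sketch. This is only an illustration under stated assumptions: it uses plain lists rather than Spark window functions, the helper names (forward_columns, backward_columns, linear_interpolate) are hypothetical and are not taken from the pull request, and only the default limit_direction = forward is handled.

```python
import math
from typing import List, Optional, Tuple

NaN = float("nan")


def forward_columns(values: List[float]) -> Tuple[List[int], List[float]]:
    """Compute (null_index_forward, last_not_null_forward) for each row."""
    null_index, last_not_null = [], []
    run, last = 0, NaN
    for v in values:
        if math.isnan(v):
            run += 1           # position inside the current run of NaNs
        else:
            run, last = 0, v   # a non-missing value resets the run
        null_index.append(run)
        last_not_null.append(last)
    return null_index, last_not_null


def backward_columns(values: List[float]) -> Tuple[List[int], List[float]]:
    """Same as forward_columns, but scanning from the end of the list."""
    idx, last = forward_columns(values[::-1])
    return idx[::-1], last[::-1]


def linear_interpolate(values: List[float], limit: Optional[int] = None) -> List[float]:
    """Linear interpolation with limit_direction='forward' (illustrative only)."""
    idx_f, last_f = forward_columns(values)
    idx_b, last_b = backward_columns(values)
    filled = []
    for v, nf, lf, nb, lb in zip(values, idx_f, last_f, idx_b, last_b):
        if not math.isnan(v):
            filled.append(v)      # nothing to fill
        elif math.isnan(lf) or (limit is not None and nf > limit):
            filled.append(NaN)    # leading NaN, or beyond the limit
        elif math.isnan(lb):
            filled.append(lf)     # trailing NaN: behaves like ffill
        else:
            filled.append((lb - lf) / (nf + nb) * nf + lf)
    return filled


# The "value" column of the table (indices 1..9).
values = [NaN, 1.0, NaN, NaN, NaN, 5.0, 6.0, NaN, NaN]
print(linear_interpolate(values))           # compare with the "filled" column
print(linear_interpolate(values, limit=1))  # compare with "filled (limit=1)"
```

In an actual Spark implementation the two scans would presumably be expressed as window functions over an ordered frame; the list-based loops here only mirror the arithmetic.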


