Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/07 12:35:22 UTC

[GitHub] [spark] ahmed-mahran opened a new pull request, #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

ahmed-mahran opened a new pull request, #38966:
URL: https://github.com/apache/spark/pull/38966

### What changes were proposed in this pull request?

This PR adds a pre-processing step to isotonic regression in mllib to handle duplicate features, matching the `sklearn` implementation. Input points with duplicate feature values are aggregated into a single point whose label is the weighted average of the labels of those points. All points sharing a feature value are aggregated as follows:
- Aggregated label is the weighted average of all labels
- Aggregated feature is the weighted average of all equal features. Feature values may be equal only up to some resolution because of representation errors; since we cannot know which feature value to use in that case, we compute the weighted average of the features. Ideally, all feature values are exactly equal and the weighted average is just the value at any point.
- Aggregated weight is the sum of all weights
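
The three rules above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code in this PR: the function name `aggregate_duplicates` and the `(label, feature, weight)` tuple layout are hypothetical, points are grouped by exact feature equality (so the "equal up to a resolution" case is not modelled), and every group is assumed to have a positive total weight.

```python
from collections import defaultdict


def aggregate_duplicates(points):
    """Collapse points that share a feature value into one point.

    `points` is an iterable of (label, feature, weight) tuples.
    Hypothetical helper for illustration; assumes the total weight
    of every group is positive.
    """
    # Group points by exact feature value (an assumption of this sketch).
    groups = defaultdict(list)
    for label, feature, weight in points:
        groups[feature].append((label, feature, weight))

    aggregated = []
    for feature in sorted(groups):
        group = groups[feature]
        # Aggregated weight: sum of all weights in the group.
        total_weight = sum(w for _, _, w in group)
        # Aggregated label: weighted average of the labels.
        agg_label = sum(l * w for l, _, w in group) / total_weight
        # Aggregated feature: weighted average of the (equal) features.
        agg_feature = sum(f * w for _, f, w in group) / total_weight
        aggregated.append((agg_label, agg_feature, total_weight))
    return aggregated


if __name__ == "__main__":
    # Two points at feature 0.6 with unit weights collapse into one point.
    for point in aggregate_duplicates([(1, 0.6, 1.0), (0, 0.6, 1.0), (1, 0.2, 1.0)]):
        print(point)
```

With exact grouping, the weighted average of the features only reproduces the shared value (up to floating-point rounding); it matters when near-equal values are treated as duplicates.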

### Why are the changes needed?

As per the discussion on ticket [[SPARK-41008]](https://issues.apache.org/jira/browse/SPARK-41008), the current handling of duplicate features is a bug, and results should match `sklearn`.

### Does this PR introduce _any_ user-facing change?

There are no changes to the API, documentation or error messages. However, the user should expect results to change.

### How was this patch tested?

Existing test cases for duplicate features failed with this change and were adjusted accordingly. New tests were also added.

Here is a python snippet that can be used to verify the results:

```python
from sklearn.isotonic import IsotonicRegression

def test(x, y, x_test, isotonic=True):
    ir = IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y)
    y_test = ir.predict(x_test)

    def print_array(label, a):
        print(f"{label}: [{', '.join([str(i) for i in a])}]")

    print_array("boundaries", ir.X_thresholds_)
    print_array("predictions", ir.y_thresholds_)
    print_array("y_test", y_test)

test(
    x = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    y = [1, 0, 0, 1, 0, 1, 0, 0, 0],
    x_test = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20]
)
```

@srowen @zapletal-martin
