You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/07 12:35:22 UTC

[GitHub] [spark] ahmed-mahran opened a new pull request, #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

ahmed-mahran opened a new pull request, #38966:
URL: https://github.com/apache/spark/pull/38966

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   Adding a pre-processing step to isotonic regression in mllib to handle duplicate features. This is to match `sklearn` implementation. Input points of duplicate feature values are aggregated into a single point using as label the weighted average of the labels of the points with duplicate feature values. All points for a unique feature values are aggregated as:
    - Aggregated label is the weighted average of all labels
    - Aggregated feature is the weighted average of all equal features. It is possible that feature values to be equal up to a resolution due to representation errors, since we cannot know which feature value to use in that case, we compute the weighted average of the features. Ideally, all feature values will be equal and the weighted average is just the value at any point.
    - Aggregated weight is the sum of all weights
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   As per discussion on ticket [[SPARK-41008]](https://issues.apache.org/jira/browse/SPARK-41008), it is a bug and results should match `sklearn`.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   There are no changes to the API, documentation or error messages. However, the user should expect results to change.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   Existing test cases for duplicate features failed. These tests were adjusted accordingly. Also, new tests are added.
   
   Here is a python snippet that can be used to verify the results:
   
   ```python
   from sklearn.isotonic import IsotonicRegression
   
   def test(x, y, x_test, isotonic=True):
       ir = IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y)
       y_test = ir.predict(x_test)
       
       def print_array(label, a):
           print(f"{label}: [{', '.join([str(i) for i in a])}]")
   
       print_array("boundaries", ir.X_thresholds_)
       print_array("predictions", ir.y_thresholds_)
       print_array("y_test", y_test)
   
   test(
       x = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
       y = [1, 0, 0, 1, 0, 1, 0, 0, 0],
       x_test = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20]
   )
   ```
   
   @srowen @zapletal-martin


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341263587

   Ok if it was already varying from sklearn, and isn't the same issue as you're fixing, leave it for now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1342822868

   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on a diff in pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on code in PR #38966:
URL: https://github.com/apache/spark/pull/38966#discussion_r1043429623


##########
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:
##########
@@ -434,12 +485,56 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
     val keyedInput = input.keyBy(_._2)
     val parallelStepResult = keyedInput
+      // Points with same or adjacent features must collocate within the same partition.
       .partitionBy(new RangePartitioner(keyedInput.getNumPartitions, keyedInput))
       .values
+      // Lexicographically sort points by features then labels.
       .mapPartitions(p => Iterator(p.toArray.sortBy(x => (x._2, x._1))))
+      // Aggregate points with equal features into a single point.
+      .map(makeUnique)
       .flatMap(poolAdjacentViolators)
       .collect()
-      .sortBy(x => (x._2, x._1)) // Sort again because collect() doesn't promise ordering.
+      // Sort again because collect() doesn't promise ordering.
+      .sortBy(x => (x._2, x._1))

Review Comment:
   I think now it is redundant to sort by labels since we already are making features unique.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341429923

   Regardless from this change, these tests' assumptions contradicts with what this ticket is about. See the tests' names at least: the tests are about handling duplicate features. My guess is that these were introduced to cover corner cases of lex sorting and repartitioning. However, we know that original implementation doesn't actually handle cosolidation of duplicate features. Hence, these tests by definition, are not valid.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341296380

   Ah OK, hm, that sounds like a regression related to this change then and probably needs to be fixed (or decide the old test was wrong)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38966:
URL: https://github.com/apache/spark/pull/38966#discussion_r1043430907


##########
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:
##########
@@ -434,12 +485,56 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
     val keyedInput = input.keyBy(_._2)
     val parallelStepResult = keyedInput
+      // Points with same or adjacent features must collocate within the same partition.
       .partitionBy(new RangePartitioner(keyedInput.getNumPartitions, keyedInput))
       .values
+      // Lexicographically sort points by features then labels.
       .mapPartitions(p => Iterator(p.toArray.sortBy(x => (x._2, x._1))))
+      // Aggregate points with equal features into a single point.
+      .map(makeUnique)
       .flatMap(poolAdjacentViolators)
       .collect()
-      .sortBy(x => (x._2, x._1)) // Sort again because collect() doesn't promise ordering.
+      // Sort again because collect() doesn't promise ordering.
+      .sortBy(x => (x._2, x._1))

Review Comment:
   You can open up a follow-up PR on the same JIRA if there are minor additional improvements or docs to suggest



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1342836098

   OK, as long as it's logical and consistent, and more logical/consistent than before, I think it's OK. Feel free to update docs as you see fit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1342838277

   I'll follow up with PR for doc and removing the redundant sort key


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1342981265

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341478310

   OK so the test result is different, but maybe wasn't correct in the first place. But it doesn't match sklearn now, and did before (?) I'm still a little concerned that this changes behavior in unintended ways, but maybe you have a better judgment about whether this is still mostly an improvement in the result


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341224238

   I don't remember whether I've checked all tests against sklearn. I'll double check anyways.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341502443

   > OK so the test result is different, but maybe wasn't correct in the first place. But it doesn't match sklearn now, and did before (?)
   
   I'll try to trace history of sklearn, maybe it has changed behavior.
   
   
    > I'm still a little concerned that this changes behavior in unintended ways, but maybe you have a better judgment about whether this is still mostly an improvement in the result
   
   All I can say here is that if sklearn should be the reference, then these tests are definitely wrong and the issue is because of not doing duplicate features consolidation.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341261931

   Double-checking on this branch, all tests produce same results as sklearn.
   
   On master, these two (already existing) tests produce different results than sklearn:
   - [`test("isotonic regression prediction with duplicate features")`](https://github.com/apache/spark/pull/38966/files#diff-6eae24fe4c455659652a83f423f0df48ad994237b3350f18953c39258ae1ab86R248-R261)
   - [`test("antitonic regression prediction with duplicate features")`](https://github.com/apache/spark/pull/38966/files#diff-6eae24fe4c455659652a83f423f0df48ad994237b3350f18953c39258ae1ab86R263-R276)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341173543

   > Is it fair to say this keeps existing tests and adds new ones? my only concern is about introducing new unintended behavior changes, but if it's passing all the previous tests, that's good enough for purposes here
   
   Well, the behavior has changed on duplicate features cases. Specifically, only these two tests I had to change to conform with the new behavior. These tests, on `master`, produce different results than `sklearn`:
   - [`test("isotonic regression prediction with duplicate features")`](https://github.com/apache/spark/pull/38966/files#diff-6eae24fe4c455659652a83f423f0df48ad994237b3350f18953c39258ae1ab86R248-R261)
   - [`test("antitonic regression prediction with duplicate features")`](https://github.com/apache/spark/pull/38966/files#diff-6eae24fe4c455659652a83f423f0df48ad994237b3350f18953c39258ae1ab86R263-R276)
   
   Added new tests:
   - [`test("SPARK-41008 isotonic regression with duplicate features differs from sklearn")`](https://github.com/apache/spark/pull/38966/files#diff-6eae24fe4c455659652a83f423f0df48ad994237b3350f18953c39258ae1ab86R226-R246)
   - [`test("makeUnique: handle duplicate features")`](https://github.com/apache/spark/pull/38966/files#diff-6eae24fe4c455659652a83f423f0df48ad994237b3350f18953c39258ae1ab86R327-R380)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341268162

   Aha, I see your point! It was varying because of the issue I'm fixing, which is about handling duplicate features.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1342832587

   I think we need to follow up with documentation updates https://spark.apache.org/docs/latest/mllib-isotonic-regression.html#:~:text=If%20the%20prediction,point%20are%20used. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38966:
URL: https://github.com/apache/spark/pull/38966#discussion_r1042358866


##########
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:
##########
@@ -307,6 +307,99 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     run(input.rdd.retag.asInstanceOf[RDD[(Double, Double, Double)]])
   }
 
+  /**
+   * Aggregates points of duplicate feature values into a single point using as label the weighted
+   * average of the labels of the points with duplicate feature values. All points for a unique
+   * feature values are aggregated as:
+   *
+   *   - Aggregated label is the weighted average of all labels
+   *   - Aggregated feature is the weighted average of all equal features[1]
+   *   - Aggregated weight is the sum of all weights
+   *
+   * [1] Note: It is possible that feature values to be equal up to a resolution due to
+   * representation errors, since we cannot know which feature value to use in that case, we
+   * compute the weighted average of the features. Ideally, all feature values will be equal and
+   * the weighted average is just the value at any point.
+   *
+   * @param input
+   *   Input data of tuples (label, feature, weight). Weights must be non-negative.
+   * @return
+   *   Points with unique feature values.
+   */
+  private[regression] def makeUnique(
+      input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
+
+    val cleanInput = input.filter { case (y, x, weight) =>
+      require(
+        weight >= 0.0,
+        s"Negative weight at point ($y, $x, $weight). Weights must be non-negative")
+      weight > 0
+    }
+
+    if (cleanInput.isEmpty) Array.empty

Review Comment:
   Expand the if-else so bodies are on new lines



##########
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:
##########
@@ -307,6 +307,99 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     run(input.rdd.retag.asInstanceOf[RDD[(Double, Double, Double)]])
   }
 
+  /**
+   * Aggregates points of duplicate feature values into a single point using as label the weighted
+   * average of the labels of the points with duplicate feature values. All points for a unique
+   * feature values are aggregated as:
+   *
+   *   - Aggregated label is the weighted average of all labels
+   *   - Aggregated feature is the weighted average of all equal features[1]
+   *   - Aggregated weight is the sum of all weights
+   *
+   * [1] Note: It is possible that feature values to be equal up to a resolution due to
+   * representation errors, since we cannot know which feature value to use in that case, we
+   * compute the weighted average of the features. Ideally, all feature values will be equal and
+   * the weighted average is just the value at any point.
+   *
+   * @param input
+   *   Input data of tuples (label, feature, weight). Weights must be non-negative.
+   * @return
+   *   Points with unique feature values.
+   */
+  private[regression] def makeUnique(
+      input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
+
+    val cleanInput = input.filter { case (y, x, weight) =>
+      require(
+        weight >= 0.0,
+        s"Negative weight at point ($y, $x, $weight). Weights must be non-negative")
+      weight > 0
+    }
+
+    if (cleanInput.isEmpty) Array.empty
+    else if (cleanInput.length <= 1) cleanInput
+    else {
+      // whether or not two double features are equal up to a precision
+      @inline def areEqual(a: Double, b: Double): Boolean = Precision.equals(a, b)
+
+      // Utility object, holds a buffer of all points with unique features so far, and performs
+      // weighted sum accumulation of points. Hides these details for better readability of the
+      // main algorithm.
+      object PointsAccumulator {

Review Comment:
   Maybe make this a top-level object in this file at least, if not a separate file



##########
mllib/src/test/scala/org/apache/spark/mllib/regression/IsotonicRegressionSuite.scala:
##########
@@ -194,40 +241,46 @@ class IsotonicRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
     assert(model.predict(0.5) === 1.5)
     assert(model.predict(0.75) === 1.75)
     assert(model.predict(1) === 2)
-    assert(model.predict(2) === 10d/3)
-    assert(model.predict(9) === 10d/3)
+    assert(model.predict(2) === 10d / 3)

Review Comment:
   You're welcome to just write 10.0 / 3 here too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341240367

   Double-checking is great, though, I'm asking if you meant in your comment that you know a test currently disagrees with sklearn already


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1341215158

   Of course yeah this is going to change behavior where a bug is fixed. Other than that, do the tests now return different answers from sklearn where they agreed before? I'm only concerned about other regressions here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ahmed-mahran commented on pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
ahmed-mahran commented on PR #38966:
URL: https://github.com/apache/spark/pull/38966#issuecomment-1342828376

   So, it is true: [scikit-learn has changed behavior](https://github.com/scikit-learn/scikit-learn/pull/4302) just after [spark introduced isotonic regression](https://github.com/apache/spark/pull/3519). I've tested different scikit-learn versions (namely: 0.13.1, 0.14.1, 0.15.0, 0.15.2) since isotonic regression was first introduced until the duplicates PR applied, and interestingly none has matched master's spark.
   
   ```python
   from sklearn.isotonic import IsotonicRegression
   
   IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test) # >= 0.15
   IsotonicRegression(increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test) # = 0.14
   IsotonicRegression().fit(x, y, weight=weights).predict(x_test) # = 0.13
   
   # test("isotonic regression prediction with duplicate features")
   x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [2, 1, 4, 2, 6, 5], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], True
   
   # spark  : array([ 1 ,  2,  4.5,  6 ])
   # v0.13.1: error
   # v0.14.1: error
   # v0.15.0: array([ nan,  2. ,  4.5,  5. ])
   # v0.15.2: array([ 2. ,  2. ,  4.5,  6. ])
   # v0.16.0: array([ 1.5 ,  2.25,  4.25,  5.5 ])
   
   # test("antitonic regression prediction with duplicate features")
   x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [5, 6, 2, 4, 1, 2], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], False
   
   # spark  : array([ 6 ,  4.5,  2,  1 ])
   # v0.13.1: error
   # v0.14.1: error
   # v0.15.0: array([  nan,  4.25,  2.25,  1.5 ])
   # v0.15.2: array([ 5.5 ,  4.25,  2.25,  1.5 ])
   # v0.16.0: array([ 5.5 ,  4.25,  2.25,  1.5 ])
   ```
   
   [This R package](https://cran.r-project.org/web/packages/isotone/vignettes/isotone.pdf) suggests that there are three methods for breaking ties: primary, secondary and tertiary. Spark and scikit-learn originally implemented "primary". Then scikit-learn replaced it with "secondary" which is implemented in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen closed pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #38966: [SPARK-41008][MLLIB] Dedup isotonic regression duplicate features
URL: https://github.com/apache/spark/pull/38966


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org