Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/31 12:36:15 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

zhengruifeng opened a new pull request, #39328:
URL: https://github.com/apache/spark/pull/39328

   ### What changes were proposed in this pull request?
    Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`
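
   For context, a minimal usage sketch of the API being implemented here (the data and fractions below are illustrative; the semantics follow the existing PySpark `DataFrame.sampleBy`):

   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import col

   spark = SparkSession.builder.getOrCreate()
   df = spark.range(0, 100).withColumn("key", col("id") % 3)

   # Keep ~10% of rows where key == 0 and ~20% where key == 1; strata not
   # listed in `fractions` (key == 2 here) are sampled with fraction 0.
   sampled = df.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
   sampled.groupBy("key").count().show()
   ```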
   
   
   ### Why are the changes needed?
   For API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   ### How was this patch tested?
   added UT




[GitHub] [spark] HyukjinKwon commented on pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39328:
URL: https://github.com/apache/spark/pull/39328#issuecomment-1368577854

   Merged to master.




[GitHub] [spark] zhengruifeng commented on a diff in pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39328:
URL: https://github.com/apache/spark/pull/39328#discussion_r1059717018


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -546,6 +547,34 @@ message StatFreqItems {
   optional double support = 3;
 }
 
+
+// Returns a stratified sample without replacement based on the fraction
+// given on each stratum.
+message StatSampleBy {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) The column that defines strata.
+  Expression col = 2;
+
+  // (Required) Sampling fraction for each stratum.
+  //
+  // If a stratum is not specified, we treat its fraction as zero.
+  repeated Fraction fractions = 3;
+
+  // (Optional) The random seed.
+  optional int64 seed = 5;

Review Comment:
   Here I want to keep in line with the other methods, which generate a random seed on the server if one is not provided in the proto.
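
   A self-contained sketch of that convention, purely for illustration (the helper name and seed range are assumptions, not the actual server code): when the optional field is unset, the server draws a random seed itself.

   ```python
   import random
   from typing import Optional

   def resolve_seed(seed: Optional[int]) -> int:
       # Illustrative helper: use the client-supplied seed when present,
       # otherwise generate one, mirroring how other relations default an
       # unset optional seed on the server side.
       return seed if seed is not None else random.randrange(1 << 63)

   assert resolve_seed(42) == 42   # explicit seed passes through unchanged
   print(resolve_seed(None))       # unset seed gets a generated value
   ```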





[GitHub] [spark] beliefer commented on a diff in pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
beliefer commented on code in PR #39328:
URL: https://github.com/apache/spark/pull/39328#discussion_r1059719373


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -1140,6 +1140,50 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
         return plan
 
 
+class StatSampleBy(LogicalPlan):
+    def __init__(
+        self,
+        child: Optional["LogicalPlan"],
+        col: "ColumnOrName",
+        fractions: Dict[Any, float],
+        seed: Optional[int],
+    ) -> None:
+        super().__init__(child)
+
+        assert col is not None and isinstance(col, (Column, str))
+
+        assert fractions is not None and isinstance(fractions, dict)
+        for k, v in fractions.items():
+            assert v is not None and isinstance(v, float)
+
+        assert seed is None or isinstance(seed, int)

Review Comment:
   I think we can let the server side require the client to pass a seed.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39328:
URL: https://github.com/apache/spark/pull/39328#discussion_r1059717128


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -952,6 +952,26 @@ def freqItems(
 
     freqItems.__doc__ = PySparkDataFrame.freqItems.__doc__
 
+    def sampleBy(
+        self, col: "ColumnOrName", fractions: Dict[Any, float], seed: Optional[int] = None
+    ) -> "DataFrame":
+        if not isinstance(col, (Column, str)):

Review Comment:
   There is no behavior change, since the underlying plan can accept a `ColumnOrName`.
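
   A tiny stand-alone sketch of why that holds (the `Column` stand-in and `to_column` helper are hypothetical, only to illustrate a plan layer that accepts either form):

   ```python
   from typing import Union

   class Column:
       # Hypothetical stand-in for the real Column type.
       def __init__(self, name: str) -> None:
           self.name = name

   ColumnOrName = Union[Column, str]

   def to_column(col: ColumnOrName) -> Column:
       # The plan layer normalizes either a column object or a column name,
       # so the DataFrame method only needs to validate the type.
       return col if isinstance(col, Column) else Column(col)

   assert to_column("key").name == "key"
   assert to_column(Column("key")).name == "key"
   ```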





[GitHub] [spark] beliefer commented on a diff in pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
beliefer commented on code in PR #39328:
URL: https://github.com/apache/spark/pull/39328#discussion_r1059719245


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -952,6 +952,26 @@ def freqItems(
 
     freqItems.__doc__ = PySparkDataFrame.freqItems.__doc__
 
+    def sampleBy(
+        self, col: "ColumnOrName", fractions: Dict[Any, float], seed: Optional[int] = None
+    ) -> "DataFrame":
+        if not isinstance(col, (Column, str)):

Review Comment:
   Got it.





[GitHub] [spark] HyukjinKwon closed pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`
URL: https://github.com/apache/spark/pull/39328




[GitHub] [spark] beliefer commented on a diff in pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
beliefer commented on code in PR #39328:
URL: https://github.com/apache/spark/pull/39328#discussion_r1059696862


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -546,6 +547,34 @@ message StatFreqItems {
   optional double support = 3;
 }
 
+
+// Returns a stratified sample without replacement based on the fraction
+// given on each stratum.
+message StatSampleBy {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) The column that defines strata.
+  Expression col = 2;
+
+  // (Required) Sampling fraction for each stratum.
+  //
+  // If a stratum is not specified, we treat its fraction as zero.
+  repeated Fraction fractions = 3;
+
+  // (Optional) The random seed.
+  optional int64 seed = 5;

Review Comment:
   It seems to be required.



##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -546,6 +547,34 @@ message StatFreqItems {
   optional double support = 3;
 }
 
+
+// Returns a stratified sample without replacement based on the fraction
+// given on each stratum.

Review Comment:
   It would be better to keep the comment



##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -546,6 +547,34 @@ message StatFreqItems {
   optional double support = 3;
 }
 
+
+// Returns a stratified sample without replacement based on the fraction
+// given on each stratum.

Review Comment:
   `It will invoke 'Dataset.stat.sampleBy' (same as 'StatFunctions.sampleBy') to compute the results.` should be added.



##########
python/pyspark/sql/connect/plan.py:
##########
@@ -1140,6 +1140,50 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
         return plan
 
 
+class StatSampleBy(LogicalPlan):
+    def __init__(
+        self,
+        child: Optional["LogicalPlan"],
+        col: "ColumnOrName",
+        fractions: Dict[Any, float],
+        seed: Optional[int],
+    ) -> None:
+        super().__init__(child)
+
+        assert col is not None and isinstance(col, (Column, str))
+
+        assert fractions is not None and isinstance(fractions, dict)
+        for k, v in fractions.items():
+            assert v is not None and isinstance(v, float)
+
+        assert seed is None or isinstance(seed, int)

Review Comment:
   Do we need to add these checks here?



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -419,6 +421,26 @@ class SparkConnectPlanner(session: SparkSession) {
     }
   }
 
+  private def transformStatSampleBy(rel: proto.StatSampleBy): LogicalPlan = {
+    val fractions = mutable.Map.empty[Any, Double]

Review Comment:
   How about
   ```
   val fractions = rel.getFractionsList.asScala.toSeq.map { protoFraction =>
   ...
   }
   ```



##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -952,6 +952,26 @@ def freqItems(
 
     freqItems.__doc__ = PySparkDataFrame.freqItems.__doc__
 
+    def sampleBy(
+        self, col: "ColumnOrName", fractions: Dict[Any, float], seed: Optional[int] = None
+    ) -> "DataFrame":
+        if not isinstance(col, (Column, str)):

Review Comment:
   The behavior is changed from `pyspark.sql`:
   ```
   if isinstance(col, str):
       col = Column(col)
   ```





[GitHub] [spark] zhengruifeng commented on pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39328:
URL: https://github.com/apache/spark/pull/39328#issuecomment-1368210787

   cc @HyukjinKwon @beliefer 




[GitHub] [spark] zhengruifeng commented on a diff in pull request #39328: [SPARK-41066][CONNECT][PYTHON] Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39328:
URL: https://github.com/apache/spark/pull/39328#discussion_r1059717236


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -1140,6 +1140,50 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
         return plan
 
 
+class StatSampleBy(LogicalPlan):
+    def __init__(
+        self,
+        child: Optional["LogicalPlan"],
+        col: "ColumnOrName",
+        fractions: Dict[Any, float],
+        seed: Optional[int],
+    ) -> None:
+        super().__init__(child)
+
+        assert col is not None and isinstance(col, (Column, str))
+
+        assert fractions is not None and isinstance(fractions, dict)
+        for k, v in fractions.items():
+            assert v is not None and isinstance(v, float)
+
+        assert seed is None or isinstance(seed, int)

Review Comment:
   I prefer adding assertions in the plan layer to make sure all parameters are as expected.



##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -546,6 +547,34 @@ message StatFreqItems {
   optional double support = 3;
 }
 
+
+// Returns a stratified sample without replacement based on the fraction
+// given on each stratum.

Review Comment:
   nice, will update


