Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/19 08:02:46 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38310: [SPARK-40839][CONNECT][PYTHON][WIP] Implement `DataFrame.sample`

zhengruifeng opened a new pull request, #38310:
URL: https://github.com/apache/spark/pull/38310

   ### What changes were proposed in this pull request?
   Implement `DataFrame.sample` in Connect
   
   
   ### Why are the changes needed?
   for DataFrame API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, new API
   
   
   ### How was this patch tested?
   added UT




[GitHub] [spark] cloud-fan commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999540246


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame
+
+        (withReplacement, fraction, seed) = PySparkDataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        if withReplacement is None:
+            withReplacement = False
+        if seed is None:
+            # TODO: make 'seed' optional in proto, then we can use 'Utils.random.nextLong' in JVM

Review Comment:
   @amaliujia We should really consider this. The principle is to move code implementation to the server side as much as possible. We just moved the identifier parsing logic to the server side, and we should probably do the same for parameter default values.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000055500


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,

Review Comment:
   Maybe we should just leverage keyword-only arguments, which would make the logic much simpler. Actually, we wanted to do this in the PySpark API layer in the past. Since this is a new API layer, I think it's a good chance to replace them. cc @ueshin





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON][WIP] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999082997


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1697,7 +1697,19 @@ def sample(  # type: ignore[misc]
         >>> df.sample(False, fraction=1.0).count()
         10
         """
+        (withReplacement, fraction, seed) = DataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        args = [arg for arg in [withReplacement, fraction, seed] if arg is not None]
+        jdf = self._jdf.sample(*args)
+        return DataFrame(jdf, self.sparkSession)
 
+    @staticmethod
+    def _prepare_augments_for_sample(

Review Comment:
   the pre-processing of `sample` arguments is pretty complex, so I made it a static method and reuse it in Connect
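
   For context, here is a minimal sketch of the kind of disambiguation such a helper has to do, assuming the same positional overloads as pyspark's `sample`; the helper name and body here are illustrative, not the method added in this PR:

   ```py
   from typing import Optional, Tuple, Union

   def _prepare_sample_args(
       withReplacement: Optional[Union[float, bool]],
       fraction: Optional[Union[int, float]],
       seed: Optional[int],
   ) -> Tuple[Optional[bool], float, Optional[int]]:
       if isinstance(withReplacement, float):
           # Called as sample(fraction) or sample(fraction, seed):
           # shift the positional arguments one slot to the right.
           withReplacement, fraction, seed = None, withReplacement, fraction
       if not isinstance(fraction, float):
           raise TypeError("fraction must be a float")
       return withReplacement, fraction, seed
   ```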





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999898020


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame
+
+        (withReplacement, fraction, seed) = PySparkDataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        if withReplacement is None:
+            withReplacement = False

Review Comment:
   The default bool value for proto is `False` so this is probably not needed.
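
   A quick illustration of that proto3 behavior (assuming the generated `proto` module that plan.py already imports):

   ```py
   import pyspark.sql.connect.proto as proto

   rel = proto.Relation()
   # A never-set proto3 scalar reads back as its zero value,
   # so the explicit None -> False branch is redundant:
   print(rel.sample.with_replacement)  # False
   ```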





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999901587


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(

Review Comment:
   The pyspark dataframe API has
   ```
       @overload
       def sample(self, fraction: float, seed: Optional[int] = ...) -> "DataFrame":
           ...
   
       @overload
       def sample(
           self,
           withReplacement: Optional[bool],
           fraction: float,
           seed: Optional[int] = ...,
       ) -> "DataFrame":
           ...
   ```
   
   Can we match (as easily as copying the API into the Connect dataframe.py)?
   





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999897760


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame
+
+        (withReplacement, fraction, seed) = PySparkDataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        if withReplacement is None:
+            withReplacement = False
+        if seed is None:
+            # TODO: make 'seed' optional in proto, then we can use 'Utils.random.nextLong' in JVM

Review Comment:
   This makes sense.
   
   @zhengruifeng I am thinking you could wrap this `seed` into a proto message, so that the server side can know whether it is set or not. In that case, the server side can do the random generation rather than using the value from the proto.
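
   A rough client-side sketch of that idea (hypothetical helper; the inner field name of the `Seed` wrapper is only assumed to be `seed` here):

   ```py
   from typing import Optional

   def set_sample_seed(sample: "proto.Sample", seed: Optional[int]) -> None:
       # Leaving the wrapper untouched keeps the field unset, so the server
       # can tell "no seed given" apart from "seed == 0" and fall back to
       # generating its own random seed.
       if seed is not None:
           sample.seed.seed = seed  # populating the nested message marks it as present
   ```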







[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999900237


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame
+
+        (withReplacement, fraction, seed) = PySparkDataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        if withReplacement is None:
+            withReplacement = False
+        if seed is None:
+            # TODO: make 'seed' optional in proto, then we can use 'Utils.random.nextLong' in JVM

Review Comment:
   This is an example: https://github.com/apache/spark/pull/38275





[GitHub] [spark] zhengruifeng commented on pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38310:
URL: https://github.com/apache/spark/pull/38310#issuecomment-1283839836

   cc @HyukjinKwon @amaliujia @cloud-fan @grundprinzip 




[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000114991


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(

Review Comment:
   users may have to change their code for this migration, but I think this is also a chance to make some changes.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000151029


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -192,5 +192,9 @@ message Sample {
   double lower_bound = 2;
   double upper_bound = 3;
   bool with_replacement = 4;
-  int64 seed = 5;
+  Seed seed = 5;
+}
+
+message Seed {

Review Comment:
   I need to define `Seed` outside of `Sample`; otherwise there is no `HasSeed` method in the generated files





[GitHub] [spark] cloud-fan commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999541952


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1697,7 +1697,19 @@ def sample(  # type: ignore[misc]
         >>> df.sample(False, fraction=1.0).count()
         10
         """
+        (withReplacement, fraction, seed) = DataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        args = [arg for arg in [withReplacement, fraction, seed] if arg is not None]
+        jdf = self._jdf.sample(*args)
+        return DataFrame(jdf, self.sparkSession)
 
+    @staticmethod
+    def _prepare_augments_for_sample(

Review Comment:
   If we do need to share code between pyspark and spark connect python client, we should probably add a new module like `python-common`



##########
python/pyspark/sql/dataframe.py:
##########
@@ -1697,7 +1697,19 @@ def sample(  # type: ignore[misc]
         >>> df.sample(False, fraction=1.0).count()
         10
         """
+        (withReplacement, fraction, seed) = DataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        args = [arg for arg in [withReplacement, fraction, seed] if arg is not None]
+        jdf = self._jdf.sample(*args)
+        return DataFrame(jdf, self.sparkSession)
 
+    @staticmethod
+    def _prepare_augments_for_sample(

Review Comment:
   If we do need to share code between pyspark and spark connect python client, we should probably add a new module like `pyspark-common`





[GitHub] [spark] cloud-fan commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999537005


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame

Review Comment:
   oh, does the Spark Connect Python client depend on pyspark? Then it's not a thin client anymore...





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000120190


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(

Review Comment:
   Sure. We can also go in that direction.





[GitHub] [spark] ueshin commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
ueshin commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1001265307


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,

Review Comment:
   `withReplacement` can be `: bool = False` if the default is `False`.





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1001029243


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -192,5 +192,9 @@ message Sample {
   double lower_bound = 2;
   double upper_bound = 3;
   bool with_replacement = 4;
-  int64 seed = 5;
+  Seed seed = 5;
+}
+
+message Seed {

Review Comment:
   Yeah, I always do a clean and then build.





[GitHub] [spark] HyukjinKwon closed pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`
URL: https://github.com/apache/spark/pull/38310




[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000115922


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame
+
+        (withReplacement, fraction, seed) = PySparkDataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        if withReplacement is None:
+            withReplacement = False
+        if seed is None:
+            # TODO: make 'seed' optional in proto, then we can use 'Utils.random.nextLong' in JVM

Review Comment:
   yeah, let me make this change





[GitHub] [spark] amaliujia commented on pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38310:
URL: https://github.com/apache/spark/pull/38310#issuecomment-1285979019

   LGTM




[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000065722


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,

Review Comment:
   yes, that's a bit confusing at first glance.





[GitHub] [spark] grundprinzip commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000162117


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -192,5 +192,9 @@ message Sample {
   double lower_bound = 2;
   double upper_bound = 3;
   bool with_replacement = 4;
-  int64 seed = 5;
+  Seed seed = 5;
+}
+
+message Seed {

Review Comment:
   This is not true. The `has*` methods are generated for non-simple (message-typed) fields.
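
   A minimal sketch of that presence rule, assuming the `Seed seed = 5;` shape from the diff above and the generated Python module that plan.py imports:

   ```py
   import pyspark.sql.connect.proto as proto

   sample = proto.Sample()
   print(sample.HasField("seed"))        # works: message-typed fields have presence
   # sample.HasField("with_replacement") # raises ValueError: plain proto3 scalars do not
   ```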





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000173502


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -192,5 +192,9 @@ message Sample {
   double lower_bound = 2;
   double upper_bound = 3;
   bool with_replacement = 4;
-  int64 seed = 5;
+  Seed seed = 5;
+}
+
+message Seed {

Review Comment:
   you are right, maybe the jars were out of sync at that time; let me move `Seed` into `Sample`





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999903231


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame
+
+        (withReplacement, fraction, seed) = PySparkDataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        if withReplacement is None:
+            withReplacement = False

Review Comment:
   Oh, the plan definition is not `Optional` for `withReplacement`. In this case, setting it to `False` probably makes sense.
   
   ```
   class Sample(LogicalPlan):
       def __init__(
           self,
           child: Optional["LogicalPlan"],
           lower_bound: float,
           upper_bound: float,
           with_replacement: bool,
           seed: int,
       ) -> None:
   ```





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999904256


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -310,6 +310,55 @@ def _repr_html_(self) -> str:
         """
 
 
+class Sample(LogicalPlan):
+    def __init__(
+        self,
+        child: Optional["LogicalPlan"],
+        lower_bound: float,
+        upper_bound: float,
+        with_replacement: bool,
+        seed: int,
+    ) -> None:
+        super().__init__(child)
+        self.lower_bound = lower_bound
+        self.upper_bound = upper_bound
+        self.with_replacement = with_replacement
+        self.seed = seed
+
+    def plan(self, session: Optional["RemoteSparkSession"]) -> proto.Relation:
+        assert self._child is not None
+        plan = proto.Relation()
+        plan.sample.input.CopyFrom(self._child.plan(session))
+        plan.sample.lower_bound = self.lower_bound
+        plan.sample.upper_bound = self.upper_bound
+        plan.sample.with_replacement = self.with_replacement
+        plan.sample.seed = self.seed
+        return plan
+
+    def print(self, indent: int = 0) -> str:

Review Comment:
   For my self-education on Python: why do we need `print` given that we have `_repr_html_`?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000065319


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(

Review Comment:
   I guess we can discard those overloads? @HyukjinKwon





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r999895806


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame

Review Comment:
   yes, this now depends on pyspark. In fact, it has depended on pyspark since the first PR. For the short term it is OK. cc @HyukjinKwon
   
   I guess we will need to make a final decision on whether it should depend on pyspark or not before doing the Python packaging and release.







[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000115457


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -310,6 +310,55 @@ def _repr_html_(self) -> str:
         """
 
 
+class Sample(LogicalPlan):
+    def __init__(
+        self,
+        child: Optional["LogicalPlan"],
+        lower_bound: float,
+        upper_bound: float,
+        with_replacement: bool,
+        seed: int,
+    ) -> None:
+        super().__init__(child)
+        self.lower_bound = lower_bound
+        self.upper_bound = upper_bound
+        self.with_replacement = with_replacement
+        self.seed = seed
+
+    def plan(self, session: Optional["RemoteSparkSession"]) -> proto.Relation:
+        assert self._child is not None
+        plan = proto.Relation()
+        plan.sample.input.CopyFrom(self._child.plan(session))
+        plan.sample.lower_bound = self.lower_bound
+        plan.sample.upper_bound = self.upper_bound
+        plan.sample.with_replacement = self.with_replacement
+        plan.sample.seed = self.seed
+        return plan
+
+    def print(self, indent: int = 0) -> str:

Review Comment:
   I don't know either; I just followed the other LogicalPlan classes here.
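
   For what it's worth, a generic illustration (not the Connect plan code): `_repr_html_` is the hook IPython/Jupyter calls automatically when displaying an object, while a `print`-style method is a plain-text helper invoked explicitly.

   ```py
   class Node:
       def print(self, indent: int = 0) -> str:
           # explicit plain-text rendering, called by the developer
           return " " * indent + "Node"

       def _repr_html_(self) -> str:
           # hook that notebooks call automatically to render the object
           return "<b>Node</b>"

   node = Node()
   print(node.print(indent=2))  # "  Node" -- manual, plain text
   # Evaluating `node` in a notebook cell would render the HTML form instead.
   ```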





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000115630


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,
+    ) -> "DataFrame":
+        from pyspark.sql import DataFrame as PySparkDataFrame
+
+        (withReplacement, fraction, seed) = PySparkDataFrame._prepare_augments_for_sample(
+            withReplacement, fraction, seed
+        )
+        if withReplacement is None:
+            withReplacement = False

Review Comment:
   yeah, let me make this change





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1001277403


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,

Review Comment:
   I like this idea 







[GitHub] [spark] ueshin commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
ueshin commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1001255432


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(
+        self,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
+        seed: Optional[int] = None,

Review Comment:
   Yes, if we can break the signature, it would be:
   
   ```py
   def sample(
       self,
       fraction: float,
       *,
       withReplacement: Optional[bool] = None,
       seed: Optional[int] = None,
   ) -> "DataFrame":
       ...
   ```
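
   Under such a keyword-only signature, call sites would look roughly like this (illustrative only, not the merged behavior):

   ```py
   df.sample(0.1)                                 # fraction is the only positional argument
   df.sample(0.1, withReplacement=True, seed=42)  # the rest must be passed by keyword
   # df.sample(False, 0.1)                        # would now fail: only one positional allowed
   ```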





[GitHub] [spark] amaliujia commented on a diff in pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38310:
URL: https://github.com/apache/spark/pull/38310#discussion_r1000108835


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
         """Sort by a specific column"""
         return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)
 
+    def sample(

Review Comment:
   Maybe my real question is: will we have an issue staying compatible with existing pyspark DataFrame code (different imports aside, of course) if we discard such an API? I see many other similar APIs in the pyspark DataFrame.





[GitHub] [spark] HyukjinKwon commented on pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38310:
URL: https://github.com/apache/spark/pull/38310#issuecomment-1286624995

   Merged to master.




[GitHub] [spark] zhengruifeng commented on pull request #38310: [SPARK-40839][CONNECT][PYTHON] Implement `DataFrame.sample`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38310:
URL: https://github.com/apache/spark/pull/38310#issuecomment-1286804042

   thank you guys

