You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/28 09:10:45 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna ` and `DataFrame.na.drop `

zhengruifeng opened a new pull request, #38819:
URL: https://github.com/apache/spark/pull/38819

   ### What changes were proposed in this pull request?
    Implement `DataFrame.dropna ` and `DataFrame.na.drop`
   
   
   
   
   ### Why are the changes needed?
   For API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new method
   
   ### How was this patch tested?
   added UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1034250363


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -727,6 +727,77 @@ def fillna(
             session=self._session,
         )
 
+    def dropna(
+        self,
+        how: str = "any",
+        thresh: Optional[int] = None,
+        subset: Optional[Union[str, Tuple[str, ...], List[str]]] = None,
+    ) -> "DataFrame":
+        """Returns a new :class:`DataFrame` omitting rows with null values.
+        :func:`DataFrame.dropna` and :func:`DataFrameNaFunctions.drop` are aliases of each other.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        how : str, optional
+            'any' or 'all'.
+            If 'any', drop a row if it contains any nulls.
+            If 'all', drop a row only if all its values are null.
+        thresh: int, optional
+            default None
+            If specified, drop rows that have less than `thresh` non-null values.
+            This overwrites the `how` parameter.
+        subset : str, tuple or list, optional
+            optional list of column names to consider.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            DataFrame with null only rows excluded.
+        """
+        min_non_nulls: Optional[int] = None
+
+        if how is not None:
+            if not isinstance(how, str):
+                raise TypeError(f"how should be a str, but got {type(how).__name__}")
+            if how == "all":
+                min_non_nulls = 1
+            elif how == "any":
+                min_non_nulls = None
+            else:
+                raise ValueError("how ('" + how + "') should be 'any' or 'all'")
+
+        if thresh is not None:
+            if not isinstance(thresh, int):
+                raise TypeError(f"thresh should be a int, but got {type(thresh).__name__}")
+
+            # 'thresh' overwrites 'how'
+            min_non_nulls = thresh
+
+        _cols: List[str] = []
+        if subset is not None:
+            if isinstance(subset, str):
+                _cols = [subset]
+            elif isinstance(subset, (tuple, list)):
+                for c in subset:
+                    if not isinstance(c, str):
+                        raise TypeError(
+                            f"cols should be a str, tuple[str] or list[str], "
+                            f"but got {type(c).__name__}"
+                        )
+                _cols = list(subset)
+            else:
+                raise TypeError(
+                    f"cols should be a str, tuple[str] or list[str], "
+                    f"but got {type(subset).__name__}"
+                )
+
+        return DataFrame.withPlan(
+            plan.NADrop(child=self._plan, cols=_cols, min_non_nulls=min_non_nulls),
+            session=self._session,

Review Comment:
   other places still use `self._session` now, see https://github.com/apache/spark/blob/master/python/pyspark/sql/connect/dataframe.py#L132



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033287401


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   since parameter `how: String` can be converted to `min_non_nulls` https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L491-L497 ,
   
   only `min_non_nulls` is needed in proto message



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng closed pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng closed pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`
URL: https://github.com/apache/spark/pull/38819


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1034251475


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   nice, will update this doc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033921474


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -472,6 +472,46 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_drop_na(self):
+        # SPARK-41148: Test drop na
+        query = """
+            SELECT * FROM VALUES
+            (false, 1, NULL), (false, NULL, 2.0), (NULL, 3, 3.0)
+            AS tab(a, b, c)
+            """
+        # +-----+----+----+
+        # |    a|   b|   c|
+        # +-----+----+----+
+        # |false|   1|null|
+        # |false|null| 2.0|
+        # | null|   3| 3.0|
+        # +-----+----+----+

Review Comment:
   +1 this format looks nice!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] grundprinzip commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

grundprinzip commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033669941


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -727,6 +727,77 @@ def fillna(
             session=self._session,
         )
 
+    def dropna(
+        self,
+        how: str = "any",
+        thresh: Optional[int] = None,
+        subset: Optional[Union[str, Tuple[str, ...], List[str]]] = None,
+    ) -> "DataFrame":
+        """Returns a new :class:`DataFrame` omitting rows with null values.
+        :func:`DataFrame.dropna` and :func:`DataFrameNaFunctions.drop` are aliases of each other.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        how : str, optional
+            'any' or 'all'.
+            If 'any', drop a row if it contains any nulls.
+            If 'all', drop a row only if all its values are null.
+        thresh: int, optional
+            default None
+            If specified, drop rows that have less than `thresh` non-null values.
+            This overwrites the `how` parameter.
+        subset : str, tuple or list, optional
+            optional list of column names to consider.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            DataFrame with null only rows excluded.
+        """
+        min_non_nulls: Optional[int] = None
+
+        if how is not None:
+            if not isinstance(how, str):
+                raise TypeError(f"how should be a str, but got {type(how).__name__}")
+            if how == "all":
+                min_non_nulls = 1
+            elif how == "any":
+                min_non_nulls = None
+            else:
+                raise ValueError("how ('" + how + "') should be 'any' or 'all'")
+
+        if thresh is not None:
+            if not isinstance(thresh, int):
+                raise TypeError(f"thresh should be a int, but got {type(thresh).__name__}")
+
+            # 'thresh' overwrites 'how'
+            min_non_nulls = thresh
+
+        _cols: List[str] = []
+        if subset is not None:
+            if isinstance(subset, str):
+                _cols = [subset]
+            elif isinstance(subset, (tuple, list)):
+                for c in subset:
+                    if not isinstance(c, str):
+                        raise TypeError(
+                            f"cols should be a str, tuple[str] or list[str], "
+                            f"but got {type(c).__name__}"
+                        )
+                _cols = list(subset)
+            else:
+                raise TypeError(
+                    f"cols should be a str, tuple[str] or list[str], "
+                    f"but got {type(subset).__name__}"
+                )
+
+        return DataFrame.withPlan(
+            plan.NADrop(child=self._plan, cols=_cols, min_non_nulls=min_non_nulls),
+            session=self._session,

Review Comment:
   this should probably `self._session.client` now.



##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -472,6 +472,46 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_drop_na(self):
+        # SPARK-41148: Test drop na
+        query = """
+            SELECT * FROM VALUES
+            (false, 1, NULL), (false, NULL, 2.0), (NULL, 3, 3.0)
+            AS tab(a, b, c)
+            """
+        # +-----+----+----+
+        # |    a|   b|   c|
+        # +-----+----+----+
+        # |false|   1|null|
+        # |false|null| 2.0|
+        # | null|   3| 3.0|
+        # +-----+----+----+
+
+        self.assertTrue(

Review Comment:
   is `self.assert_eq` not applicable here? If we consistently hit issues where we can't use the Pandas equality helper let's figure out how to improve.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38819:
URL: https://github.com/apache/spark/pull/38819#issuecomment-1330096394

   the last commit only change the number of `min_non_nulls` from 4 to 3, I'm going to merge this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033920039


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   In this case, can you document on top  the that `optional int32 min_non_nulls = 4;` for clients. E.g. string `any` = 1.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna ` and `DataFrame.na.drop `

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033287401


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   since parameter `how: String` can be expressed by `min_non_nulls` https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L491-L497 ,
   
   only `min_non_nulls` is added in proto message



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033287401


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   since parameter `how: String` can be converted to `min_non_nulls` 
   
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L491-L497
   
   only `min_non_nulls` is needed in proto message



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38819:
URL: https://github.com/apache/spark/pull/38819#issuecomment-1330097738

   merged into master, thanks all for the reviews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033287401


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   since parameter `how: String` can be expressed by `min_non_nulls` https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L491-L497 ,
   
   only `min_non_nulls` is needed in proto message



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38819:
URL: https://github.com/apache/spark/pull/38819#issuecomment-1328938651

   cc @HyukjinKwon @amaliujia @grundprinzip @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033287401


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   since parameter `how: String` can be converted to `min_non_nulls` 
   
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L491-L497
   
   
   only `min_non_nulls` is needed in proto message



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1034217228


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -472,6 +472,46 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_drop_na(self):
+        # SPARK-41148: Test drop na
+        query = """
+            SELECT * FROM VALUES
+            (false, 1, NULL), (false, NULL, 2.0), (NULL, 3, 3.0)
+            AS tab(a, b, c)
+            """
+        # +-----+----+----+
+        # |    a|   b|   c|
+        # +-----+----+----+
+        # |false|   1|null|
+        # |false|null| 2.0|
+        # | null|   3| 3.0|
+        # +-----+----+----+
+
+        self.assertTrue(

Review Comment:
   thanks, will update soon. 
   
   I guess I copied those test from somewhere and didnot check the comparsion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38819: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38819:
URL: https://github.com/apache/spark/pull/38819#discussion_r1033920039


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   In this case, can you document these two string case and how they are expected to be mapped on top  the that `optional int32 min_non_nulls = 4;` for clients. E.g. string `any` = 1?



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -440,6 +441,26 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to compute the results.
+message NADrop {

Review Comment:
   In this case, can you document these two string case and how they are expected to be mapped on top of the `optional int32 min_non_nulls = 4;` for clients. E.g. string `any` = 1?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org