You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/21 03:44:02 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

zhengruifeng opened a new pull request, #38734:
URL: https://github.com/apache/spark/pull/38734

   ### What changes were proposed in this pull request?
   Implement `DataFrame.isEmpty`
   
   
   ### Why are the changes needed?
   API Coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, new api
   
   
   ### How was this patch tested?
   added UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38734:
URL: https://github.com/apache/spark/pull/38734#issuecomment-1323345244

   merged into master, thanks all for reviews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
URL: https://github.com/apache/spark/pull/38734


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028961395


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,18 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.
+
+        .. versionadded:: 3.4.0
+
+        Returns
+        -------
+        bool
+            Whether it's empty DataFrame or not.
+        """
+        return len(self.take(1)) == 0

Review Comment:
   This makes me think that maybe we should have some shared code between pyspark and python connect client, to share some API definitions (api doc, function signature) and functions with default implementations. cc @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028724092


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.

Review Comment:
   Nit: 
   
   When I read this, I actually not sure if it means the size of DataFrame or it means DataFrame is None. I guess it is the size. Not sure if there is a better to clarify this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38734:
URL: https://github.com/apache/spark/pull/38734#issuecomment-1322927685

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028721327


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.
+
+        .. versionadded:: 3.4.0
+
+        Returns
+        -------
+        bool
+            Whether it's empty DataFrame or not.
+        """
+        if "is_empty" not in self._cache:
+            self._cache["is_empty"] = len(self.take(1)) == 0
+        return bool(self._cache["is_empty"])

Review Comment:
   I am fine with that but we can even do the caching in the DataFrame instead of Spark Connect I guess (?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028812420


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.

Review Comment:
   this doc was copied from pyspark and scala api. I think the dataframe itself is expected to be not None, otherwise, this method can not be called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028789393


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.
+
+        .. versionadded:: 3.4.0
+
+        Returns
+        -------
+        bool
+            Whether it's empty DataFrame or not.
+        """
+        if "is_empty" not in self._cache:
+            self._cache["is_empty"] = len(self.take(1)) == 0
+        return bool(self._cache["is_empty"])

Review Comment:
   oh, let me update it. Maybe it is time to design how to do the caching, I will work on it.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38734:
URL: https://github.com/apache/spark/pull/38734#issuecomment-1322863152

   @HyukjinKwon @amaliujia @cloud-fan @grundprinzip 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028711626


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.
+
+        .. versionadded:: 3.4.0
+
+        Returns
+        -------
+        bool
+            Whether it's empty DataFrame or not.
+        """
+        if "is_empty" not in self._cache:
+            self._cache["is_empty"] = len(self.take(1)) == 0
+        return bool(self._cache["is_empty"])

Review Comment:
   I think we talked about that do not do cache now but thinking about build general caching layer cc @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028711954


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.
+
+        .. versionadded:: 3.4.0
+
+        Returns
+        -------
+        bool
+            Whether it's empty DataFrame or not.
+        """
+        if "is_empty" not in self._cache:
+            self._cache["is_empty"] = len(self.take(1)) == 0
+        return bool(self._cache["is_empty"])

Review Comment:
   here https://github.com/apache/spark/pull/38546#discussion_r1017482919



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028724092


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame
 
+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.

Review Comment:
   Nit: 
   
   When I read this, I actually not sure if it means the size of DataFrame or it means DataFrame is None. I guess it is the size. Not sure if there is a better way to clarify this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org