You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/25 21:33:26 UTC

[GitHub] [spark] khalidmammadov opened a new pull request, #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

khalidmammadov opened a new pull request, #37662:
URL: https://github.com/apache/spark/pull/37662

   What changes were proposed in this pull request?
   Docstring improvements
   
   Why are the changes needed?
   To help users to understand pyspark API
   
   Does this PR introduce any user-facing change?
   Yes, documentation
   
   How was this patch tested?
   python/run-tests --testnames pyspark.sql.functions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

itholic commented on PR #37662:
URL: https://github.com/apache/spark/pull/37662#issuecomment-1227879190

   Would you like to keep the existing format for PR description ??


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #37662:
URL: https://github.com/apache/spark/pull/37662#issuecomment-1229250742

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #37662:
URL: https://github.com/apache/spark/pull/37662#discussion_r955805923


##########
python/pyspark/sql/functions.py:
##########
@@ -2189,6 +2355,25 @@ def broadcast(df: DataFrame) -> DataFrame:
     Marks a DataFrame as small enough for use in broadcast joins.
 
     .. versionadded:: 1.6.0
+
+    Returns
+    -------
+    :class:`~pyspark.sql.DataFrame`
+        DataFrame marked as ready for broadcast join.
+
+    Examples
+    --------
+    >>> from pyspark.sql import types
+    >>> df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
+    >>> df_small = spark.range(3)
+    >>> df_b = broadcast(df_small)
+    >>> df.join(df_b, df.value == df_small.id).show()

Review Comment:
   What about using `explain(True)` to explicitly show the `broadcast` is used as a strategy for `ResolvedHint` ??
   
   ```python
   >>> df.join(df_b, df.value == df_small.id).explain(True)
   == Parsed Logical Plan ==
   Join Inner, (cast(value#267 as bigint) = id#269L)
   :- LogicalRDD [value#267], false
   +- ResolvedHint (strategy=broadcast)
      +- Range (0, 3, step=1, splits=Some(16))
   
   == Analyzed Logical Plan ==
   value: int, id: bigint
   Join Inner, (cast(value#267 as bigint) = id#269L)
   :- LogicalRDD [value#267], false
   +- ResolvedHint (strategy=broadcast)
      +- Range (0, 3, step=1, splits=Some(16))
   
   == Optimized Logical Plan ==
   Join Inner, (cast(value#267 as bigint) = id#269L), rightHint=(strategy=broadcast)
   :- Filter isnotnull(value#267)
   :  +- LogicalRDD [value#267], false
   +- Range (0, 3, step=1, splits=Some(16))
   
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- BroadcastHashJoin [cast(value#267 as bigint)], [id#269L], Inner, BuildRight, false
      :- Filter isnotnull(value#267)
      :  +- Scan ExistingRDD[value#267]
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=164]
         +- Range (0, 3, step=1, splits=16)
   ```



##########
python/pyspark/sql/functions.py:
##########
@@ -2440,6 +2804,38 @@ def last(col: "ColumnOrName", ignorenulls: bool = False) -> Column:
     -----
     The function is non-deterministic because its results depends on the order of the
     rows which may be non-deterministic after a shuffle.
+
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column to fetch last value for.
+    ignorenulls : :class:`~pyspark.sql.Column` or str
+        if last value is null then look for non-null value.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        last value of the group.
+
+    Examples
+    --------
+    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ("name", "age"))
+    >>> df = df.orderBy(df.age.desc())
+    >>> df.groupby("name").agg(last("age")).orderBy("name").show()
+    +-----+---------+
+    | name|last(age)|
+    +-----+---------+
+    |Alice|     null|
+    |  Bob|        5|
+    +-----+---------+
+
+    >>> df.groupby("name").agg(last("age", True)).orderBy("name").show()

Review Comment:
   Here, also can we add a simple description why we're setting the `ignorenulls` as `True` ?



##########
python/pyspark/sql/functions.py:
##########
@@ -2377,6 +2672,16 @@ def grouping_id(*cols: "ColumnOrName") -> Column:
     The list of columns should match with grouping columns exactly, or empty (means all
     the grouping columns).
 
+    Parameters
+    ----------
+    cols : :class:`~pyspark.sql.Column` or str
+        columns to check for.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        returns level of the grouping it relates to.
+
     Examples
     --------
     >>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()

Review Comment:
   Here, also we can just remove the existing one since now we have improved one ?



##########
python/pyspark/sql/functions.py:
##########
@@ -2301,13 +2532,46 @@ def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column:
 
     .. versionadded:: 3.2.0
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        first column to compute on.
+    cols : :class:`~pyspark.sql.Column` or str
+        other columns to compute on.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        distinct values of these two column values.
+
     Examples
     --------
     >>> df.agg(count_distinct(df.age, df.name).alias('c')).collect()
     [Row(c=2)]
 
     >>> df.agg(count_distinct("age", "name").alias('c')).collect()
     [Row(c=2)]

Review Comment:
   May be we can just remove the existing example since now we have a better one ?



##########
python/pyspark/sql/functions.py:
##########
@@ -2329,13 +2593,34 @@ def first(col: "ColumnOrName", ignorenulls: bool = False) -> Column:
     The function is non-deterministic because its results depends on the order of the
     rows which may be non-deterministic after a shuffle.
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column to fetch first value for.
+    ignorenulls : :class:`~pyspark.sql.Column` or str
+        if first value is null then look for first non-null value.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        first value of the group.
+
     Examples
     --------
-    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age"))
+    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ("name", "age"))
+    >>> df = df.orderBy(df.age)
     >>> df.groupby("name").agg(first("age")).orderBy("name").show()
     +-----+----------+
     | name|first(age)|
     +-----+----------+
+    |Alice|      null|
+    |  Bob|         5|
+    +-----+----------+
+
+    >>> df.groupby("name").agg(first("age", True)).orderBy("name").show()

Review Comment:
   Can we add a short description for this example why here we set the `ignorenulls` as `True` ??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

itholic commented on PR #37662:
URL: https://github.com/apache/spark/pull/37662#issuecomment-1227881595

   I think we should also check if `dev/lint-python` passed, not only `python/run-tests --testnames pyspark.sql.functions`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] khalidmammadov commented on a diff in pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

khalidmammadov commented on code in PR #37662:
URL: https://github.com/apache/spark/pull/37662#discussion_r955964936


##########
python/pyspark/sql/functions.py:
##########
@@ -2377,6 +2672,16 @@ def grouping_id(*cols: "ColumnOrName") -> Column:
     The list of columns should match with grouping columns exactly, or empty (means all
     the grouping columns).
 
+    Parameters
+    ----------
+    cols : :class:`~pyspark.sql.Column` or str
+        columns to check for.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        returns level of the grouping it relates to.
+
     Examples
     --------
     >>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()

Review Comment:
   Done



##########
python/pyspark/sql/functions.py:
##########
@@ -2440,6 +2804,38 @@ def last(col: "ColumnOrName", ignorenulls: bool = False) -> Column:
     -----
     The function is non-deterministic because its results depends on the order of the
     rows which may be non-deterministic after a shuffle.
+
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column to fetch last value for.
+    ignorenulls : :class:`~pyspark.sql.Column` or str
+        if last value is null then look for non-null value.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        last value of the group.
+
+    Examples
+    --------
+    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ("name", "age"))
+    >>> df = df.orderBy(df.age.desc())
+    >>> df.groupby("name").agg(last("age")).orderBy("name").show()
+    +-----+---------+
+    | name|last(age)|
+    +-----+---------+
+    |Alice|     null|
+    |  Bob|        5|
+    +-----+---------+
+
+    >>> df.groupby("name").agg(last("age", True)).orderBy("name").show()

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #37662:
URL: https://github.com/apache/spark/pull/37662#issuecomment-1229595335

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] khalidmammadov commented on pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

khalidmammadov commented on PR #37662:
URL: https://github.com/apache/spark/pull/37662#issuecomment-1228138151

   > I think we should also check if `dev/lint-python` passed, not only `python/run-tests --testnames pyspark.sql.functions`.
   
   Thanks, done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] khalidmammadov commented on pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

khalidmammadov commented on PR #37662:
URL: https://github.com/apache/spark/pull/37662#issuecomment-1228670798

   @HyukjinKwon FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)
URL: https://github.com/apache/spark/pull/37662


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] khalidmammadov commented on pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

khalidmammadov commented on PR #37662:
URL: https://github.com/apache/spark/pull/37662#issuecomment-1228137889

   > Would you like to keep the existing format for PR description ??
   
   Fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] khalidmammadov commented on a diff in pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

khalidmammadov commented on code in PR #37662:
URL: https://github.com/apache/spark/pull/37662#discussion_r955928703


##########
python/pyspark/sql/functions.py:
##########
@@ -2189,6 +2355,25 @@ def broadcast(df: DataFrame) -> DataFrame:
     Marks a DataFrame as small enough for use in broadcast joins.
 
     .. versionadded:: 1.6.0
+
+    Returns
+    -------
+    :class:`~pyspark.sql.DataFrame`
+        DataFrame marked as ready for broadcast join.
+
+    Examples
+    --------
+    >>> from pyspark.sql import types
+    >>> df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
+    >>> df_small = spark.range(3)
+    >>> df_b = broadcast(df_small)
+    >>> df.join(df_b, df.value == df_small.id).show()

Review Comment:
   The problem is that those IDs are subject to change from run to run and these docstring examples are run during tests/validations and setting them hardcoded will brake the builds.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] khalidmammadov commented on a diff in pull request #37662: [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 3, 28 functions)

Posted by GitBox <gi...@apache.org>.

khalidmammadov commented on code in PR #37662:
URL: https://github.com/apache/spark/pull/37662#discussion_r955964731


##########
python/pyspark/sql/functions.py:
##########
@@ -2301,13 +2532,46 @@ def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column:
 
     .. versionadded:: 3.2.0
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        first column to compute on.
+    cols : :class:`~pyspark.sql.Column` or str
+        other columns to compute on.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        distinct values of these two column values.
+
     Examples
     --------
     >>> df.agg(count_distinct(df.age, df.name).alias('c')).collect()
     [Row(c=2)]
 
     >>> df.agg(count_distinct("age", "name").alias('c')).collect()
     [Row(c=2)]

Review Comment:
   Removed



##########
python/pyspark/sql/functions.py:
##########
@@ -2329,13 +2593,34 @@ def first(col: "ColumnOrName", ignorenulls: bool = False) -> Column:
     The function is non-deterministic because its results depends on the order of the
     rows which may be non-deterministic after a shuffle.
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column to fetch first value for.
+    ignorenulls : :class:`~pyspark.sql.Column` or str
+        if first value is null then look for first non-null value.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        first value of the group.
+
     Examples
     --------
-    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age"))
+    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ("name", "age"))
+    >>> df = df.orderBy(df.age)
     >>> df.groupby("name").agg(first("age")).orderBy("name").show()
     +-----+----------+
     | name|first(age)|
     +-----+----------+
+    |Alice|      null|
+    |  Bob|         5|
+    +-----+----------+
+
+    >>> df.groupby("name").agg(first("age", True)).orderBy("name").show()

Review Comment:
   Added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org