Posted to reviews@spark.apache.org by "allisonwang-db (via GitHub)" <gi...@apache.org> on 2023/08/22 16:49:55 UTC

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

allisonwang-db commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1301927648


##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.
 
-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.
 
-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
 
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.
 
     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`
 
     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")

Review Comment:
   Nit: instead of using `toDF`, can we specify the schema of the dataframe when creating it?
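   For reference, one way the suggestion could be applied is to pass the column name and
   type through the schema rather than renaming with `toDF` (a hedged sketch; the data and
   the `fruit` column name are taken from the example under review):

   >>> # Sketch: supply a DDL schema string to createDataFrame instead of calling toDF
   >>> df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], "fruit string")
   >>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
   +---------------+
   |distinct_fruits|
   +---------------+
   |              3|
   +---------------+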



##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.
 
-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.
 
-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
 
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.
 
     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`
 
     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")

Review Comment:
   Can we use lower case for `INT` here?
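   A minimal sketch of that tweak (the type name in the schema string is case-insensitive,
   so only the rendered docstring changes, not the behavior):

   >>> # Sketch: same example with a lowercase type name
   >>> df = spark.createDataFrame([1, 2, 2, 3], "int")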



##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.
 
-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

Review Comment:
   I think this is a bit redundant. Can we combine it with the previous sentence?



##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.
 
-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.
 
-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
 
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.
 
     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`
 
     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
+    >>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+    +---------------+
+    |distinct_fruits|
+    +---------------+
+    |              3|
+    +---------------+
+
+    Example 3: Counting distinct values in a DataFrame with multiple columns
+
+    >>> from pyspark.sql.functions import approx_count_distinct, struct
+    >>> df = spark.createDataFrame([("Alice", 1),
+    ...                             ("Alice", 2),
+    ...                             ("Bob", 3),
+    ...                             ("Bob", 3)], ["name", "value"])
+    >>> df = df.withColumn("combined", struct("name", "value"))
+    >>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()

Review Comment:
   Can we make this example consistent with the others by using `approx_count_distinct("combined")`?
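   A sketch of what the consistent version might look like, passing the struct column by
   name as in the other examples (the output is assumed to match the table shown in the
   example under review):

   >>> # Sketch: reference the "combined" column by name instead of df["combined"]
   >>> df.agg(approx_count_distinct("combined").alias('distinct_pairs')).show()
   +--------------+
   |distinct_pairs|
   +--------------+
   |             3|
   +--------------+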



##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.
 
-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.
 
-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
 
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.
 
     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`
 
     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
+    >>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+    +---------------+
+    |distinct_fruits|
+    +---------------+
+    |              3|
+    +---------------+
+
+    Example 3: Counting distinct values in a DataFrame with multiple columns
+
+    >>> from pyspark.sql.functions import approx_count_distinct, struct
+    >>> df = spark.createDataFrame([("Alice", 1),
+    ...                             ("Alice", 2),
+    ...                             ("Bob", 3),
+    ...                             ("Bob", 3)], ["name", "value"])
+    >>> df = df.withColumn("combined", struct("name", "value"))
+    >>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()
+    +--------------+
+    |distinct_pairs|
+    +--------------+
+    |             3|
+    +--------------+
+
+    Example 4: Counting distinct values with a specified relative standard deviation
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.range(100000)
+    >>> df.agg(approx_count_distinct("id", 0.1).alias('distinct_values')).show()

Review Comment:
   Can we compare the results of `approx_count_distinct` when 1) using the default rsd and 2) using 0.1 in this example?
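   One way such a comparison could be sketched (no output table is shown here, since the
   exact estimates depend on the rsd setting; the alias names are illustrative):

   >>> # Sketch: estimate the same column with the default rsd (0.05) and with rsd=0.1
   >>> df = spark.range(100000)
   >>> df.agg(
   ...     approx_count_distinct("id").alias('with_default_rsd'),
   ...     approx_count_distinct("id", 0.1).alias('with_rsd_0_1')
   ... ).show()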



##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.
 
-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.
 
-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
 
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.
 
     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`
 
     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct

Review Comment:
   Yeah, I would also prefer `from pyspark.sql.functions import approx_count_distinct`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

