You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by li...@apache.org on 2017/03/21 15:46:02 UTC

spark git commit: [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile

Repository: spark
Updated Branches:
  refs/heads/master 14865d7ff -> 63f077fbe


[SPARK-20041][DOC] Update docs for NaN handling in approxQuantile

## What changes were proposed in this pull request?
Update docs for NaN handling in approxQuantile.

## How was this patch tested?
existing tests.

Author: Zheng RuiFeng <ru...@foxmail.com>

Closes #17369 from zhengruifeng/doc_quantiles_nan.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/63f077fb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/63f077fb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/63f077fb

Branch: refs/heads/master
Commit: 63f077fbe50b4094340e9915db41d7dbdba52975
Parents: 14865d7
Author: Zheng RuiFeng <ru...@foxmail.com>
Authored: Tue Mar 21 08:45:59 2017 -0700
Committer: Xiao Li <ga...@gmail.com>
Committed: Tue Mar 21 08:45:59 2017 -0700

----------------------------------------------------------------------
 R/pkg/R/stats.R                                                  | 3 ++-
 .../scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala  | 4 ++--
 python/pyspark/sql/dataframe.py                                  | 3 ++-
 3 files changed, 6 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/63f077fb/R/pkg/R/stats.R
----------------------------------------------------------------------
diff --git a/R/pkg/R/stats.R b/R/pkg/R/stats.R
index 8d1d165..d78a108 100644
--- a/R/pkg/R/stats.R
+++ b/R/pkg/R/stats.R
@@ -149,7 +149,8 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"),
 #' This method implements a variation of the Greenwald-Khanna algorithm (with some speed
 #' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670
 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna.
-#' Note that rows containing any NA values will be removed before calculation.
+#' Note that NA values will be ignored in numerical columns before calculation. For
+#'   columns only containing NA values, an empty list is returned.
 #'
 #' @param x A SparkDataFrame.
 #' @param cols A single column name, or a list of names for multiple columns.

http://git-wip-us.apache.org/repos/asf/spark/blob/63f077fb/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index 80c7f55..feceeba 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -93,8 +93,8 @@ private[feature] trait QuantileDiscretizerBase extends Params
  * are too few distinct values of the input to create enough distinct quantiles.
  *
  * NaN handling:
- * NaN values will be removed from the column during `QuantileDiscretizer` fitting. This will
- * produce a `Bucketizer` model for making predictions. During the transformation,
+ * null and NaN values will be ignored from the column during `QuantileDiscretizer` fitting. This
+ * will produce a `Bucketizer` model for making predictions. During the transformation,
  * `Bucketizer` will raise an error when it finds NaN values in the dataset, but the user can
  * also choose to either keep or remove NaN values within the dataset by setting `handleInvalid`.
  * If the user chooses to keep NaN values, they will be handled specially and placed into their own

http://git-wip-us.apache.org/repos/asf/spark/blob/63f077fb/python/pyspark/sql/dataframe.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index bb6df22..a24512f 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1384,7 +1384,8 @@ class DataFrame(object):
         Space-efficient Online Computation of Quantile Summaries]]
         by Greenwald and Khanna.
 
-        Note that rows containing any null values will be removed before calculation.
+        Note that null values will be ignored in numerical columns before calculation.
+        For columns only containing null values, an empty list is returned.
 
         :param col: str, list.
           Can be a single column name, or a list of names for multiple columns.


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org