Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/05 00:05:42 UTC

[GitHub] [spark] huaxingao opened a new pull request, #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

huaxingao opened a new pull request, #38904:
URL: https://github.com/apache/spark/pull/38904

   
   
   ### What changes were proposed in this pull request?
   Support column statistics in DS v2.
   
   
   ### Why are the changes needed?
   Currently only table stats are supported in DS v2. Column stats should be supported too.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   The `ColumnStatistics` interface is introduced and exposed as part of `Statistics`.
   
   ### How was this patch tested?
   Added a new test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042043454


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   I agree with you that these APIs could go unused in many DSv2 implementations for practical reasons, but the interface itself is useful.
   





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042468836


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {

Review Comment:
   Changed to `OptionalLong`. Thanks for the suggestion!
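
A minimal sketch of what the switch from `Optional<BigInteger>` to `OptionalLong` looks like for a caller. `ColumnStatsSketch`, `OptionalLongDemo`, and `equalitySelectivity` are hypothetical names used only for illustration; they are not part of the Spark API, and the selectivity formula is an assumption chosen to show how an absent value might be handled.

```java
import java.math.BigInteger;
import java.util.Optional;
import java.util.OptionalLong;

// Hypothetical stand-in for the interface under review; not the Spark source.
interface ColumnStatsSketch {
  // Earlier shape: boxed, arbitrary-precision count.
  default Optional<BigInteger> distinctCountBoxed() {
    return Optional.empty();
  }

  // Revised shape: primitive-friendly optional; empty still means "unknown".
  default OptionalLong distinctCount() {
    return OptionalLong.empty();
  }
}

class OptionalLongDemo {
  // Hypothetical helper: 1/NDV when the distinct count is known, 1.0 otherwise.
  static double equalitySelectivity(ColumnStatsSketch stats) {
    OptionalLong ndv = stats.distinctCount();
    return ndv.isPresent() && ndv.getAsLong() > 0 ? 1.0 / ndv.getAsLong() : 1.0;
  }

  public static void main(String[] args) {
    ColumnStatsSketch unknown = new ColumnStatsSketch() {};
    ColumnStatsSketch known = new ColumnStatsSketch() {
      @Override public OptionalLong distinctCount() { return OptionalLong.of(4L); }
    };
    System.out.println(equalitySelectivity(unknown)); // 1.0
    System.out.println(equalitySelectivity(known));   // 0.25
  }
}
```

With `OptionalLong` the caller reads the count through `getAsLong()` without unboxing or converting a `BigInteger`, at the cost of capping the value at `Long.MAX_VALUE`.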



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {

Review Comment:
   Changed. Thanks





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042469146


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala:
##########
@@ -18,11 +18,12 @@
 package org.apache.spark.sql.execution.datasources.v2
 
 import org.apache.spark.sql.catalyst.analysis.{MultiInstanceRelation, NamedRelation}
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression, SortOrder}
-import org.apache.spark.sql.catalyst.plans.logical.{ExposesMetadataColumns, LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Expression, SortOrder}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, ExposesMetadataColumns, Histogram, HistogramBin, LeafNode, LogicalPlan, Statistics}
 import org.apache.spark.sql.catalyst.util.{truncatedString, CharVarcharUtils}
 import org.apache.spark.sql.connector.catalog.{CatalogPlugin, FunctionCatalog, Identifier, MetadataColumn, SupportsMetadataColumns, Table, TableCapability}
-import org.apache.spark.sql.connector.read.{Scan, Statistics => V2Statistics, SupportsReportStatistics}
+import org.apache.spark.sql.connector.read.{Scan, SupportsReportStatistics}

Review Comment:
   Fixed. Thanks





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1039141790


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -30,6 +30,7 @@ import org.apache.spark.sql.connector.expressions.{Expression, FieldReference, L
 import org.apache.spark.sql.connector.expressions.filter.Predicate
 import org.apache.spark.sql.connector.read._
 import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning, UnknownPartitioning}
+import org.apache.spark.sql.connector.read.stats.Statistics

Review Comment:
   Could you revert this file change?





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042004906


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {

Review Comment:
   So, do you suggest `java.util.OptionalLong`?





[GitHub] [spark] sunchao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041451334


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;

Review Comment:
   I wonder if it's better to name the package `org.apache.spark.sql.connector.read.stats`



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import org.apache.spark.annotation.Evolving;
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  default OptionalLong avgLen() {

Review Comment:
   Could we add comments for each of these methods, since they are public APIs?



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java:
##########
@@ -31,4 +35,7 @@
 public interface Statistics {
   OptionalLong sizeInBytes();
   OptionalLong numRows();
+  default Optional<HashMap<NamedReference, ColumnStatistics>> columnStats() {

Review Comment:
   Why does it have to be a `HashMap` in the API? Can it just be `Map`?
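
A small sketch of why declaring `Map` rather than `HashMap` in a public signature can matter. All names here (`StatsBySketch`, `MapApiDemo`, `totalNulls`) are hypothetical, and `String` keys with `Long` values stand in for `NamedReference` and `ColumnStatistics` so the example stays self-contained.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interface; String keys stand in for NamedReference.
interface StatsBySketch {
  // Declaring Map (not HashMap) leaves the concrete type to the implementor.
  Map<String, Long> nullCounts();
}

class MapApiDemo {
  // The consumer depends only on the Map contract.
  static long totalNulls(StatsBySketch stats) {
    long total = 0;
    for (long n : stats.nullCounts().values()) {
      total += n;
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Long> sorted = new TreeMap<>();
    sorted.put("id", 0L);
    sorted.put("name", 7L);

    // Two implementors returning different Map types; neither is a HashMap.
    StatsBySketch fromTreeMap = () -> sorted;
    StatsBySketch fromImmutable = () -> Collections.singletonMap("id", 3L);

    System.out.println(totalNulls(fromTreeMap));   // 7
    System.out.println(totalNulls(fromImmutable)); // 3
  }
}
```

Had the signature required `HashMap`, both implementors above would need to copy their data into a new `HashMap` before returning it.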



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/Histogram.java:
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * An interface to represent an equi-height histogram, which is a part of
+ * {@link ColumnStatistics}. Equi-height histogram represents the distribution of
+ * a column's values by a sequence of bins.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface Histogram {
+  double height();

Review Comment:
   Ditto: please add more comments here.
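
For readers unfamiliar with the term, here is a rough sketch of how an equi-height histogram could be derived from raw values: bin boundaries are chosen so that each bin holds about the same number of rows, and the height is that per-bin row count. `EquiHeightDemo`, `boundaries`, and `height` are hypothetical helpers, not part of the proposed `Histogram` interface, and the quantile-index rule is one simple assumption among several possible ones.

```java
import java.util.Arrays;

class EquiHeightDemo {
  // Returns numBins + 1 boundaries; bin i covers bounds[i] up to bounds[i + 1].
  static double[] boundaries(double[] values, int numBins) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    double[] bounds = new double[numBins + 1];
    for (int i = 0; i <= numBins; i++) {
      // Position of the i-th quantile, clamped to the last element.
      int idx = Math.min(sorted.length - 1, (int) ((long) i * sorted.length / numBins));
      bounds[i] = sorted[idx];
    }
    return bounds;
  }

  // In an equi-height histogram every bin holds roughly this many rows.
  static double height(int rowCount, int numBins) {
    return (double) rowCount / numBins;
  }

  public static void main(String[] args) {
    double[] values = {8, 1, 6, 3, 2, 7, 4, 5};
    System.out.println(Arrays.toString(boundaries(values, 4))); // [1.0, 3.0, 5.0, 7.0, 8.0]
    System.out.println(height(values.length, 4));               // 2.0
  }
}
```

On skewed data the bins become narrow where values cluster and wide where they are sparse, which is what distinguishes this layout from an equi-width histogram.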





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1045728767


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2772,6 +2773,26 @@ class DataSourceV2SQLSuiteV1Filter
     }
   }
 
+  test("SPARK-41378: test column stats") {

Review Comment:
   This test fails with Scala 2.13:
   
   ```
   - SPARK-41378: test column stats *** FAILED *** (19 milliseconds)
     5 did not equal 3 (DataSourceV2SQLSuite.scala:2789)
     org.scalatest.exceptions.TestFailedException:
     at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
     at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
     at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
     at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
     at org.apache.spark.sql.connector.DataSourceV2SQLSuiteV1Filter$$anonfun$$nestedInanonfun$new$386$1.applyOrElse(DataSourceV2SQLSuite.scala:2789)
     at org.apache.spark.sql.connector.DataSourceV2SQLSuiteV1Filter$$anonfun$$nestedInanonfun$new$386$1.applyOrElse(DataSourceV2SQLSuite.scala:2782)
     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:338)
     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:334)
     at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$collect$1(TreeNode.scala:326)
     at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$collect$1$adapted(TreeNode.scala:326)
     at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:285)
     at org.apache.spark.sql.catalyst.trees.TreeNode.collect(TreeNode.scala:326)
     at org.apache.spark.sql.connector.DataSourceV2SQLSuiteV1Filter.$anonfun$new$386(DataSourceV2SQLSuite.scala:2782)
     at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
     at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
     at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
     at org.scalatest.Transformer.apply(Transformer.scala:22)
     at org.scalatest.Transformer.apply(Transformer.scala:20)
     at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
     at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
     at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
   ```
   
   https://github.com/apache/spark/actions/runs/3670384591/jobs/6204890447
   https://github.com/apache/spark/actions/runs/3665545037/jobs/6196700142
   https://github.com/apache/spark/actions/runs/3660066892/jobs/6186794437
   
   Mind taking a look please?





[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1044214341


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java:
##########
@@ -31,4 +35,7 @@
 public interface Statistics {
   OptionalLong sizeInBytes();
   OptionalLong numRows();
+  default Optional<Map<NamedReference, ColumnStatistics>> columnStats() {

Review Comment:
   Shall we use an empty map to indicate no column stats? Catalyst column stats also use a map directly.
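
A sketch of the empty-map alternative, under the assumption that "no column stats" and "an empty set of column stats" mean the same thing to callers. `TableStatsSketch`, `EmptyMapDemo`, and `reportedColumns` are hypothetical names; `String` and `Long` stand in for `NamedReference` and `ColumnStatistics`.

```java
import java.util.Collections;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch; String keys stand in for NamedReference.
interface TableStatsSketch {
  OptionalLong numRows();

  // An empty map already says "no column stats", so no Optional wrapper is needed.
  default Map<String, Long> columnStats() {
    return Collections.emptyMap();
  }
}

class EmptyMapDemo {
  // Callers use the map directly; there is no isPresent() check to forget.
  static int reportedColumns(TableStatsSketch stats) {
    return stats.columnStats().size();
  }

  public static void main(String[] args) {
    TableStatsSketch rowsOnly = () -> OptionalLong.of(10L);
    TableStatsSketch withColumns = new TableStatsSketch() {
      @Override public OptionalLong numRows() { return OptionalLong.of(10L); }
      @Override public Map<String, Long> columnStats() {
        return Collections.singletonMap("id", 0L);
      }
    };
    System.out.println(reportedColumns(rowsOnly));    // 0
    System.out.println(reportedColumns(withColumns)); // 1
  }
}
```

The trade-off is that an `Optional<Map<...>>` can distinguish "not reported" from "reported but empty", while a plain map cannot; the sketch assumes that distinction is not needed.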





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042043454


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   I agree with you that these APIs could go unused in many DSv2 implementations for practical reasons, but the interface itself is useful and has a concrete purpose.
   





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042041417


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   We are starting to use this the way Trino does, @cloud-fan. That's the reason for this proposal.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042011083


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   It's okay to introduce this new DSv2 API, isn't it? Since this new DSv2 interface lets new data sources and SQL extensions make use of it, it sounds like a good idea to add it in this PR. The default implementation, provided as-is, will not add a new burden for the other data sources.
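
   To illustrate the point about default implementations, here is a minimal sketch. It uses a hypothetical stand-in interface (`ColumnStatsSketch`) rather than the real Spark class, and it leaves out `histogram()` to stay self-contained. The class names `NoStatsSource` and `PartialStatsSource` are assumptions made up for this example.

```java
import java.math.BigInteger;
import java.util.Optional;
import java.util.OptionalLong;

// Hypothetical stand-in for the interface under review; not the Spark class itself.
interface ColumnStatsSketch {
  default Optional<BigInteger> distinctCount() { return Optional.empty(); }
  default Optional<BigInteger> nullCount() { return Optional.empty(); }
  default Optional<Object> min() { return Optional.empty(); }
  default Optional<Object> max() { return Optional.empty(); }
  default OptionalLong avgLen() { return OptionalLong.empty(); }
  default OptionalLong maxLen() { return OptionalLong.empty(); }
}

// A source that tracks no column statistics needs no method bodies at all.
final class NoStatsSource implements ColumnStatsSketch {}

// A source that only knows NDV and null count overrides just those two methods.
final class PartialStatsSource implements ColumnStatsSketch {
  private final long ndv;
  private final long nulls;

  PartialStatsSource(long ndv, long nulls) {
    this.ndv = ndv;
    this.nulls = nulls;
  }

  @Override
  public Optional<BigInteger> distinctCount() {
    return Optional.of(BigInteger.valueOf(ndv));
  }

  @Override
  public Optional<BigInteger> nullCount() {
    return Optional.of(BigInteger.valueOf(nulls));
  }
}

final class DefaultStatsDemo {
  public static void main(String[] args) {
    ColumnStatsSketch none = new NoStatsSource();
    ColumnStatsSketch partial = new PartialStatsSource(42L, 3L);
    // Statistics that were not overridden stay empty.
    System.out.println(none.distinctCount().isPresent());
    System.out.println(partial.distinctCount().get());
    System.out.println(partial.maxLen().isPresent());
  }
}
```

   A caller can treat every statistic as optional and fall back to its own estimate when a value is absent, so existing sources keep working unchanged.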





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1039161555


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2772,6 +2773,39 @@ class DataSourceV2SQLSuiteV1Filter
     }
   }
 
+  test("SPARK-XXXXX: test column stats") {

Review Comment:
   Fixed. Thanks





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1039177239


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -273,7 +275,24 @@ abstract class InMemoryBaseTable(
     }
   }
 
-  case class InMemoryStats(sizeInBytes: OptionalLong, numRows: OptionalLong) extends Statistics
+  case class InMemoryStats(
+      sizeInBytes: OptionalLong,
+      numRows: OptionalLong,
+      override val columnStats: Optional[HashMap[NamedReference, ColumnStatistics]])
+    extends Statistics
+
+  case class InMemoryColumnStats (
+      override val distinctCount: Optional[BigInteger],

Review Comment:
   Should this be 4-space indentation?
   I have an extra space after `InMemoryColumnStats`; I will remove it.





[GitHub] [spark] dongjoon-hyun commented on pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38904:
URL: https://github.com/apache/spark/pull/38904#issuecomment-1336797980

   Thank you for the updates.




[GitHub] [spark] huaxingao commented on pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #38904:
URL: https://github.com/apache/spark/pull/38904#issuecomment-1336843298

   also cc @cloud-fan 




[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042043395


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   OK let's keep it.





[GitHub] [spark] huaxingao commented on pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #38904:
URL: https://github.com/apache/spark/pull/38904#issuecomment-1341615389

   Thank you all very much!




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1039140899


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2772,6 +2773,39 @@ class DataSourceV2SQLSuiteV1Filter
     }
   }
 
+  test("SPARK-XXXXX: test column stats") {

Review Comment:
   Could you use SPARK-41378 instead of SPARK-XXXXX?





[GitHub] [spark] dongjoon-hyun commented on pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38904:
URL: https://github.com/apache/spark/pull/38904#issuecomment-1341608721

   Merged to master for Apache Spark 3.4.0. Thank you, @huaxingao and all!




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042043454


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   I agree with you that these APIs could go unused in many DSv2 sources for practical reasons, but the interface itself is useful and has a concrete purpose.
   





[GitHub] [spark] LuciferYang commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042048712


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {

Review Comment:
   +1





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1046280138


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2772,6 +2773,26 @@ class DataSourceV2SQLSuiteV1Filter
     }
   }
 
+  test("SPARK-41378: test column stats") {

Review Comment:
   Here is the follow-up PR to fix Scala 2.13.
   - https://github.com/apache/spark/pull/39038





[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1044220536


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -294,7 +307,39 @@ abstract class InMemoryBaseTable(
       val objectHeaderSizeInBytes = 12L
       val rowSizeInBytes = objectHeaderSizeInBytes + schema.defaultSize
       val sizeInBytes = numRows * rowSizeInBytes
-      InMemoryStats(OptionalLong.of(sizeInBytes), OptionalLong.of(numRows))
+
+      val numOfCols = tableSchema.fields.length
+      val dataTypes = tableSchema.fields.map(_.dataType)
+      val colValueSets = new Array[util.HashSet[Object]](numOfCols)
+      val numOfNulls = new Array[Long](numOfCols)
+      for (i <- 0 until numOfCols) {
+        colValueSets(i) = new util.HashSet[Object]
+      }
+
+      inputPartitions.foreach(inputPartition =>
+        inputPartition.rows.foreach(row =>
+          for (i <- 0 until numOfCols) {
+            colValueSets(i).add(row.get(i, dataTypes(i)))
+            if (row.isNullAt(i)) {
+              numOfNulls(i) += 1
+            }
+          }
+        )
+      )
+
+      val map = new util.HashMap[NamedReference, ColumnStatistics]()
+      val colNames = tableSchema.fields.map(_.name)
+      var i = 0
+      for (col <- colNames) {
+        val fieldReference = FieldReference(col)

Review Comment:
   Use `FieldReference.column(col)`, as it's a plain column name, while `FieldReference.apply` parses the string.
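
   The difference matters for names that contain a dot. The following is a rough sketch with a hypothetical stand-in class (`FieldRefSketch`), not Spark's `FieldReference`. Its `parse` method simply splits on dots and ignores quoting, which is an assumption made to keep the example short.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for illustration only. The real class relies on a proper
// identifier parser and handles quoting, which this sketch does not.
final class FieldRefSketch {
  final List<String> parts;

  private FieldRefSketch(List<String> parts) {
    this.parts = parts;
  }

  // Mimics "parse the string": dots separate the parts of a nested name.
  static FieldRefSketch parse(String name) {
    return new FieldRefSketch(Arrays.asList(name.split("\\.")));
  }

  // Mimics "plain column name": the whole string is a single name part.
  static FieldRefSketch column(String name) {
    return new FieldRefSketch(Collections.singletonList(name));
  }

  public static void main(String[] args) {
    System.out.println(parse("a.b").parts);   // two name parts
    System.out.println(column("a.b").parts);  // one name part
  }
}
```

   Because the two references hold different name parts, they would not compare equal as map keys, so a lookup by plain column name could miss an entry stored under the parsed form.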





[GitHub] [spark] dongjoon-hyun closed pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2
URL: https://github.com/apache/spark/pull/38904




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1039174225


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -273,7 +275,24 @@ abstract class InMemoryBaseTable(
     }
   }
 
-  case class InMemoryStats(sizeInBytes: OptionalLong, numRows: OptionalLong) extends Statistics
+  case class InMemoryStats(
+      sizeInBytes: OptionalLong,
+      numRows: OptionalLong,
+      override val columnStats: Optional[HashMap[NamedReference, ColumnStatistics]])
+    extends Statistics
+
+  case class InMemoryColumnStats (
+      override val distinctCount: Optional[BigInteger],

Review Comment:
   Indentation?





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041841165


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -294,7 +313,30 @@ abstract class InMemoryBaseTable(
       val objectHeaderSizeInBytes = 12L
       val rowSizeInBytes = objectHeaderSizeInBytes + schema.defaultSize
       val sizeInBytes = numRows * rowSizeInBytes
-      InMemoryStats(OptionalLong.of(sizeInBytes), OptionalLong.of(numRows))
+
+      val map = new util.HashMap[NamedReference, ColumnStatistics]()
+      val colNames = readSchema.fields.map(_.name)
+      for (col <- colNames) {
+        val fieldReference = FieldReference(col)
+        // put some fake data for testing only
+        val bin1 = InMemoryHistogramBin(1, 2, 5L)
+        val bin2 = InMemoryHistogramBin(3, 4, 5L)
+        val bin3 = InMemoryHistogramBin(5, 6, 5L)
+        val bin4 = InMemoryHistogramBin(7, 8, 5L)
+        val bin5 = InMemoryHistogramBin(9, 10, 5L)

Review Comment:
   I removed the fake data and computed NDV and null count for testing purposes.





[GitHub] [spark] viirya commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041483047


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -294,7 +313,30 @@ abstract class InMemoryBaseTable(
       val objectHeaderSizeInBytes = 12L
       val rowSizeInBytes = objectHeaderSizeInBytes + schema.defaultSize
       val sizeInBytes = numRows * rowSizeInBytes
-      InMemoryStats(OptionalLong.of(sizeInBytes), OptionalLong.of(numRows))
+
+      val map = new util.HashMap[NamedReference, ColumnStatistics]()
+      val colNames = readSchema.fields.map(_.name)
+      for (col <- colNames) {
+        val fieldReference = FieldReference(col)
+        // put some fake data for testing only
+        val bin1 = InMemoryHistogramBin(1, 2, 5L)
+        val bin2 = InMemoryHistogramBin(3, 4, 5L)
+        val bin3 = InMemoryHistogramBin(5, 6, 5L)
+        val bin4 = InMemoryHistogramBin(7, 8, 5L)
+        val bin5 = InMemoryHistogramBin(9, 10, 5L)

Review Comment:
   Hmm, I'm not sure whether fake statistics will cause unexpected results later. Ideally we should compute real statistics, like `sizeInBytes` and `numRows`, from `data`.
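
   As a rough sketch of what computing real per-column statistics from the rows could look like, the example below derives a distinct count and a null count in one pass. It is a standalone illustration, not the `InMemoryBaseTable` code: rows are modeled as plain `Object[]` arrays, the name `RowStatsSketch` is hypothetical, and excluding nulls from the distinct count is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RowStatsSketch {
  // For each column, returns a pair {distinct non-null values, null count}.
  static long[][] compute(List<Object[]> rows, int numCols) {
    List<Set<Object>> distinct = new ArrayList<>();
    long[] nulls = new long[numCols];
    for (int i = 0; i < numCols; i++) {
      distinct.add(new HashSet<>());
    }
    // Single pass over the data: count nulls and collect non-null values.
    for (Object[] row : rows) {
      for (int i = 0; i < numCols; i++) {
        if (row[i] == null) {
          nulls[i]++;
        } else {
          distinct.get(i).add(row[i]);
        }
      }
    }
    long[][] stats = new long[numCols][];
    for (int i = 0; i < numCols; i++) {
      stats[i] = new long[] {distinct.get(i).size(), nulls[i]};
    }
    return stats;
  }

  public static void main(String[] args) {
    List<Object[]> rows = Arrays.asList(
        new Object[] {1, "a"},
        new Object[] {1, null},
        new Object[] {2, "a"});
    for (long[] columnStats : compute(rows, 2)) {
      System.out.println(Arrays.toString(columnStats));
    }
  }
}
```

   Holding every value in a set is acceptable for small in-memory test tables; a real source would more likely rely on precomputed or approximate statistics.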





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041841092


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java:
##########
@@ -31,4 +35,7 @@
 public interface Statistics {
   OptionalLong sizeInBytes();
   OptionalLong numRows();
+  default Optional<HashMap<NamedReference, ColumnStatistics>> columnStats() {

Review Comment:
   Changed. Thanks



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -294,7 +313,30 @@ abstract class InMemoryBaseTable(
       val objectHeaderSizeInBytes = 12L
       val rowSizeInBytes = objectHeaderSizeInBytes + schema.defaultSize
       val sizeInBytes = numRows * rowSizeInBytes
-      InMemoryStats(OptionalLong.of(sizeInBytes), OptionalLong.of(numRows))
+
+      val map = new util.HashMap[NamedReference, ColumnStatistics]()
+      val colNames = readSchema.fields.map(_.name)
+      for (col <- colNames) {
+        val fieldReference = FieldReference(col)
+        // put some fake data for testing only
+        val bin1 = InMemoryHistogramBin(1, 2, 5L)
+        val bin2 = InMemoryHistogramBin(3, 4, 5L)
+        val bin3 = InMemoryHistogramBin(5, 6, 5L)
+        val bin4 = InMemoryHistogramBin(7, 8, 5L)
+        val bin5 = InMemoryHistogramBin(9, 10, 5L)

Review Comment:
   I computed NDV and null count for testing purposes.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042033399


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   I don't have a strong opinion here, but it's better to not add useless APIs.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041880121


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {

Review Comment:
   ditto





[GitHub] [spark] LuciferYang commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042062792


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala:
##########
@@ -18,11 +18,12 @@
 package org.apache.spark.sql.execution.datasources.v2
 
 import org.apache.spark.sql.catalyst.analysis.{MultiInstanceRelation, NamedRelation}
-import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression, SortOrder}
-import org.apache.spark.sql.catalyst.plans.logical.{ExposesMetadataColumns, LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Expression, SortOrder}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, ExposesMetadataColumns, Histogram, HistogramBin, LeafNode, LogicalPlan, Statistics}
 import org.apache.spark.sql.catalyst.util.{truncatedString, CharVarcharUtils}
 import org.apache.spark.sql.connector.catalog.{CatalogPlugin, FunctionCatalog, Identifier, MetadataColumn, SupportsMetadataColumns, Table, TableCapability}
-import org.apache.spark.sql.connector.read.{Scan, Statistics => V2Statistics, SupportsReportStatistics}
+import org.apache.spark.sql.connector.read.{Scan, SupportsReportStatistics}

Review Comment:
   This change just splits one import line into two lines?
   
   





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1039216766


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -273,7 +275,24 @@ abstract class InMemoryBaseTable(
     }
   }
 
-  case class InMemoryStats(sizeInBytes: OptionalLong, numRows: OptionalLong) extends Statistics
+  case class InMemoryStats(
+      sizeInBytes: OptionalLong,
+      numRows: OptionalLong,
+      override val columnStats: Optional[HashMap[NamedReference, ColumnStatistics]])
+    extends Statistics
+
+  case class InMemoryColumnStats (
+      override val distinctCount: Optional[BigInteger],

Review Comment:
   Ah, you are right. This is a `class` definition.





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041840770


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import org.apache.spark.annotation.Evolving;
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  default OptionalLong avgLen() {

Review Comment:
   Comments added. Thanks





[GitHub] [spark] huaxingao commented on pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #38904:
URL: https://github.com/apache/spark/pull/38904#issuecomment-1340511363

   > Also curious how this is to be used by Spark
   
   
   The newly added `ColumnStatistics` is converted to the logical `ColumnStat` in this [method](https://github.com/apache/spark/blob/0cddab9a618dc185efc2424ea934af5aa565a213/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L213) and is then used by the cost-based optimizer (CBO).
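
   To make the shape of the API concrete, here is a minimal, self-contained sketch of the default-method pattern used by `ColumnStatistics`. It is not code from this PR: `ColumnStatsLike` redeclares only a subset of the quoted interface so the snippet compiles without Spark, and `NdvOnlyStats` and `StatsDescriber` are hypothetical names. A source overrides only the statistics it knows, and a consumer treats every empty optional as "unknown".

```java
import java.math.BigInteger;
import java.util.Optional;
import java.util.OptionalLong;

// Local stand-in for a subset of the ColumnStatistics interface quoted in this
// thread (assumption: redeclared here so the example does not depend on Spark).
interface ColumnStatsLike {
  default Optional<BigInteger> distinctCount() { return Optional.empty(); }
  default Optional<BigInteger> nullCount() { return Optional.empty(); }
  default OptionalLong maxLen() { return OptionalLong.empty(); }
}

// Hypothetical source that only knows the number of distinct values.
final class NdvOnlyStats implements ColumnStatsLike {
  private final long ndv;

  NdvOnlyStats(long ndv) { this.ndv = ndv; }

  @Override
  public Optional<BigInteger> distinctCount() {
    return Optional.of(BigInteger.valueOf(ndv));
  }
}

// Hypothetical consumer: reads each optional and falls back to "unknown".
final class StatsDescriber {
  static String describe(ColumnStatsLike s) {
    String ndv = s.distinctCount().map(v -> v.toString()).orElse("unknown");
    String nulls = s.nullCount().map(v -> v.toString()).orElse("unknown");
    String maxLen = s.maxLen().isPresent()
        ? Long.toString(s.maxLen().getAsLong())
        : "unknown";
    return "ndv=" + ndv + ", nulls=" + nulls + ", maxLen=" + maxLen;
  }
}
```

   With this sketch, `StatsDescriber.describe(new NdvOnlyStats(42))` yields `ndv=42, nulls=unknown, maxLen=unknown`. A real consumer would map the empty optionals to missing fields on the logical side instead of strings.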




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042046890


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   Thank you!





[GitHub] [spark] viirya commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041429490


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import org.apache.spark.annotation.Evolving;
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;

Review Comment:
   Import order looks incorrect.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041879681


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {

Review Comment:
   In CBO, we need the distinct count as a `BigInteger` because the estimated row count can be very large due to join, generate, etc. But for a single table, do we really need `BigInteger`?
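
   A small sketch of the arithmetic behind this question (illustrative only, not code from this PR; `RowCountEstimate` and its methods are hypothetical names). A per-table count fits in a `long`, but a worst-case join estimate is a product of such counts and can exceed 64 bits:

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Spark: contrasts a per-table row count,
// which fits in a long, with a cross-join upper bound, which may not.
final class RowCountEstimate {

  // Worst-case output rows of a join: left rows multiplied by right rows.
  static BigInteger crossJoinUpperBound(long leftRows, long rightRows) {
    return BigInteger.valueOf(leftRows).multiply(BigInteger.valueOf(rightRows));
  }

  // True when the value no longer fits in a signed 64-bit long.
  static boolean overflowsLong(BigInteger estimate) {
    return estimate.bitLength() > 63;
  }
}
```

   For example, two tables of 5,000,000,000 rows each give an upper bound of 25,000,000,000,000,000,000 rows, for which `overflowsLong` returns `true`, while each input count on its own is far below `Long.MAX_VALUE`.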





[GitHub] [spark] viirya commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041484930


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -294,7 +313,30 @@ abstract class InMemoryBaseTable(
       val objectHeaderSizeInBytes = 12L
       val rowSizeInBytes = objectHeaderSizeInBytes + schema.defaultSize
       val sizeInBytes = numRows * rowSizeInBytes
-      InMemoryStats(OptionalLong.of(sizeInBytes), OptionalLong.of(numRows))
+
+      val map = new util.HashMap[NamedReference, ColumnStatistics]()
+      val colNames = readSchema.fields.map(_.name)
+      for (col <- colNames) {
+        val fieldReference = FieldReference(col)
+        // put some fake data for testing only
+        val bin1 = InMemoryHistogramBin(1, 2, 5L)
+        val bin2 = InMemoryHistogramBin(3, 4, 5L)
+        val bin3 = InMemoryHistogramBin(5, 6, 5L)
+        val bin4 = InMemoryHistogramBin(7, 8, 5L)
+        val bin5 = InMemoryHistogramBin(9, 10, 5L)

Review Comment:
   If it's too complicated, maybe we can just compute max/min for testing purposes.
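
   A minimal sketch of what computing real values could look like, assuming an integer column held in memory. `SimpleColumnStats` is a hypothetical helper written for illustration; it is not the test code in this PR and does not use the Spark classes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical test helper: derives min, max, null count and distinct count
// from the actual values instead of hard-coding fake numbers.
final class SimpleColumnStats {
  final Optional<Integer> min;
  final Optional<Integer> max;
  final long nullCount;
  final long distinctCount;

  private SimpleColumnStats(
      Optional<Integer> min, Optional<Integer> max, long nullCount, long distinctCount) {
    this.min = min;
    this.max = max;
    this.nullCount = nullCount;
    this.distinctCount = distinctCount;
  }

  // Single pass over the column; null entries only contribute to nullCount.
  static SimpleColumnStats of(List<Integer> values) {
    Integer min = null;
    Integer max = null;
    long nulls = 0;
    Set<Integer> distinct = new HashSet<>();
    for (Integer v : values) {
      if (v == null) {
        nulls++;
        continue;
      }
      if (min == null || v < min) min = v;
      if (max == null || v > max) max = v;
      distinct.add(v);
    }
    // min and max stay empty when the column has no non-null values.
    return new SimpleColumnStats(
        Optional.ofNullable(min), Optional.ofNullable(max), nulls, distinct.size());
  }
}
```

   For the column `[3, null, 1, 3, 7]` this gives min 1, max 7, one null, and three distinct values, which a test can assert against directly.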





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041840929


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;

Review Comment:
   The classes inside this package are for column stats. There is one existing class, `Statistics`, which I can't move into the new package, so it's probably better to call the package `colstats`.



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/Histogram.java:
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * An interface to represent an equi-height histogram, which is a part of
+ * {@link ColumnStatistics}. Equi-height histogram represents the distribution of
+ * a column's values by a sequence of bins.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface Histogram {
+  double height();

Review Comment:
   Added
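
   For readers unfamiliar with the term in the Javadoc above, here is a rough sketch of an equi-height histogram: every bin covers the same number of rows, and only the bin bounds vary. This is illustrative only. `EquiHeightSketch` and `Bin` are hypothetical names, and the `lo`/`hi`/`ndv` fields are an assumption modeled on the three-argument fake bins used in the test code of this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of an equi-height histogram, not Spark code.
final class EquiHeightSketch {

  // One bin: inclusive bounds plus the number of distinct values inside it.
  static final class Bin {
    final double lo;
    final double hi;
    final long ndv;

    Bin(double lo, double hi, long ndv) {
      this.lo = lo;
      this.hi = hi;
      this.ndv = ndv;
    }
  }

  // Sorts a copy of the values and cuts it into numBins runs of equal length.
  // Assumes values.length is a positive multiple of numBins.
  static List<Bin> build(double[] values, int numBins) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int height = sorted.length / numBins; // rows per bin, identical for all bins
    List<Bin> bins = new ArrayList<>();
    for (int b = 0; b < numBins; b++) {
      int from = b * height;
      int to = from + height; // exclusive
      long ndv = 1;
      for (int i = from + 1; i < to; i++) {
        if (sorted[i] != sorted[i - 1]) ndv++;
      }
      bins.add(new Bin(sorted[from], sorted[to - 1], ndv));
    }
    return bins;
  }
}
```

   Building five bins over the values 1 through 10 gives the bounds (1, 2), (3, 4), (5, 6), (7, 8) and (9, 10), each with a height of two rows and two distinct values.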





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041840529


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import org.apache.spark.annotation.Evolving;
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;

Review Comment:
   Fixed. Thanks





[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1041881179


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return minimum value in the column
+   */
+  default Optional<Object> min() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return maximum value in the column
+   */
+  default Optional<Object> max() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return number of nulls in the column
+   */
+  default Optional<BigInteger> nullCount() {
+    return Optional.empty();
+  }
+
+  /**
+   * @return average length of the values in the column
+   */
+  default OptionalLong avgLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return maximum length of the values in the column
+   */
+  default OptionalLong maxLen() {
+    return OptionalLong.empty();
+  }
+
+  /**
+   * @return histogram of the values in the column
+   */
+  default Optional<Histogram> histogram() {

Review Comment:
   Do you use histograms in practice? We never use them...





[GitHub] [spark] cloud-fan commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042030102


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {

Review Comment:
   Yup





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1042031796


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read.colstats;
+
+import java.math.BigInteger;
+import java.util.Optional;
+import java.util.OptionalLong;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.Statistics;
+
+/**
+ * An interface to represent column statistics, which is part of
+ * {@link Statistics}.
+ *
+ * @since 3.4.0
+ */
+@Evolving
+public interface ColumnStatistics {
+
+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {

Review Comment:
   +1





[GitHub] [spark] huaxingao commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1039161731


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -30,6 +30,7 @@ import org.apache.spark.sql.connector.expressions.{Expression, FieldReference, L
 import org.apache.spark.sql.connector.expressions.filter.Predicate
 import org.apache.spark.sql.connector.read._
 import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning, UnknownPartitioning}
+import org.apache.spark.sql.connector.read.stats.Statistics

Review Comment:
   Fixed. Thanks





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38904: [SPARK-41378][SQL] Support Column Stats in DS v2

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #38904:
URL: https://github.com/apache/spark/pull/38904#discussion_r1046110437


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##########
@@ -2772,6 +2773,26 @@ class DataSourceV2SQLSuiteV1Filter
     }
   }
 
+  test("SPARK-41378: test column stats") {

Review Comment:
   Let me take a look at this.


