You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by sr...@apache.org on 2022/08/03 13:38:32 UTC
[spark] branch master updated: [SPARK-39928][CORE] Optimize `Utils.getIteratorSize` for Scala 2.13 refer to `IterableOnceOps.size`
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 5412688cd69 [SPARK-39928][CORE] Optimize `Utils.getIteratorSize` for Scala 2.13 refer to `IterableOnceOps.size`
5412688cd69 is described below
commit 5412688cd69792d804ba04ef43104e58c208e26c
Author: yangjie01 <ya...@baidu.com>
AuthorDate: Wed Aug 3 08:38:14 2022 -0500
[SPARK-39928][CORE] Optimize `Utils.getIteratorSize` for Scala 2.13 refer to `IterableOnceOps.size`
### What changes were proposed in this pull request?
`Utils.getIteratorSize` is slightly slower than `Iterator.size` when using Scala 2.13, but it seems that we can't call `Iterator.size` directly, because `Utils.getIteratorSize` method returns `Long` type data, using `Iterator.size` directly can't ensure that overflow will not occur.
So this pr adds `if (iterator.knownSize > 0)` conditional branches to optimize this method when using Scala 2.13, this optimization way refers to `IterableOnceOps.size`:
https://github.com/scala/scala/blob/cbc012e573346dc685c478eec5fcbb56d22ea884/src/library/scala/collection/IterableOnce.scala#L835-L849
<img width="553" alt="image" src="https://user-images.githubusercontent.com/1475305/182060742-c56ead62-4164-4633-a501-48e2e3151752.png">
This optimization way is only suitable for Scala 2.13 and before using Scala 2.13, `Utils.getIteratorSize` is not slower than `Iterator.size`.
### Why are the changes needed?
Optimize `Utils.getIteratorSize` for Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass GitHub Actions
Closes #37353 from LuciferYang/Optimize-getIteratorSize-213.
Authored-by: yangjie01 <ya...@baidu.com>
Signed-off-by: Sean Owen <sr...@gmail.com>
---
.../org/apache/spark/util/Iterators.scala | 35 +++++++++++++++++++
.../org/apache/spark/util/Iterators.scala | 40 ++++++++++++++++++++++
.../main/scala/org/apache/spark/util/Utils.scala | 13 ++-----
3 files changed, 77 insertions(+), 11 deletions(-)
diff --git a/core/src/main/scala-2.12/org/apache/spark/util/Iterators.scala b/core/src/main/scala-2.12/org/apache/spark/util/Iterators.scala
new file mode 100644
index 00000000000..59530b4ba27
--- /dev/null
+++ b/core/src/main/scala-2.12/org/apache/spark/util/Iterators.scala
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+object Iterators {
+
+ /**
+ * Counts the number of elements of an iterator using a while loop rather than calling
+ * [[scala.collection.Iterator#size]] because it uses a for loop, which is slightly slower
+ * in the current version of Scala.
+ */
+ def size(iterator: Iterator[_]): Long = {
+ var count = 0L
+ while (iterator.hasNext) {
+ count += 1L
+ iterator.next()
+ }
+ count
+ }
+}
diff --git a/core/src/main/scala-2.13/org/apache/spark/util/Iterators.scala b/core/src/main/scala-2.13/org/apache/spark/util/Iterators.scala
new file mode 100644
index 00000000000..0ffba8d13a3
--- /dev/null
+++ b/core/src/main/scala-2.13/org/apache/spark/util/Iterators.scala
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+object Iterators {
+
+ /**
+ * Counts the number of elements of an iterator.
+ * This method is slower than `iterator.size` when using Scala 2.13,
+ * but it can avoid overflowing problem.
+ */
+ def size(iterator: Iterator[_]): Long = {
+ // SPARK-39928: For Scala 2.13, add check of `iterator.knownSize` refer to
+ // `IterableOnceOps#size` to reduce the performance gap with `iterator.size`.
+ if (iterator.knownSize > 0) iterator.knownSize.toLong
+ else {
+ var count = 0L
+ while (iterator.hasNext) {
+ count += 1L
+ iterator.next()
+ }
+ count
+ }
+ }
+}
diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala
index 56392f0239c..aef79c7882c 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -1908,18 +1908,9 @@ private[spark] object Utils extends Logging {
}
/**
- * Counts the number of elements of an iterator using a while loop rather than calling
- * [[scala.collection.Iterator#size]] because it uses a for loop, which is slightly slower
- * in the current version of Scala.
+ * Counts the number of elements of an iterator.
*/
- def getIteratorSize(iterator: Iterator[_]): Long = {
- var count = 0L
- while (iterator.hasNext) {
- count += 1L
- iterator.next()
- }
- count
- }
+ def getIteratorSize(iterator: Iterator[_]): Long = Iterators.size(iterator)
/**
* Generate a zipWithIndex iterator, avoid index value overflowing problem
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org