Posted to issues@spark.apache.org by "Kohki Nishio (Jira)" <ji...@apache.org> on 2021/09/13 03:17:00 UTC
[jira] [Commented] (SPARK-36733) Perf issue in SchemaPruning when a
struct has million fields
[ https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413882#comment-17413882 ]
Kohki Nishio commented on SPARK-36733:
--------------------------------------
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69]
Often (as far as I have observed), the left struct and the right struct are the same one. Every call to {{StructType.fieldNames}} runs {{fields.map(_.name)}}, and in the code below {{leftStruct.fieldNames}} is evaluated once for every field name of the right struct.
This computation is quite expensive for 10K fields.
{{val filteredRightFieldNames = rightStruct.fieldNames}}
{{  .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))}}
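To illustrate the cost pattern, here is a minimal, self-contained sketch of hoisting the field-name lookup out of the filter. It is only an illustration under stated assumptions, not the actual Spark code or a proposed patch: {{Field}}, {{Struct}}, {{FieldNameHoisting}}, {{filterNaive}} and {{filterHoisted}} are hypothetical stand-ins, and {{Struct.fieldNames}} is assumed to rebuild the name array on every call, as described above.

```scala
// Hypothetical stand-ins for Spark's StructField / StructType,
// reduced to what is needed to show the repeated work.
final case class Field(name: String)

final case class Struct(fields: Array[Field]) {
  // Assumption taken from the comment above: a new array is built on every call.
  def fieldNames: Array[String] = fields.map(_.name)
}

object FieldNameHoisting {
  // A resolver decides whether two field names refer to the same field.
  type Resolver = (String, String) => Boolean

  // Same shape as the quoted snippet: left.fieldNames is re-evaluated
  // once for every field name of the right struct.
  def filterNaive(left: Struct, right: Struct, resolver: Resolver): Array[String] =
    right.fieldNames.filter(name => left.fieldNames.exists(resolver(_, name)))

  // Same result, but left.fieldNames is computed once and reused.
  def filterHoisted(left: Struct, right: Struct, resolver: Resolver): Array[String] = {
    val leftNames = left.fieldNames
    right.fieldNames.filter(name => leftNames.exists(resolver(_, name)))
  }
}
```

Hoisting only removes the repeated array construction; the {{exists}} scan inside the filter still compares every right name against the left names, so the comparison count is unchanged.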
> Perf issue in SchemaPruning when a struct has million fields
> ------------------------------------------------------------
>
> Key: SPARK-36733
> URL: https://issues.apache.org/jira/browse/SPARK-36733
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: Kohki Nishio
> Priority: Major
>
> Seeing significant performance degradation in query processing when a table contains a very large number of fields (>10K).
> Here is a stack trace captured while processing a query:
> {code:java}
> java.lang.Thread.State: RUNNABLE
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>   at scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown Source)
>   at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303)
>   at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown Source)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296)
>   at scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:394)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:394)
>   at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown Source)
> {code}