Posted to issues@spark.apache.org by "Rio Wibowo (JIRA)" <ji...@apache.org> on 2018/12/10 10:59:00 UTC
[jira] [Created] (SPARK-26320) udf with multiple arrays as input
Rio Wibowo created SPARK-26320:
----------------------------------
Summary: udf with multiple arrays as input
Key: SPARK-26320
URL: https://issues.apache.org/jira/browse/SPARK-26320
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.3.1, 2.1.1
Reporter: Rio Wibowo
Spark fails with a GC out-of-memory error when a UDF that takes several array arguments is applied many times.
Description:
# 3 different input arrays
# Each array has 10 elements
UDF:
{code:java}
val getResult = udf(
  (index: Integer, array1: Seq[Integer], array2: Seq[Integer], array3: Seq[Double]) =>
    doSomeThing(index, array1, array2, array3)
)
{code}
DataFrame Schema :
root
|-- frequency_1: integer (nullable = true)
|-- code_1: integer (nullable = true)
|-- power_1: integer (nullable = false)
|-- frequency_2: integer (nullable = true)
|-- code_2: integer (nullable = true)
|-- power_2: integer (nullable = false)
|-- frequency_3: integer (nullable = true)
|-- code_3: integer (nullable = true)
|-- power_3: integer (nullable = false)
|-- frequency_4: integer (nullable = true)
|-- code_4: integer (nullable = true)
|-- power_4: integer (nullable = false)
|-- frequency_5: integer (nullable = true)
|-- code_5: integer (nullable = true)
|-- power_5: integer (nullable = false)
|-- frequency_6: integer (nullable = true)
|-- code_6: integer (nullable = true)
|-- power_6: integer (nullable = false)
|-- frequency_7: integer (nullable = true)
|-- code_7: integer (nullable = true)
|-- power_7: double (nullable = true)
|-- frequency_8: integer (nullable = true)
|-- code_8: integer (nullable = true)
|-- power_8: double (nullable = true)
|-- frequency_9: integer (nullable = true)
|-- code_9: integer (nullable = true)
|-- power_9: double (nullable = true)
|-- frequency_10: integer (nullable = true)
|-- code_10: integer (nullable = true)
|-- power_10: double (nullable = true)
Call the UDF via the withColumn function 10 times to enrich the DataFrame:
{code:java}
df.withColumn("out1", getResult(lit(0),
    array(col("frequency_1"), col("frequency_2"), ...., col("frequency_10")),
    array(col("code_1"), col("code_2"), ...., col("code_10")),
    array(col("power_1"), col("power_2"), ...., col("power_10"))))
  .withColumn("out2", getResult(lit(0),
    array(col("frequency_1"), col("frequency_2"), ...., col("frequency_10")),
    array(col("code_1"), col("code_2"), ...., col("code_10")),
    array(col("power_1"), col("power_2"), ...., col("power_10"))))
  .withColumn("out9", getResult(lit(9), ....
  .withColumn("out10", getResult(lit(10),
    array(col("frequency_1"), col("frequency_2"), ...., col("frequency_10")),
    array(col("code_1"), col("code_2"), ...., col("code_10")),
    array(col("power_1"), col("power_2"), ...., col("power_10"))))
{code}
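For reference, the repeated call pattern above can be factored so that each column-name group is built once and the ten withColumn steps become a single fold. The sketch below is Spark-free so it runs standalone (a Map stands in for the DataFrame, and the actual withColumn call is only indicated in a comment); column names are taken from the schema above, everything else is illustrative.

```scala
// Build each group of column names once; in Spark these would feed
// array(freqNames.map(col): _*) instead of 10 hand-written col("...") calls.
val freqNames  = (1 to 10).map(i => s"frequency_$i")
val codeNames  = (1 to 10).map(i => s"code_$i")
val powerNames = (1 to 10).map(i => s"power_$i")

// foldLeft threads the accumulator through the ten enrichment steps;
// a Map stands in for the DataFrame so this sketch runs without Spark.
val out = (0 until 10).foldLeft(Map.empty[String, Int]) { (acc, i) =>
  // in Spark: acc.withColumn(s"out${i + 1}", getResult(lit(i), freqs, codes, powers))
  acc + (s"out${i + 1}" -> i)
}
```

Factoring the array expressions out this way keeps the logical plan from repeating 30 col(...) sub-expressions per output column, which is one plausible place the memory pressure in this report comes from.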
Error Log :
{code:java}
12:56:08.461 [dispatcher-event-loop-3] ERROR o.a.s.scheduler.TaskSchedulerImpl - Lost executor driver on localhost: Executor heartbeat timed out after 150014 ms
[info] com.xxx.xx.xx.xx.xx.xx.xx.xx.CSVExporterSpec *** ABORTED *** (9 minutes, 24 seconds)
[info] java.lang.OutOfMemoryError: GC overhead limit exceeded
[info] ...
[error] Uncaught exception when running com.xxx.xx.xx.xx.xx.xx.xx.xx.CSVExporterSpec: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.AbstractList.iterator(AbstractList.java:288)
at org.apache.cassandra.gms.Gossiper.addLocalApplicationStates(Gossiper.java:1513)
at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1505)
at org.apache.cassandra.service.LoadBroadcaster$1.run(LoadBroadcaster.java:92)
at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$4/821749187.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745)
12:56:13.552 [ScheduledTasks:1] ERROR o.a.c.utils.JVMStabilityInspector - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.AbstractList.iterator(AbstractList.java:288)
at org.apache.cassandra.gms.Gossiper.addLocalApplicationStates(Gossiper.java:1513)
at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1505)
at org.apache.cassandra.service.LoadBroadcaster$1.run(LoadBroadcaster.java:92)
at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$4/821749187.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745){code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org