Posted to issues@spark.apache.org by "Jakub Liska (JIRA)" <ji...@apache.org> on 2016/03/15 13:00:37 UTC

[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM

    [ https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195157#comment-15195157 ] 

Jakub Liska commented on SPARK-12546:
-------------------------------------

Hey, I migrated to 1.6.0. Is it possible that it somehow relates to this really weird problem? There are plenty of resources available and the sample data is really tiny:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

// Read the TSV files from S3, split each line on tabs and wrap the fields in a Row
val coreRdd = sc.textFile("s3n://gwiq-views-p/external/core/tsv/*.tsv").map(_.split("\t")).map(fields => Row(fields: _*))
val coreDataFrame = sqlContext.createDataFrame(coreRdd, schema)
coreDataFrame.registerTempTable("core")
coreDataFrame.persist(StorageLevel.DISK_ONLY)
{code}
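
For context, the {{schema}} referenced above is not shown in the snippet. A minimal sketch of how such a schema could be built (since {{split("\t")}} yields strings, an all-string schema is the simplest fit); the column names here are hypothetical:
{code}
// Minimal sketch, not part of the original snippet: column names are made up.
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Seq(
  StructField("user_id", StringType, nullable = true),
  StructField("event",   StringType, nullable = true),
  StructField("ts",      StringType, nullable = true)
))
{code}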

{code}
SELECT COUNT(*) FROM core
{code}

{code}
------ Create new SparkContext spark://master:7077 -------
Exception in thread "pool-1-thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:66)
	at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:69)
	at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
	at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:188)
	at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:146)
	at com.google.gson.Gson.fromJson(Gson.java:791)
	at com.google.gson.Gson.fromJson(Gson.java:757)
	at com.google.gson.Gson.fromJson(Gson.java:706)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.convert(RemoteInterpreterServer.java:417)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.getProgress(RemoteInterpreterServer.java:384)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1376)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1361)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}

> Writing to partitioned parquet table can fail with OOM
> ------------------------------------------------------
>
>                 Key: SPARK-12546
>                 URL: https://issues.apache.org/jira/browse/SPARK-12546
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Nong Li
>            Assignee: Michael Armbrust
>            Priority: Blocker
>              Labels: releasenotes
>             Fix For: 1.6.1, 2.0.0
>
>
> It is possible to have jobs fail with OOM when writing to a partitioned parquet table. While this was probably always possible, it is more likely in 1.6 due to the memory manager changes. The unified memory manager enables Spark to use more of the process memory (in particular, for execution) which gets us in this state more often. This issue can happen for libraries that consume a lot of memory, such as parquet. Prior to 1.6, these libraries would more likely use memory that spark was not using (i.e. from the storage pool). In 1.6, this storage memory can now be used for execution.
> There are a couple of configs that can help with this issue.
>   - parquet.memory.pool.ratio: This is a parquet config for how much of the heap the parquet writers should use. This defaults to 0.95. Consider a much lower value (e.g. 0.1).
>   - spark.memory.fraction: This is a spark config to control how much of the memory should be allocated to spark. Consider setting this to 0.6.
> This should cause jobs to potentially spill more but require less memory. More aggressive tuning will control this trade-off.
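
For reference, a minimal sketch (not from the ticket) of how the two settings from the quoted description could be applied when writing a partitioned parquet table. The app name, paths and partition column are hypothetical; the config keys are the ones named above:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("partitioned-parquet-write")       // hypothetical app name
  .set("spark.memory.fraction", "0.6")           // give Spark a smaller share of the heap

val sc = new SparkContext(conf)
// parquet.memory.pool.ratio is a Parquet (Hadoop) setting, so it goes on the Hadoop conf
sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.1")

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.parquet("s3n://some-bucket/input")        // hypothetical input path
df.write.partitionBy("date").parquet("s3n://some-bucket/output")   // hypothetical output path
{code}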



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org