Posted to user@spark.apache.org by Haopu Wang <HW...@qilinsoft.com> on 2014/09/26 10:04:18 UTC
Spark SQL question: is cached SchemaRDD storage controlled by "spark.storage.memoryFraction"?
Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.
The question is: the SchemaRDD has been cached with the "cacheTable()"
function. So is the cached SchemaRDD's memory storage controlled
by the "spark.storage.memoryFraction" parameter?
Thanks!
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Spark SQL question: how to control the storage level of cached SchemaRDD?
Posted by Michael Armbrust <mi...@databricks.com>.
You might consider instead storing the data using saveAsParquetFile and
then querying that after running
sqlContext.parquetFile(...).registerTempTable(...).
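For reference, a minimal sketch of that approach against the Spark 1.1-era API (the path and table name below are placeholders, not values from this thread; assumes an existing SparkContext `sc` and a SchemaRDD `bigTable`):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Write the data out as Parquet: columnar and compressed on disk,
// rather than held in the block-manager cache.
bigTable.saveAsParquetFile("hdfs:///tmp/big_table.parquet")

// Query the Parquet file directly; only the columns a query touches are
// read, which reduces both GC pressure and memory footprint.
sqlContext.parquetFile("hdfs:///tmp/big_table.parquet")
  .registerTempTable("big_table")
sqlContext.sql("SELECT COUNT(*) FROM big_table")
```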
On Sun, Sep 28, 2014 at 6:43 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> This is not possible until https://github.com/apache/spark/pull/2501 is
> merged.
>
> On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang <HW...@qilinsoft.com> wrote:
>
>> Thanks for the response. From Spark Web-UI's Storage tab, I do see
>> cached RDD there.
>>
>> But the storage level is "Memory Deserialized 1x Replicated". How can I
>> change the storage level? Because I have a big table there.
>>
>> Thanks!
Re: Spark SQL question: how to control the storage level of cached SchemaRDD?
Posted by Michael Armbrust <mi...@databricks.com>.
This is not possible until https://github.com/apache/spark/pull/2501 is
merged.
On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang <HW...@qilinsoft.com> wrote:
> Thanks for the response. From Spark Web-UI's Storage tab, I do see
> cached RDD there.
>
>
>
> But the storage level is "Memory Deserialized 1x Replicated". How can I
> change the storage level? Because I have a big table there.
>
>
>
> Thanks!
Spark SQL question: how to control the storage level of cached SchemaRDD?
Posted by Haopu Wang <HW...@qilinsoft.com>.
Thanks for the response. From the Spark Web UI's Storage tab, I do see the cached RDD there.
But its storage level is "Memory Deserialized 1x Replicated". How can I change the storage level? I ask because I have a big table there.
Thanks!
________________________________
From: Cheng Lian [mailto:lian.cs.zju@gmail.com]
Sent: September 26, 2014 21:24
To: Haopu Wang; user@spark.apache.org
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by "spark.storage.memoryFraction"?
Yes it is. The in-memory storage used with SchemaRDD also uses RDD.cache() under the hood.
Re: Spark SQL question: is cached SchemaRDD storage controlled by
"spark.storage.memoryFraction"?
Posted by Cheng Lian <li...@gmail.com>.
Yes it is. The in-memory storage used with SchemaRDD also uses
RDD.cache() under the hood.
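Concretely, in the Spark 1.1-era API (a sketch; the table name is a placeholder):

```scala
// cacheTable() builds an in-memory columnar form of the table and persists
// the underlying RDD via RDD.cache(), i.e. StorageLevel.MEMORY_ONLY.
// Those cached blocks live in the block-manager pool sized by
// spark.storage.memoryFraction * executor heap.
sqlContext.cacheTable("big_table")

// The cache is populated lazily, on the first scan of the table; it then
// appears on the Web UI's Storage tab as "Memory Deserialized 1x Replicated".
sqlContext.sql("SELECT COUNT(*) FROM big_table").collect()
```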
Fwd: Spark SQL question: is cached SchemaRDD storage controlled by "spark.storage.memoryFraction"?
Posted by Liquan Pei <li...@gmail.com>.
---------- Forwarded message ----------
From: Liquan Pei <li...@gmail.com>
Date: Fri, Sep 26, 2014 at 1:33 AM
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by
"spark.storage.memoryFraction"?
To: Haopu Wang <HW...@qilinsoft.com>
Hi Haopu,
Internally, cacheTable on a SchemaRDD is implemented as a cache() on a
MapPartitionsRDD. Since the memory reserved for caching RDDs is controlled
by spark.storage.memoryFraction, the memory storage of a cached SchemaRDD
is likewise controlled by spark.storage.memoryFraction.
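As a hedged illustration of tuning that parameter (the values below are examples, not recommendations from this thread; in Spark 1.x the defaults are 0.6 for storage and 0.2 for shuffle):

```scala
import org.apache.spark.SparkConf

// Give cached RDD blocks a larger share of the executor heap. The sum of
// the storage and shuffle fractions should stay well below 1.0 so that
// task objects still have room; otherwise GC pressure can get worse.
val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.7") // default 0.6
  .set("spark.shuffle.memoryFraction", "0.2") // default 0.2
```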
Hope this helps!
Liquan
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst