Posted to dev@spark.apache.org by ankits <an...@gmail.com> on 2015/02/02 21:23:45 UTC
Re: Get size of rdd in memory
Thanks for your response. So AFAICT,
calling parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count() will let me see the size of the SchemaRDD in memory,
and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show me the size of a regular RDD.
But this will not show the size when using cacheTable(), right? Like if I call
parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.registerTempTable("test")
sqc.cacheTable("test")
sqc.sql("SELECT COUNT(*) FROM test")
the web UI does not show the size of the cached table.
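(For reference, cached sizes can also be read programmatically instead of from the web UI. A minimal sketch, assuming a running SparkContext `sc`; note that `getRDDStorageInfo` is a developer API, so its shape may change between releases:)

```scala
// Sketch: query cached RDD storage from the driver instead of the web UI.
// Assumes an existing SparkContext `sc`; KV is the case class used above.
case class KV(key: Int, value: String)

val rdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString)).cache()
rdd.count() // action: materializes the cache

// getRDDStorageInfo lists every cached RDD with its memory/disk footprint.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory, " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}
```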
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366p10388.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: Get size of rdd in memory
Posted by Cheng Lian <li...@gmail.com>.
It's already fixed in the master branch. Sorry that we forgot to update
this before releasing 1.2.0 and caused you trouble...
Cheng
On 2/2/15 2:03 PM, ankits wrote:
> Great, thank you very much. I was confused because this is in the docs:
>
> https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
> "branch-1.2" branch,
> https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
>
> "Note that if you call schemaRDD.cache() rather than
> sqlContext.cacheTable(...), tables will not be cached using the in-memory
> columnar format, and therefore sqlContext.cacheTable(...) is strongly
> recommended for this use case.".
>
> If this is no longer accurate, I could make a PR to remove it.
>
Re: Get size of rdd in memory
Posted by ankits <an...@gmail.com>.
Great, thank you very much. I was confused because this is in the docs:
https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
"branch-1.2" branch,
https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
"Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will not be cached using the in-memory
columnar format, and therefore sqlContext.cacheTable(...) is strongly
recommended for this use case.".
If this is no longer accurate, I could make a PR to remove it.
Re: Get size of rdd in memory
Posted by Cheng Lian <li...@gmail.com>.
Actually `SchemaRDD.cache()` behaves exactly the same as `cacheTable`
since Spark 1.2.0. The reason why your web UI didn’t show you the cached
table is that both `cacheTable` and `sql("SELECT ...")` are lazy :-)
Simply add a `.collect()` after the `sql(...)` call.
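In other words, something like the following should make the cached table appear under the web UI's Storage tab (a sketch against the Spark 1.2 API, assuming an existing SparkContext `sc` and SQLContext `sqc`):

```scala
import sqc.createSchemaRDD // Spark 1.2 implicit: RDD[Product] => SchemaRDD

case class KV(key: Int, value: String)

val rdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))
rdd.registerTempTable("test")

sqc.cacheTable("test")                         // lazy: nothing is cached yet
sqc.sql("SELECT COUNT(*) FROM test").collect() // action: populates the
                                               // in-memory columnar cache
```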
Cheng
On 2/2/15 12:23 PM, ankits wrote:
> Thanks for your response. So AFAICT,
>
> calling parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count() will let me see the size of the SchemaRDD in memory,
>
> and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show me the size of a regular RDD.
>
> But this will not show the size when using cacheTable(), right? Like if I call
>
> parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.registerTempTable("test")
> sqc.cacheTable("test")
> sqc.sql("SELECT COUNT(*) FROM test")
>
> the web UI does not show the size of the cached table.