Posted to user@spark.apache.org by Cheng Lian <li...@gmail.com> on 2015/03/15 18:35:12 UTC
Re: Running spark function on parquet without sql
That's an unfortunate documentation bug in the programming guide... We
failed to update it after making the change.
Cheng
On 2/28/15 8:13 AM, Deborah Siegel wrote:
> Hi Michael,
>
> Would you help me understand the apparent difference here?
>
> The Spark 1.2.1 programming guide indicates:
>
> "Note that if you call schemaRDD.cache() rather than
> sqlContext.cacheTable(...), tables will *not* be cached using the
> in-memory columnar format, and therefore
> sqlContext.cacheTable(...) is strongly recommended for this use case."
>
> Yet the API doc shows:
>
> def cache(): SchemaRDD.this.type
> (https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html)
>
> "Overridden cache function will always use the in-memory
> columnar caching."
>
>
>
> links
> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
> https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD
>
> Thanks
> Sincerely
> Deb
>
> On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust
> <michael@databricks.com> wrote:
>
> From Zhan Zhang's reply, yes I still get the parquet's advantage.
>
> You will need to at least use SQL or the DataFrame API (coming in
> Spark 1.3) to specify the columns that you want in order to get
> the parquet benefits. The rest of your operations can be
> standard Spark.
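
[Editor's sketch of the column-pruning point above, against the Spark 1.2 API; the file name "people.parquet" and the `name` column are hypothetical, and a running SparkContext `sc` is assumed:]

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val sqlContext = new SQLContext(sc)

// Load a hypothetical Parquet file and register it as a table.
val people = sqlContext.parquetFile("people.parquet")
people.registerTempTable("people")

// Selecting only the columns you need is what lets Parquet skip
// reading the other columns from disk (column pruning).
val names = sqlContext.sql("SELECT name FROM people")

// From here, the result is a SchemaRDD and standard RDD
// operations apply.
names.map(row => row.getString(0)).take(10)
```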
>
> My next question is, if I operate on a SchemaRDD, will I get the
> advantage of Spark SQL's in-memory columnar store when the table
> is cached using cacheTable()?
>
>
> Yes, since Spark 1.2, SchemaRDDs always use the in-memory
> columnar cache for both cacheTable and .cache().
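
[Editor's sketch of the two equivalent caching paths Michael describes, assuming a Spark 1.2+ SQLContext with a table named "people" already registered; both use the in-memory columnar format:]

```scala
// Path 1: cache by table name.
sqlContext.cacheTable("people")

// Path 2: cache the SchemaRDD directly. In 1.2+ cache() is
// overridden on SchemaRDD, so this also uses the columnar format
// (the programming-guide warning to the contrary is outdated).
val schemaRdd = sqlContext.table("people")
schemaRdd.cache()
```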
>
>