Posted to user@spark.apache.org by Cheng Lian <li...@gmail.com> on 2015/03/15 18:35:12 UTC

Re: Running spark function on parquet without sql

That's an unfortunate documentation bug in the programming guide... We 
failed to update it after making the change.

Cheng

On 2/28/15 8:13 AM, Deborah Siegel wrote:
> Hi Michael,
>
> Would you help me understand the apparent difference here?
>
> The Spark 1.2.1 programming guide indicates:
>
> "Note that if you call |schemaRDD.cache()| rather than 
> |sqlContext.cacheTable(...)|, tables will /not/ be cached using the 
> in-memory columnar format, and therefore 
> |sqlContext.cacheTable(...)| is strongly recommended for this use case."
>
> Yet the API doc shows that:
>
>
>         def cache(): SchemaRDD.this.type
>         <https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html>
>
>         Overridden cache function will always use the in-memory
>         columnar caching.
>
>
>
> links
> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
> https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD
>
> Thanks
> Sincerely
> Deb
>
> On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust 
> <michael@databricks.com> wrote:
>
>         From Zhan Zhang's reply, yes, I still get Parquet's advantage.
>
>     You will need to use at least SQL or the DataFrame API (coming in
>     Spark 1.3) to specify the columns that you want in order to get
>     the Parquet benefits. The rest of your operations can be
>     standard Spark.
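>
>     For example, here is a minimal sketch against the Spark 1.2 API
>     (the file path and names are made up, and sqlContext is assumed
>     to be an existing org.apache.spark.sql.SQLContext):
>
>         // Load a Parquet file as a SchemaRDD and register it for SQL.
>         val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
>         people.registerTempTable("people")
>
>         // Selecting only the columns you need lets Parquet skip the
>         // rest on disk (column pruning).
>         val names = sqlContext.sql("SELECT name FROM people")
>
>         // A SchemaRDD is also an RDD[Row], so everything from here on
>         // is standard Spark.
>         names.map(row => row.getString(0).toUpperCase).take(10).foreach(println)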
>
>         My next question is: if I operate on a SchemaRDD, will I get
>         the advantage of Spark SQL's in-memory columnar store when
>         caching the table using cacheTable()?
>
>
>     Yes, since Spark 1.2, SchemaRDDs always use the in-memory columnar
>     cache for both cacheTable() and .cache().
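>
>     For instance, continuing the hypothetical sketch above:
>
>         // Both calls now go through the in-memory columnar store:
>         sqlContext.cacheTable("people")  // cache by table name
>         people.cache()                   // the overridden SchemaRDD.cache() does the same
>
>         // Release the cached data when finished:
>         sqlContext.uncacheTable("people")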
>
>