Posted to user@spark.apache.org by mrm <ma...@skimlinks.com> on 2014/11/24 17:20:47 UTC

advantages of SparkSQL?

Hi,

Is there any advantage to storing data in Parquet format and loading it
with the SparkSQL context, but never registering it as a table or using
SQL on it? Something like:

data = sqc.parquetFile(path)
results = data.map(lambda x: applyfunc(x.field))

Is this faster/more optimised than having the data stored as a text file and
using Spark (non-SQL) to process it?





Re: advantages of SparkSQL?

Posted by mrm <ma...@skimlinks.com>.
Thank you for answering, this is all very helpful!





Re: advantages of SparkSQL?

Posted by Cheng Lian <li...@gmail.com>.
For the “never register a table” part: you actually *can* use Spark SQL
without registering a table, via its Scala DSL. Say you want to extract
an Int field named key from the table and double it:

import org.apache.spark.sql.catalyst.dsl._

val data = sqc.parquetFile(path)

// Turn a plain Scala function into a Spark SQL UDF and apply it to the
// key field, aliasing the output column as 'result.
val double = (i: Int) => i * 2
data.select(double.call('key) as 'result).collect()

SchemaRDD.select constructs a proper SQL logical plan, which makes Spark
SQL aware of the schema and enables the Parquet column pruning
optimization. The double.call('key) part is the expression DSL, which
turns a plain Scala function into a Spark SQL UDF and applies it to the
key field.

Note that the .call method is only available in the most recent master
and branch-1.2.


Re: advantages of SparkSQL?

Posted by Michael Armbrust <mi...@databricks.com>.
Akshat is correct about the benefits of Parquet as a columnar format, but
I'll add that some of this is lost if you just use a lambda function to
process the data. Since your lambda function is a black box, Spark SQL
does not know which columns it is going to use and thus will do a full
table scan. I'd suggest writing a very simple SQL query that pulls out
just the columns you need and does any filtering before dropping back
into standard Spark operations. The result of a SQL query is an RDD of
rows, so you can do any normal Spark processing you want on them.

Either way, though, it will often be faster than a text file due to
better encoding/compression.
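
As a rough PySpark sketch of that pattern (the table name "events" and
the Parquet path are made up for illustration, and applyfunc is a
stand-in for whatever per-record processing you need):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-query")
sqc = SQLContext(sc)

path = "hdfs:///data/events.parquet"  # hypothetical location of the data

def applyfunc(value):
    # stand-in for whatever per-record processing you need
    return value

# Load the Parquet file and register it so it can be queried with SQL.
data = sqc.parquetFile(path)
data.registerTempTable("events")

# Select only the column you need; Spark SQL can then read just that
# column from the Parquet file instead of scanning the whole table.
rows = sqc.sql("SELECT field FROM events WHERE field IS NOT NULL")

# The result is an RDD of Row objects, so normal Spark operations apply.
results = rows.map(lambda row: applyfunc(row.field))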


Re: advantages of SparkSQL?

Posted by Akshat Aranya <aa...@gmail.com>.
Parquet is a column-oriented format, which means that you need to read in
less data from the file system if you're only interested in a subset of
your columns.  Also, Parquet pushes down selection predicates, which can
eliminate needless deserialization of rows that don't match a selection
criterion.  Other than that, you would also get compression, and likely
save processor cycles when parsing lines from text files.
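
To make that concrete, here is a rough PySpark sketch of converting a
text file to Parquet so that later jobs get those benefits (the paths
and field names are hypothetical):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="text-to-parquet")
sqc = SQLContext(sc)

# Parse a tab-separated text file into Row objects.
lines = sc.textFile("hdfs:///data/events.txt")
rows = lines.map(lambda l: l.split("\t")) \
            .map(lambda p: Row(key=int(p[0]), value=p[1]))

# Infer a schema and write the data out as Parquet. Later reads can then
# prune unused columns and push down predicates instead of reparsing text.
schema_rdd = sqc.inferSchema(rows)
schema_rdd.saveAsParquetFile("hdfs:///data/events.parquet")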


