Posted to user@spark.apache.org by Mike Wheeler <ro...@gmail.com> on 2017/06/03 16:31:14 UTC
Parquet Read Speed: Spark SQL vs Parquet MR
Hi Spark User,
I have run into a situation where Spark SQL is much slower than Parquet
MR when processing Parquet files. Can you provide some guidance on
optimization?
Suppose I have a table "person" with columns gender, age, name, address,
etc., stored in Parquet files. I tried two ways to read the table
in Spark:
1) Define a case class Person(gender: String, age: Int, etc.), then use the
Spark SQL Dataset API:
val ds = spark.read.parquet("...").as[Person]
2) Define an Avro record "record Person {string gender, int age, etc}". Then
use parquet-avro
<https://github.com/apache/parquet-mr/tree/master/parquet-avro> and
newAPIHadoopFile:
val rdd = sc.newAPIHadoopFile(
  path,
  classOf[ParquetInputFormat[avro.Person]],
  classOf[Void],
  classOf[avro.Person],
  job.getConfiguration).values
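For completeness, here is roughly how I set this up end to end. Note this is a sketch: the snippet above omits how `job` is constructed, so the Job creation and the AvroReadSupport registration below are my reconstruction of the usual parquet-avro wiring, not copied verbatim from my code:

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat

// Build a Hadoop Job carrying the read configuration (assumed setup).
val job = Job.getInstance(sc.hadoopConfiguration)
// Tell ParquetInputFormat to materialize each row as an Avro record.
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[avro.Person]])

val rdd = sc.newAPIHadoopFile(
  path,
  classOf[ParquetInputFormat[avro.Person]],
  classOf[Void],
  classOf[avro.Person],
  job.getConfiguration
).values
```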
I then compared three actions (Spark 2.1.1, parquet-avro 1.8.1):
a) ds.filter(" gender='female' and age > 50").count // 1 min
b) ds.filter(" gender='female'").filter(_.age > 50).count // 15 min
c) rdd.filter(r => r.gender == "female" && r.age > 50).count // 7 min
I can understand why a) is faster than c): since a) is expressed entirely as
a SQL-style query, Spark can apply many optimizations (such as column pruning
and predicate pushdown, so it never fully deserializes the objects). But I
don't understand why b) is so much slower than c), because I assume both
require full deserialization.
Is there anything I can try to improve b)?
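One variant I am considering (a sketch, assuming I understand the Column API correctly): expressing the age condition with Column expressions instead of a Scala lambda, so that the whole predicate stays visible to Catalyst and can be pushed down, like a):

```scala
import org.apache.spark.sql.functions.col

// Same predicate as b), but as Column expressions rather than _.age > 50,
// so Catalyst can optimize both conditions instead of running opaque JVM code.
val n = ds.filter(col("gender") === "female" && col("age") > 50).count
```

Is that the right way to think about it, or is there something else going on in b)?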
Thanks,
Mike