Posted to user@spark.apache.org by Jerry Lam <ch...@gmail.com> on 2015/07/27 20:37:17 UTC
Unexpected performance issues with Spark SQL using Parquet
Hi spark users and developers,
I have been trying to understand how Spark SQL works with Parquet for
the past couple of days. There is an unexpected performance problem
related to column pruning. Here is a dummy example:
The parquet file has the following 3 fields:
|-- customer_id: string (nullable = true)
|-- type: string (nullable = true)
|-- mapping: map (nullable = true)
| |-- key: string
| |-- value: string (nullable = true)
Note that mapping is just a field with a lot of key value pairs.
I just created a parquet file with 1 billion entries, each entry having
10 key-value pairs in the mapping.
After generating this parquet file, I generated another parquet file
without the mapping field, that is:
|-- customer_id: string (nullable = true)
|-- type: string (nullable = true)
Let's call the first parquet file data-with-mapping and the second
parquet file data-without-mapping.
Then I ran a very simple query over the two parquet files:
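For reference, the two files can be generated with something along these
lines (a minimal sketch for the Spark 1.4 shell, using a much smaller row
count than the real test; the paths and the SQLContext setup are
assumptions):

```scala
// Sketch: generate the two test files. `sc` is the shell's SparkContext.
// Row count is reduced here; the real test used 1 billion entries.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// `type` is a Scala keyword, so it needs backticks as a field name.
case class Entry(customer_id: String, `type`: String,
                 mapping: Map[String, String])

val withMapping = sc.parallelize(1 to 1000000).map { i =>
  Entry(s"customer-$i", s"type-${i % 10}",
        (1 to 10).map(k => s"key-$k" -> s"value-$k").toMap)
}.toDF()

withMapping.write.parquet("/tmp/data-with-mapping")
// Same rows minus the mapping column.
withMapping.drop("mapping").write.parquet("/tmp/data-without-mapping")
```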
val df = sqlContext.read.parquet(path)
df.select(df("type")).count
The run on the data-with-mapping takes 34 seconds with the input size
of 11.7 MB.
The run on the data-without-mapping takes 8 seconds with the input size of
7.6 MB.
Both ran on the same cluster with Spark 1.4.1.
What bothers me the most is the input size, because I assumed column
pruning would only deserialize the columns that are relevant to the
query (in this case the field type); yet it clearly reads more data from
data-with-mapping than from data-without-mapping. The query is also 4x
faster on data-without-mapping, which suggests that the more columns a
parquet file has, the slower it is, even when only a single column is
needed.
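For what it's worth, one way to check which columns Spark actually
requests from Parquet is to print the physical plan (a minimal sketch;
the exact plan output format varies by Spark version, and the path is an
assumption):

```scala
// The physical plan's ParquetRelation line should show that only the
// `type` column is requested if pruning is in effect.
val df = sqlContext.read.parquet("/tmp/data-with-mapping")
df.select(df("type")).explain(true)
```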
Does anyone have an explanation for this? I was expecting both of them
to finish in approximately the same time.
Best Regards,
Jerry
Re: Unexpected performance issues with Spark SQL using Parquet
Posted by Cheng Lian <li...@gmail.com>.
Hi Jerry,
Thanks for the detailed report! I haven't investigated this issue in
detail yet. But as for the input size, I believe it is due to a
limitation of the HDFS API: the Hadoop FileSystem adds the size of a
whole block to the metrics even if you only touch a fraction of that
block. In Parquet, all columns within a single row group are stored in a
single HDFS block. This is probably the reason why you observed weird
task input size. You may find more information in one of my earlier
posts
http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3C54C9899E.2030408@gmail.com%3E
For the performance issue, I don't have a proper explanation yet; it
needs further investigation.
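If you want to verify the row-group layout yourself, you can inspect the
file's footer with the parquet-mr metadata API (a rough sketch; the
part-file name is an assumption, and on older parquet-mr versions the
package is `parquet.hadoop` rather than `org.apache.parquet.hadoop`):

```scala
// Sketch: print, per row group, the total size and how many bytes each
// column chunk occupies, to see what fraction of a block `type` uses.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("/tmp/data-with-mapping/part-00000.parquet"))  // hypothetical name

for (block <- footer.getBlocks.asScala) {
  println(s"row group: ${block.getTotalByteSize} bytes total")
  for (col <- block.getColumns.asScala) {
    println(s"  ${col.getPath}: ${col.getTotalSize} bytes")
  }
}
```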
Cheng
On 7/28/15 2:37 AM, Jerry Lam wrote:
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org