Posted to user@spark.apache.org by java8964 <ja...@hotmail.com> on 2015/03/09 20:15:15 UTC

From the Spark web UI, how to prove that Parquet column pruning is working

Hi, currently most of the data in our production is stored as Avro + Snappy. I want to test the benefits of storing the data in Parquet format instead, so I changed our ETL to generate Parquet rather than Avro, and I want to run a simple SQL query in Spark SQL to verify the benefits of Parquet.
I generated the same dataset in both Avro and Parquet in HDFS and loaded both into Spark SQL. When I run the same query, like "select column1 from src_table_avro/parquet where column2=xxx", the job runs much faster against the Parquet format. The test files for both formats are around 930M. The Avro job generated 8 tasks to read the data, with a median duration of 21s, while the Parquet job generated 7 tasks, with a median duration of 0.4s.
Since the dataset has more than 100 columns, I can see that Parquet really does read fast. But my question is: in the Spark UI, both jobs show 900M as the input size (and 0 for the rest), so how do I know that column pruning is really working? I assume that is why the Parquet file can be read so fast, but is there any statistic on the Spark UI that can prove it to me? Something like: the total input file size is 900M, but only 10M was actually read thanks to column pruning? Then, whenever column pruning does not work in Parquet for some kind of SQL query, I could identify it right away.
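
(A plan-level check, added here as a hedged sketch rather than part of the 
original post; the table and column names are the placeholders from the 
query above.) The physical plan shows which columns the Parquet scan 
actually requests, so pruning can be verified without any I/O metrics:

```scala
// Sketch for a Spark 1.2-era spark-shell, assuming a SQLContext named
// `sqlContext` and the placeholder table/column names from the query above.
val query = sqlContext.sql("select column1 from src_table_parquet where column2 = 'xxx'")

// If pruning works, the Parquet scan should request only column1 and
// column2, not all 100+ columns; expect something like:
//   ParquetTableScan [column1#0,column2#1], ...
println(query.queryExecution.executedPlan)
```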
Thanks
Yong

Re: From the Spark web UI, how to prove that Parquet column pruning is working

Posted by Cheng Lian <li...@gmail.com>.
Hey Yong,

It seems that Hadoop `FileSystem` adds the size of a block to the 
metrics even if you only touch a fraction of it (reading Parquet 
metadata for example). This behavior can be verified by the following 
snippet:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext

// Assumes a spark-shell session where `sc` is the SparkContext.
val sqlContext = new SQLContext(sc)
import sc._
import sqlContext._

case class KeyValue(key: Int, value: String)

// Write ~200 million small rows (20M keys x 10 duplicates) so the
// Parquet file spans several HDFS blocks.
parallelize(1 to 1024 * 1024 * 20).
  flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

// Read Parquet metadata on the task side and push the filter down to Parquet.
hadoopConfiguration.set("parquet.task.side.metadata", "true")
sql("SET spark.sql.parquet.filterPushdown=true")

// Filter on a key no row has, and return an empty iterator from every
// partition, so nothing beyond the Parquet footers should actually be read.
parquetFile("large.parquet").where('key === 0).queryExecution.toRdd.mapPartitions { _ =>
  new Iterator[Row] {
    def hasNext = false
    def next() = ???
  }
}.collect()
```

Apparently we’re reading nothing here (except for the Parquet metadata in 
the footers), but the web UI still suggests that the total input size of 
the tasks equals the file size.
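
If you want a number that reflects the bytes actually read, one rough 
local-mode check (a sketch, not part of the original reply) is to poll 
Hadoop's own `FileSystem` counters after running the scan. This only 
works when the reads happen in the driver JVM, e.g. spark-shell in 
local mode:

```scala
// A rough sketch (assumed setup: spark-shell in local mode, so the reads
// happen in this JVM). Hadoop's FileSystem keeps per-scheme read counters;
// compare bytesRead against the file size on disk.
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

FileSystem.getAllStatistics.asScala.foreach { stats =>
  println(s"${stats.getScheme}: ${stats.getBytesRead} bytes read")
}
```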

Cheng

