Posted to reviews@spark.apache.org by CodingCat <gi...@git.apache.org> on 2017/11/01 20:16:42 UTC
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user CodingCat commented on the issue:
https://github.com/apache/spark/pull/16578
I ran a simple test in a single-node Spark environment, using a synthetic dataset of 20 million rows generated as follows:
```scala
import spark.implicits._
import org.apache.spark.sql.SaveMode

case class Job(title: String, department: String)
case class Person(id: Int, name: String, job: Job)

(0 until 20000000)
  .map(id => Person(id, id.toString, Job(id.toString, id.toString)))
  .toDF
  .write.mode(SaveMode.Overwrite)
  .parquet("/home/zhunan/parquet_test")
```
I then read the directory back and wrote a single nested column to another location:
```scala
import org.apache.spark.sql.SaveMode

val df = spark.read.parquet("/home/zhunan/parquet_test")
df.select("job.title").write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_out")
```
Without the patch, the job reads 169 MB; with the patch, it reads around 86 MB.
This demonstrates that the PR's nested column pruning works as intended.
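For reference, the read reduction implied by the numbers above (169 MB without the patch, 86 MB with it) works out to roughly half; a quick sketch of the arithmetic:

```scala
// Input sizes reported in the test above, in MB.
val bytesWithoutPatch = 169.0 // MB read without the patch
val bytesWithPatch    = 86.0  // MB read with the patch

// Fraction of bytes no longer read, as a percentage.
val reductionPct = (1.0 - bytesWithPatch / bytesWithoutPatch) * 100

println(f"Read reduction: $reductionPct%.1f%%")
```

So pruning the nested `job.title` column skips about 49% of the bytes that a full `job` struct read would scan.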
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org