Posted to reviews@spark.apache.org by CodingCat <gi...@git.apache.org> on 2017/11/01 20:16:42 UTC

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/16578
  
    I ran a simple test in a single-node Spark environment.
     
    I used a synthetic dataset of 20 million rows, generated as follows:
     
    ```scala
    import org.apache.spark.sql.SaveMode
    import spark.implicits._
     
    case class Job(title: String, department: String)
     
    case class Person(id: Int, name: String, job: Job)
     
    // Write 20M rows with a nested `job` struct as Parquet.
    (0 until 20000000)
      .map(id => Person(id, id.toString, Job(id.toString, id.toString)))
      .toDF
      .write.mode(SaveMode.Overwrite)
      .parquet("/home/zhunan/parquet_test")
    ```
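     
    For reference, the resulting Parquet file carries a nested schema; a quick sanity check (not part of the original test) is to print it back:
     
    ```scala
    // Confirm the nested layout produced by the case classes above.
    spark.read.parquet("/home/zhunan/parquet_test").printSchema()
    // root
    //  |-- id: integer (nullable = false)
    //  |-- name: string (nullable = true)
    //  |-- job: struct (nullable = true)
    //  |    |-- title: string (nullable = true)
    //  |    |-- department: string (nullable = true)
    ```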
     
    Then I read the directory back and write only the nested column to another location:
     
    ```scala
    val df = spark.read.parquet("/home/zhunan/parquet_test")
    df.select("job.title").write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_out")
    ```
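     
    To see the pruning in the plan itself (a sketch; the exact FileScan output format varies across Spark versions), one can inspect the physical plan and look at the ReadSchema of the Parquet scan:
     
    ```scala
    // With the patch, ReadSchema should shrink to the accessed leaf only.
    df.select("job.title").explain()
    // Expected (abridged), with the patch:
    //   ... FileScan parquet ... ReadSchema: struct<job:struct<title:string>>
    // Without it, the whole struct is read:
    //   ... ReadSchema: struct<job:struct<title:string,department:string>>
    ```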
    
    
    Without the patch, the job reads 169 MB of input; with the patch, it reads around 86 MB.
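    
    The bytes-read figures can be taken from the Spark UI; one way to collect the same input metric programmatically (a minimal sketch using the public SparkListener API, not necessarily how the numbers above were measured) is:
    
    ```scala
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    
    // Sum bytes read across all tasks; run the query, then read the total.
    val bytesRead = new AtomicLong(0L)
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        // taskMetrics can be null for failed tasks on some Spark versions
        Option(taskEnd.taskMetrics).foreach { m =>
          bytesRead.addAndGet(m.inputMetrics.bytesRead)
        }
      }
    })
    
    // ... run the read/write above, then:
    println(s"total input bytes read: ${bytesRead.get()}")
    ```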
    
    Basically, this shows the PR is working: the scan reads only the `job.title` leaf column instead of the whole `job` struct, roughly halving the bytes read.

