Posted to dev@spark.apache.org by Gil Vernik <GI...@il.ibm.com> on 2015/03/18 14:46:21 UTC

parquet support - some questions about code

Hi,

I am trying to better understand the code for Parquet support.
In particular, I got lost trying to understand ParquetRelation and 
ParquetRelation2. Is ParquetRelation2 the new code that should 
completely replace ParquetRelation? (I think there is a remark in the 
code noting this.)

Assume I am using:
spark.sql.parquet.filterPushdown = true
spark.sql.parquet.useDataSourceApi = true

I saw that the buildScan method in newParquet.scala pushes filters down 
into Parquet, but I also saw filter and projection push-down in 
ParquetOperations inside SparkStrategies.scala.
However, every time I debug it, the match in

  object ParquetOperations extends Strategy {
    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
      ..........

never reaches

      case PhysicalOperation(projectList, filters: Seq[Expression], relation: ParquetRelation) =>

In which cases will it match this case?

Also, where is the code for Parquet projection and filter push-down? Is it 
inside ParquetOperations in SparkStrategies.scala, or inside buildScan in 
newParquet.scala? Or both? If both, I am not sure how they work together...

Thanks,
Gil.

Re: parquet support - some questions about code

Posted by Cheng Lian <li...@gmail.com>.
Hey Gil,

ParquetRelation2 is based on the external data sources API, which is a 
more modular and non-intrusive way to add external data sources to Spark 
SQL. We are planning to replace ParquetRelation with ParquetRelation2 
entirely after the latter is more mature and stable. That's why you see 
two separate sets of Parquet code in the code base, and currently they 
also share part of the code.

In Spark 1.3, the new Parquet data source (ParquetRelation2) is enabled 
by default, so the entry points for the projection and filter push-down 
code are in newParquet.scala.
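
To illustrate why the ParquetOperations case is not reached when the data 
source path is in use, here is a toy model of strategy pattern matching. 
It is only a sketch: OldParquetRelation, NewParquetRelation, 
LogicalRelation and ToyPlanner are hypothetical stand-ins, not Spark's 
real classes, and it assumes that a relation created through the data 
sources API reaches the planner wrapped in a generic logical node, while 
the old relation is a logical plan node itself.

```scala
// Toy model only: these types mimic the *shape* of the planner's pattern
// match. They are hypothetical stand-ins, NOT Spark's real classes.
sealed trait LogicalPlan

// Stand-in for the old built-in relation, which is itself a plan node.
case class OldParquetRelation(path: String) extends LogicalPlan

// Stand-in for a relation built on the data sources API (assumption:
// it is not a plan node, so it must be wrapped before planning).
case class NewParquetRelation(path: String)
case class LogicalRelation(relation: NewParquetRelation) extends LogicalPlan

object ToyPlanner {
  // Mirrors a strategy whose case is typed on the old relation.
  def oldStrategy(plan: LogicalPlan): Option[String] = plan match {
    case OldParquetRelation(p) => Some(s"old ParquetOperations scan of $p")
    case _                     => None
  }

  // Mirrors a strategy that unwraps the generic node and calls buildScan.
  def dataSourceStrategy(plan: LogicalPlan): Option[String] = plan match {
    case LogicalRelation(NewParquetRelation(p)) => Some(s"buildScan on $p")
    case _                                      => None
  }

  // Try each strategy in turn and keep the first one that matches.
  def plan(p: LogicalPlan): String =
    dataSourceStrategy(p)
      .orElse(oldStrategy(p))
      .getOrElse("no strategy matched")
}

object ToyPlannerDemo {
  def main(args: Array[String]): Unit = {
    // Wrapped new-style relation: only the data source strategy matches.
    println(ToyPlanner.plan(LogicalRelation(NewParquetRelation("/data/t"))))
    // Old-style relation: only the old strategy matches.
    println(ToyPlanner.plan(OldParquetRelation("/data/t")))
  }
}
```

In this sketch the case typed on the old relation can only match when the 
plan actually contains an old-style relation, which would correspond to 
running with spark.sql.parquet.useDataSourceApi = false.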

Cheng


