Posted to dev@spark.apache.org by Akhil Das <ak...@sigmoidanalytics.com> on 2015/08/11 11:13:50 UTC
Re: Inquery about contributing codes

I think you can create a new issue and send a pull request for it.

+ dev list

Thanks
Best Regards

On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon <gu...@gmail.com> wrote:

> Dear Sir / Madam,
>
> I plan to contribute some code for passing filters to a data source
> during physical planning.
>
> In more detail, I understand that when we want to push filter operations
> down to a data source such as Parquet (filtering while actually reading
> the HDFS blocks, rather than filtering in memory with Spark operations),
> we need to implement PrunedFilteredScan, PrunedScan or CatalystScan in
> package org.apache.spark.sql.sources.
>
>
>
> For PrunedFilteredScan and PrunedScan, Spark passes filter objects from
> package org.apache.spark.sql.sources. These do not access the query parser
> directly; they are objects built by selectFilters() in
> org.apache.spark.sql.sources.DataSourceStrategy.
>
> It looks like not all the filters (or rather, raw expressions) are passed
> to the function below in PrunedFilteredScan and PrunedScan.
>
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
>
> The filters passed here are the ones defined in package
> org.apache.spark.sql.sources.
>
> However, the EqualNullSafe filter in package
> org.apache.spark.sql.catalyst.expressions is not passed, even though it
> looks possible to pass it to data sources such as Parquet and JSON.
>
>
>
> I understand that CatalystScan can take all the raw expressions, since it
> has access to the query planner. However, it is experimental and needs a
> different interface (and it is unstable for reasons such as binary
> compatibility).
>
> As far as I know, Parquet also does not use this.
>
>
>
> In general, this can be an issue when a user sends queries such as the
> following to the data:
>
> 1.
>
> SELECT *
> FROM table
> WHERE field = 1;
>
>
> 2.
>
> SELECT *
> FROM table
> WHERE field <=> 1;
>
>
> The second query can be extremely slow, because the data is not filtered
> at the source and the unfiltered source RDD causes heavy network traffic.
>
>
>
> Also, I could not find a proper issue for this (except for
> https://issues.apache.org/jira/browse/SPARK-8747, which says that binary
> capability is now supported).
>
> Accordingly, I would like to file this issue and make a pull request with
> my code.
>
>
> Could you please give me any comments on this?
>
> Thanks.
>
>
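
For reference, the difference between the two quoted queries (`=` versus `<=>`) and what a source could do if it received a null-safe equality filter can be sketched in plain Scala. This is only a rough illustration under stated assumptions: it does not use Spark, and every name in it (NullSafeFilterSketch, SketchFilter, SketchEqualTo, SketchEqualNullSafe, scan) is hypothetical and not part of org.apache.spark.sql.sources. SQL NULL is modelled as None.

```scala
// Hypothetical sketch: none of these names are Spark classes.
object NullSafeFilterSketch {
  // A row maps column names to values; None models SQL NULL.
  type Row = Map[String, Option[Any]]

  sealed trait SketchFilter
  case class SketchEqualTo(attribute: String, value: Option[Any]) extends SketchFilter
  case class SketchEqualNullSafe(attribute: String, value: Option[Any]) extends SketchFilter

  // `=`: the result is unknown (None) if either side is NULL.
  def equalTo(left: Option[Any], right: Option[Any]): Option[Boolean] =
    for (l <- left; r <- right) yield l == r

  // `<=>`: never unknown; two NULLs compare as equal.
  def equalNullSafe(left: Option[Any], right: Option[Any]): Boolean =
    left == right

  // What a source could do if it received the filter: keep only the rows
  // for which the predicate is definitely true, before returning any data.
  def scan(rows: Seq[Row], filter: SketchFilter): Seq[Row] =
    rows.filter { row =>
      filter match {
        case SketchEqualTo(a, v) =>
          equalTo(row.getOrElse(a, None), v).getOrElse(false)
        case SketchEqualNullSafe(a, v) =>
          equalNullSafe(row.getOrElse(a, None), v)
      }
    }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      Map("field" -> Some(1)),
      Map("field" -> Some(2)),
      Map("field" -> None)
    )
    println(scan(rows, SketchEqualTo("field", Some(1))))
    println(scan(rows, SketchEqualNullSafe("field", None)))
    println(scan(rows, SketchEqualTo("field", None)))
  }
}
```

With these sample rows, the first two scans each keep a single row and the third keeps none, because `=` against NULL is never true. The point of the sketch is only that a null-safe comparison is simple enough to evaluate at the source, so the rows it rejects would not need to travel over the network.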