Posted to user@spark.apache.org by Hyukjin Kwon <gu...@gmail.com> on 2015/08/11 05:02:37 UTC

Inquiry about contributing code

Dear Sir / Madam,

I have a plan to contribute some code for passing filters to a data
source during physical planning.

More specifically, I understand that when we want to push filter operations
down to a data source such as Parquet (so that filtering happens while
actually reading HDFS blocks, rather than in memory through Spark
operations), we need to implement

PrunedFilteredScan, PrunedScan, or CatalystScan in the package
org.apache.spark.sql.sources.



For PrunedFilteredScan and PrunedScan, the filters are passed as objects
from the package org.apache.spark.sql.sources; these do not come directly
from the query parser but are built by selectFilters() in
org.apache.spark.sql.sources.DataSourceStrategy.

It appears that not all the filters (that is, the raw expressions) are
passed to the function below in PrunedFilteredScan and PrunedScan.

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

The filters passed here are defined in the package
org.apache.spark.sql.sources.
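
To make this concrete, here is a rough sketch of a relation implementing
PrunedFilteredScan against the 1.x data source API; the class name and the
filter handling are illustrative only, not code from Spark itself:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical relation; it only shows where pushed-down filters arrive.
class ExampleRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("field", IntegerType)))

  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] = {
    // Only predicates that selectFilters() could translate into
    // org.apache.spark.sql.sources.Filter objects show up here.
    filters.foreach {
      case EqualTo(attribute, value) =>
        // e.g. `WHERE field = 1` arrives and can be pushed to the source
      case _ =>
        // an untranslated expression (such as null-safe equality)
        // never reaches this point and must be filtered in memory
    }
    sqlContext.sparkContext.emptyRDD[Row] // placeholder scan
  }
}
```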

On the other hand, the EqualNullSafe filter in the package
org.apache.spark.sql.catalyst.expressions is not passed down, even though
it looks possible to push it down for data sources such as Parquet and JSON.



I understand that CatalystScan can receive all the raw expressions, since
it has access to the query planner. However, it is experimental, requires a
different interface, and is unstable for reasons such as binary
compatibility.

As far as I know, Parquet does not use it either.



In general, this can become an issue when a user sends queries such as:

1.

SELECT *
FROM table
WHERE field = 1;


2.

SELECT *
FROM table
WHERE field <=> 1;


The second query can be very slow because of the heavy network traffic
caused by unfiltered data coming from the source RDD.
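
For reference, the difference comes from null-safe equality semantics:
`<=>` treats two NULLs as equal, whereas `=` evaluates to NULL when either
side is NULL. A minimal sketch of that semantics in plain Scala (the
function name is mine, not Spark's):

```scala
// Null-safe equality, as SQL's <=> behaves: NULL <=> NULL is true,
// while NULL = NULL evaluates to NULL under ordinary equality.
def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = (a, b) match {
  case (None, None)       => true   // both NULL: equal
  case (Some(x), Some(y)) => x == y // both present: ordinary equality
  case _                  => false  // one NULL, one present: not equal
}
```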



Also, I could not find an existing issue for this (except for
https://issues.apache.org/jira/browse/SPARK-8747, which says binary
capability is now supported).

Accordingly, I would like to open an issue and submit a pull request with
my code.


Could you please comment on this?

Thanks.

Re: Inquiry about contributing code

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You can create a new issue and send a pull request for the same, I think.

+ dev list

Thanks
Best Regards

On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon <gu...@gmail.com> wrote:

> Dear Sir / Madam,
>
> I have a plan to contribute some code for passing filters to a data
> source during physical planning.
>
> More specifically, I understand that when we want to push filter operations
> down to a data source such as Parquet (so that filtering happens while
> actually reading HDFS blocks, rather than in memory through Spark
> operations), we need to implement
>
> PrunedFilteredScan, PrunedScan, or CatalystScan in the package
> org.apache.spark.sql.sources.
>
>
>
> For PrunedFilteredScan and PrunedScan, the filters are passed as objects
> from the package org.apache.spark.sql.sources; these do not come directly
> from the query parser but are built by selectFilters() in
> org.apache.spark.sql.sources.DataSourceStrategy.
>
> It appears that not all the filters (that is, the raw expressions) are
> passed to the function below in PrunedFilteredScan and PrunedScan.
>
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
>
> The filters passed here are defined in the package
> org.apache.spark.sql.sources.
>
> On the other hand, the EqualNullSafe filter in the package
> org.apache.spark.sql.catalyst.expressions is not passed down, even though
> it looks possible to push it down for data sources such as Parquet and JSON.
>
>
>
> I understand that CatalystScan can receive all the raw expressions, since
> it has access to the query planner. However, it is experimental, requires a
> different interface, and is unstable for reasons such as binary
> compatibility.
>
> As far as I know, Parquet does not use it either.
>
>
>
> In general, this can become an issue when a user sends queries such as:
>
> 1.
>
> SELECT *
> FROM table
> WHERE field = 1;
>
>
> 2.
>
> SELECT *
> FROM table
> WHERE field <=> 1;
>
>
> The second query can be very slow because of the heavy network traffic
> caused by unfiltered data coming from the source RDD.
>
>
>
> Also, I could not find an existing issue for this (except for
> https://issues.apache.org/jira/browse/SPARK-8747, which says binary
> capability is now supported).
>
> Accordingly, I would like to open an issue and submit a pull request with
> my code.
>
>
> Could you please comment on this?
>
> Thanks.
>
>
