Posted to dev@spark.apache.org by "james.green9@baesystems.com" <ja...@baesystems.com> on 2015/11/19 16:14:28 UTC

new datasource


We have written a new Spark DataSource that uses both Parquet and ElasticSearch; it is based on the existing Parquet DataSource. When I look at the filters being pushed down to buildScan, I don’t get anything representing filters based on UDFs, or filters on fields generated by an explode. I had thought that if I made it a CatalystScan I would get everything I needed.



This is fine from the Parquet point of view, but we are using ElasticSearch to index and filter the data we are searching, so I need to be able to capture the UDF conditions, or to have access to the plan AST, so that I can construct a query for ElasticSearch.
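For context on why those filters never arrive: the predicates that reach PrunedFilteredScan.buildScan have already been translated into the small sources.Filter algebra (EqualTo, GreaterThan, and so on), and anything with no equivalent there, such as a UDF call, is simply not handed to the source. The sketch below illustrates that translation step with simplified stand-in classes, not Spark's real ones:

```scala
// Simplified stand-ins for Catalyst predicates and the sources.Filter
// algebra; illustrative mocks, not Spark's actual classes.
sealed trait Predicate
case class AttrEquals(attr: String, value: Any) extends Predicate
case class AttrGreaterThan(attr: String, value: Any) extends Predicate
case class UdfCall(name: String, args: Seq[String]) extends Predicate

sealed trait SourceFilter
case class EqualTo(attribute: String, value: Any) extends SourceFilter
case class GreaterThan(attribute: String, value: Any) extends SourceFilter

// Mirrors the planner's translation step: predicates with no
// sources.Filter equivalent map to None and never reach buildScan.
def translate(p: Predicate): Option[SourceFilter] = p match {
  case AttrEquals(a, v)      => Some(EqualTo(a, v))
  case AttrGreaterThan(a, v) => Some(GreaterThan(a, v))
  case UdfCall(_, _)         => None // a UDF has no Filter representation
}

val pushed = Seq(AttrEquals("id", 1), UdfCall("myUdf", Seq("col")))
  .flatMap(translate)
// pushed contains only the EqualTo filter; the UDF predicate is gone
```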



I am thinking I might just need to patch Spark to do this, but I’d prefer not to if there is a way around this without hacking the core code. Any ideas?



Thanks



James



Please consider the environment before printing this email. This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately. Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies under the control of BAE Systems Applied Intelligence Limited, details of which can be found at http://www.baesystems.com/Businesses/index.htm.

Re: new datasource

Posted by Michael Armbrust <mi...@databricks.com>.
Yeah, CatalystScan should give you everything we can possibly push down in
raw form.  Note that it is not compatible across different Spark versions.
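In other words, a CatalystScan receives the raw Catalyst Expression trees rather than the pre-translated sources.Filters, so the relation can walk them and pick out UDF invocations itself. The sketch below uses mock expression classes in place of Catalyst's real hierarchy (and, per the caveat above, the actual CatalystScan trait is not a stable cross-version API):

```scala
// Mock expression tree standing in for Catalyst's Expression hierarchy;
// class names are illustrative, not Spark's real signatures.
sealed trait Expr { def children: Seq[Expr] }
case class Attribute(name: String) extends Expr { val children = Nil }
case class Literal(value: Any) extends Expr { val children = Nil }
case class EqualTo(left: Expr, right: Expr) extends Expr {
  val children = Seq(left, right)
}
case class ScalaUDF(name: String, args: Seq[Expr]) extends Expr {
  val children = args
}

// With raw expressions available, a relation can collect the UDF
// invocations that the sources.Filter translation would have dropped.
def collectUdfs(e: Expr): Seq[ScalaUDF] = e match {
  case u: ScalaUDF => u +: u.children.flatMap(collectUdfs)
  case other       => other.children.flatMap(collectUdfs)
}

val pred = EqualTo(ScalaUDF("normalize", Seq(Attribute("title"))),
                   Literal("spark"))
// collectUdfs(pred) recovers the "normalize" UDF call from the tree
```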

On Thu, Nov 19, 2015 at 8:55 AM, james.green9@baesystems.com <
james.green9@baesystems.com> wrote:

> Thanks Hao
>
>
>
> I have written a new Data Source based on ParquetRelation, and I have just
> retested what I said about not getting anything extra when switching to a
> CatalystScan instead of a PrunedFilteredScan; oops, it seems to work fine
> after all.

RE: new datasource

Posted by "james.green9@baesystems.com" <ja...@baesystems.com>.
Thanks Hao

I have written a new Data Source based on ParquetRelation, and I have just retested what I said about not getting anything extra when switching to a CatalystScan instead of a PrunedFilteredScan; oops, it seems to work fine after all.

RE: new datasource

Posted by "Cheng, Hao" <ha...@intel.com>.
I think you will need to write some code to support ES; as I understand it, there are two options:

Create a new Data Source from scratch, implementing the interface at:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L751

Or reuse most of the code in ParquetRelation in the new DataSource, adding your own logic; see
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L285
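With either option, the relation ends up performing the same predicate split: translate what ElasticSearch can evaluate into an ES query, and keep the rest as a residual for Spark to apply after the scan. A rough sketch of that split; the predicate classes and the JSON clause format here are illustrative only, not the elasticsearch-hadoop API:

```scala
// Hypothetical predicate split for a combined Parquet + ElasticSearch
// relation; names and the clause shape are illustrative assumptions.
sealed trait Pred
case class TermEquals(field: String, value: String) extends Pred
case class UdfPred(description: String) extends Pred

// Predicates ES can serve become query clauses; everything else stays
// behind as a residual for Spark to evaluate on the scan's output.
def split(preds: Seq[Pred]): (Seq[String], Seq[Pred]) = {
  val (served, residual) = preds.partition {
    case _: TermEquals => true
    case _             => false
  }
  val clauses = served.collect {
    case TermEquals(f, v) => s"""{"term":{"$f":"$v"}}"""
  }
  (clauses, residual)
}

val (esClauses, leftover) =
  split(Seq(TermEquals("user", "james"), UdfPred("myUdf(col) > 0")))
// esClauses feed the ES query; leftover is re-applied by Spark
```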

Hope this helps.

Hao