Posted to user@spark.apache.org by Harish Butani <rh...@gmail.com> on 2022/01/14 00:49:34 UTC

Spark on Oracle available as an Apache licensed open source repo

Spark on Oracle is now available as an open source Apache licensed github
repo <https://github.com/oracle/spark-oracle>. Build and deploy it as an
extension jar in your Spark clusters.
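Deployment typically amounts to adding the jar and enabling the optimizer extensions in your Spark configuration. A minimal sketch (the jar path and extension class name below are placeholders, not the project's actual values; see the repo's build instructions):

```
# spark-defaults.conf (illustrative values only)
spark.jars            /opt/spark/jars/spark-oracle_2.12.jar
spark.sql.extensions  org.apache.spark.sql.oracle.SparkSessionExtensions
```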

Use it to combine Apache Spark programs with data in your existing Oracle
databases without expensive data copying or query time data movement.

The core capability is a set of Optimizer extensions that collapse SQL operator
sub-graphs into an OraScan that executes equivalent SQL in Oracle. Physical
plan parallelism
<https://github.com/oracle/spark-oracle/wiki/Query-Splitting> can be
controlled to split Spark tasks to operate on Oracle data block ranges,
on result-set pages, or on table partitions.
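As an illustrative sketch (TPCDS-style table names, not taken from the project docs), a Spark SQL query like

```sql
SELECT c.c_customer_id, SUM(s.ss_net_paid) AS total_paid
FROM store_sales s
JOIN customer c ON s.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_customer_id
```

would have its scan-join-aggregate sub-graph collapsed into a single OraScan, with the equivalent join-plus-aggregate SQL executed inside Oracle and each Spark task fetching one split of the result.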

We push down large parts of Spark SQL to Oracle; for example, 95 of 99 TPCDS
queries are pushed down completely.
<https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries>

With Spark SQL macros
<https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> you can
write custom Spark UDFs that get translated and pushed down as Oracle SQL
expressions.
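As a hedged sketch of the idea (the function name and expression are hypothetical, not from the wiki): a query calling a UDF registered as a SQL macro, such as

```sql
SELECT i_item_id, taxed_price(i_category, i_current_price) FROM item
```

could, after macro translation, be pushed to Oracle as a plain SQL expression, e.g.

```sql
SELECT i_item_id,
       CASE WHEN i_category = 'Electronics'
            THEN i_current_price * 1.08 * 0.95
            ELSE i_current_price * 1.08
       END
FROM item
```

so the UDF body never has to execute row-by-row in Spark.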

With DML pushdown <https://github.com/oracle/spark-oracle/wiki/DML-Support>,
inserts in Spark SQL are pushed down as transactionally consistent
inserts/updates on Oracle tables.
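For example (table names are illustrative), an ordinary Spark SQL statement such as

```sql
INSERT INTO ora_catalog.sales_history
SELECT * FROM staging_sales WHERE sale_date = DATE '2022-01-13'
```

would run as a single transactionally consistent insert inside Oracle.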

See the Quick Start Guide
<https://github.com/oracle/spark-oracle/wiki/Quick-Start-Guide> for how to
set up an Oracle free-tier ADW instance, load it with TPCDS data, and try
out the Spark on Oracle Demo
<https://github.com/oracle/spark-oracle/wiki/Demo> on your Spark cluster.

More details can be found in our blog
<https://hbutani.github.io/blogs/blog/Spark_on_Oracle_Blog.html> and the
project wiki <https://github.com/oracle/spark-oracle/wiki>.

regards,
Harish Butani

Re: Spark on Oracle available as an Apache licensed open source repo

Posted by Harish Butani <rh...@gmail.com>.
- Happy to link to Apache Spark. Will add the link to the README.
- Sorry, I didn't know about the trademark rules. Let me think about a name, though 'Oracle Translator for Apache Spark' sounds pretty good.

regards,
Harish.

> On Jan 13, 2022, at 5:00 PM, Sean Owen <sr...@gmail.com> wrote:
> 
> -user
> Thank you for this, but just a small but important point about the use of the Spark name. Please take a look at https://spark.apache.org/trademarks.html <https://spark.apache.org/trademarks.html>
> Specifically, this should reference "Apache Spark" at least once prominently with a link to the project.
> It's also advisable to avoid using "Spark" in a project or product name entirely. "Oracle Translator for Apache Spark" or something like that would be more in line with trademark guidance.
> 
> On Thu, Jan 13, 2022 at 6:50 PM Harish Butani <rhbutani.spark@gmail.com <ma...@gmail.com>> wrote:
> ...


Re: Spark on Oracle available as an Apache licensed open source repo

Posted by Sean Owen <sr...@gmail.com>.
-user
Thank you for this, but just a small but important point about the use of
the Spark name. Please take a look at
https://spark.apache.org/trademarks.html
Specifically, this should reference "Apache Spark" at least once
prominently with a link to the project.
It's also advisable to avoid using "Spark" in a project or product name
entirely. "Oracle Translator for Apache Spark" or something like that would
be more in line with trademark guidance.

On Thu, Jan 13, 2022 at 6:50 PM Harish Butani <rh...@gmail.com>
wrote:

> ...

Re: Spark on Oracle available as an Apache licensed open source repo

Posted by Harish Butani <rh...@gmail.com>.
Look at the pushdown plans for all the TPCDS queries here <https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries>.
We push joins, aggregates, windowing, etc.; as I said, we can do complete pushdown of 95 of 99 TPCDS queries.
The generic JDBC datasource pushes only single-table scans, filters, and partial aggregates. In that case a lot of data is moved from the Oracle instance to Spark during query execution.

Beyond this, the SQL Macro <https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> feature can translate certain kinds of UDFs into Oracle expressions, which again avoids a lot of data movement: instead of UDF execution happening in Spark, an equivalent Oracle expression is evaluated in Oracle.

This works on on-premise Oracle, currently tested on 19c.

regards,
Harish.

> On Jan 14, 2022, at 2:51 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hello,
> 
> Thanks for this info.
> 
> Have you tested this feature on Oracle on-premise say, 11c, 12c besides ADW in Cloud?
> 
> I can see the transactional feature useful in terms of commit/rollback to Oracle but I cannot figure out the performance gains in your blog etc.
> 
> My concern is we currently connect to Oracle as well as many other JDBC compliant databases  through Spark generic JDBC connections <https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html> with the same look and feel. Unless there is an overriding reason, I don't  see why there is a need to switch to this feature.
> 
> 
> Cheers
> 
>    view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> 
> On Fri, 14 Jan 2022 at 00:50, Harish Butani <rhbutani.spark@gmail.com <ma...@gmail.com>> wrote:
> ...



Re: Spark on Oracle available as an Apache licensed open source repo

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hello,

Thanks for this info.

Have you tested this feature on on-premise Oracle, say 11c or 12c, besides ADW
in the cloud?

I can see the transactional feature being useful in terms of commit/rollback to
Oracle, but I cannot figure out the performance gains from your blog etc.

My concern is that we currently connect to Oracle, as well as many other JDBC-compliant
databases, through Spark generic JDBC connections
<https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html> with the
same look and feel. Unless there is an overriding reason, I don't see why
there is a need to switch to this feature.
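For comparison, the generic path we use today looks like this (connection values are placeholders):

```sql
CREATE TEMPORARY VIEW ora_sales
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1',
  dbtable 'SALES',
  user 'scott',
  password 'tiger'
)
```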


Cheers


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 14 Jan 2022 at 00:50, Harish Butani <rh...@gmail.com>
wrote:

> ...
