Posted to user@spark.apache.org by Timothy Potter <th...@gmail.com> on 2016/07/27 13:59:08 UTC

Possible to push sub-queries down into the DataSource impl?

Take this simple join:

SELECT m.title AS title, solr.aggCount AS aggCount
FROM movies m
INNER JOIN (
  SELECT movie_id, COUNT(*) AS aggCount
  FROM ratings
  WHERE rating >= 4
  GROUP BY movie_id
  ORDER BY aggCount DESC
  LIMIT 10
) AS solr ON solr.movie_id = m.movie_id
ORDER BY aggCount DESC

I would like the ability to push the inner sub-query aliased as "solr"
down into the data source engine, in this case Solr, as it would
greatly reduce the amount of data that has to be transferred from
Solr into Spark. I imagine this issue comes up frequently when the
underlying engine is a JDBC data source as well ...

Is this possible? Of course, my example is a bit cherry-picked, so
determining whether a sub-query can be pushed down into the data
source engine is probably not a trivial task, but I'm wondering if
Spark has the hooks to allow me to try ;-)
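
For what it's worth, with a JDBC source I believe the whole sub-query
can already be handed to the database through the dbtable option, so
the filter, aggregation, and limit run remotely. A rough, untested
sketch (jdbcUrl is a placeholder connection string):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// The database, not Spark, evaluates the sub-query; Spark only sees
// the ten (movie_id, aggCount) rows that come back.
val solr = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable",
    "(SELECT movie_id, COUNT(*) AS aggCount FROM ratings " +
    "WHERE rating >= 4 GROUP BY movie_id " +
    "ORDER BY aggCount DESC LIMIT 10) AS solr")
  .load()
solr.createOrReplaceTempView("solr")

I'd like the same to happen for Solr, but driven by the planner rather
than hand-written per query.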

Cheers,
Tim



Re: Possible to push sub-queries down into the DataSource impl?

Posted by Timothy Potter <th...@gmail.com>.
Yes, that's exactly what I was looking for, thanks for the pointer ;-)

On Thu, Jul 28, 2016 at 1:07 AM, Takeshi Yamamuro <li...@gmail.com> wrote:
> Hi,
>
> Have you seen this ticket?
> https://issues.apache.org/jira/browse/SPARK-12449
>
> // maropu
>
> On Thu, Jul 28, 2016 at 2:13 AM, Timothy Potter <th...@gmail.com>
> wrote:
>>
>> I'm not looking for a one-off solution for a specific query that can
>> be solved on the client side as you suggest, but rather a generic
>> solution that can be implemented within the DataSource impl itself
>> when it knows a sub-query can be pushed down into the engine. In other
>> words, I'd like to intercept the query planning process to be able to
>> push down computation into the engine when it makes sense.
>>
>> On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
>> <in...@gmail.com> wrote:
>> > Why don't you create a filtered DataFrame, register it as a temporary
>> > table, and then use it in your query? You can also cache it, if multiple
>> > queries over the same inner sub-query are expected.
>> >
>> >
>> > On Wednesday, July 27, 2016, Timothy Potter <th...@gmail.com>
>> > wrote:
>> >>
>> >> Take this simple join:
>> >>
>> >> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
>> >> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
>> >> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
>> >> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>> >>
>> >> I would like the ability to push the inner sub-query aliased as "solr"
>> >> down into the data source engine, in this case Solr as it will
>> >> greatly reduce the amount of data that has to be transferred from
>> >> Solr into Spark. I would imagine this issue comes up frequently if the
>> >> underlying engine is a JDBC data source as well ...
>> >>
>> >> Is this possible? Of course, my example is a bit cherry-picked so
>> >> determining if a sub-query can be pushed down into the data source
>> >> engine is probably not a trivial task, but I'm wondering if Spark has
>> >> the hooks to allow me to try ;-)
>> >>
>> >> Cheers,
>> >> Tim
>> >>
>> >>
>> >
>> >
>> > --
>> > Ing. Marco Colombo
>>
>>
>
>
>
> --
> ---
> Takeshi Yamamuro



Re: Possible to push sub-queries down into the DataSource impl?

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

Have you seen this ticket?
https://issues.apache.org/jira/browse/SPARK-12449

// maropu

On Thu, Jul 28, 2016 at 2:13 AM, Timothy Potter <th...@gmail.com>
wrote:

> I'm not looking for a one-off solution for a specific query that can
> be solved on the client side as you suggest, but rather a generic
> solution that can be implemented within the DataSource impl itself
> when it knows a sub-query can be pushed down into the engine. In other
> words, I'd like to intercept the query planning process to be able to
> push down computation into the engine when it makes sense.
>
> On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
> <in...@gmail.com> wrote:
> > Why don't you create a filtered DataFrame, register it as a temporary
> > table, and then use it in your query? You can also cache it, if multiple
> > queries over the same inner sub-query are expected.
> >
> >
> > On Wednesday, July 27, 2016, Timothy Potter <th...@gmail.com>
> > wrote:
> >>
> >> Take this simple join:
> >>
> >> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
> >> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
> >> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
> >> solr.movie_id = m.movie_id ORDER BY aggCount DESC
> >>
> >> I would like the ability to push the inner sub-query aliased as "solr"
> >> down into the data source engine, in this case Solr as it will
> >> greatly reduce the amount of data that has to be transferred from
> >> Solr into Spark. I would imagine this issue comes up frequently if the
> >> underlying engine is a JDBC data source as well ...
> >>
> >> Is this possible? Of course, my example is a bit cherry-picked so
> >> determining if a sub-query can be pushed down into the data source
> >> engine is probably not a trivial task, but I'm wondering if Spark has
> >> the hooks to allow me to try ;-)
> >>
> >> Cheers,
> >> Tim
> >>
> >>
> >
> >
> > --
> > Ing. Marco Colombo
>
>
>


-- 
---
Takeshi Yamamuro

Re: Possible to push sub-queries down into the DataSource impl?

Posted by Timothy Potter <th...@gmail.com>.
I'm not looking for a one-off solution for a specific query that can
be solved on the client side as you suggest, but rather a generic
solution that can be implemented within the DataSource impl itself
when it knows a sub-query can be pushed down into the engine. In other
words, I'd like to intercept the query planning process to be able to
push down computation into the engine when it makes sense.
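
Concretely, I'm picturing something like Spark 2.0's experimental
planner hook. A minimal sketch, assuming a hypothetical physical
operator (SolrScanExec and the commented-out plan shape are made up):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A custom planner strategy: when the logical plan has a shape the
// data source can evaluate natively (e.g. aggregate + sort + limit
// over a Solr relation), emit a physical node that ships the whole
// sub-query to Solr; otherwise return Nil so Spark's built-in
// strategies take over.
object SolrPushDownStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // case Limit(n, Sort(order, _, Aggregate(grouping, aggs, solrRel))) =>
    //   SolrScanExec(...) :: Nil  // hypothetical physical operator
    case _ => Nil
  }
}

val spark = SparkSession.builder().getOrCreate()
spark.experimental.extraStrategies = Seq(SolrPushDownStrategy)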

On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
<in...@gmail.com> wrote:
> Why don't you create a filtered DataFrame, register it as a temporary
> table, and then use it in your query? You can also cache it, if multiple
> queries over the same inner sub-query are expected.
>
>
> On Wednesday, July 27, 2016, Timothy Potter <th...@gmail.com>
> wrote:
>>
>> Take this simple join:
>>
>> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
>> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
>> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
>> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>>
>> I would like the ability to push the inner sub-query aliased as "solr"
>> down into the data source engine, in this case Solr as it will
>> greatly reduce the amount of data that has to be transferred from
>> Solr into Spark. I would imagine this issue comes up frequently if the
>> underlying engine is a JDBC data source as well ...
>>
>> Is this possible? Of course, my example is a bit cherry-picked so
>> determining if a sub-query can be pushed down into the data source
>> engine is probably not a trivial task, but I'm wondering if Spark has
>> the hooks to allow me to try ;-)
>>
>> Cheers,
>> Tim
>>
>>
>
>
> --
> Ing. Marco Colombo



Re: Possible to push sub-queries down into the DataSource impl?

Posted by Marco Colombo <in...@gmail.com>.
Why don't you create a filtered DataFrame, register it as a temporary
table, and then use it in your query? You can also cache it, if multiple
queries over the same inner sub-query are expected.
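
A rough sketch of what I mean (the "solr" format and its options are
just placeholders for however you load the ratings data):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().getOrCreate()

// Build the inner sub-query as a DataFrame, cache it, and expose it
// as a temporary table so SQL can join against it by name.
val ratings = spark.read.format("solr")
  .option("collection", "ratings")  // placeholder source option
  .load()

val topMovies = ratings
  .filter("rating >= 4")
  .groupBy("movie_id")
  .count()                                  // yields a "count" column
  .withColumnRenamed("count", "aggCount")
  .orderBy(desc("aggCount"))
  .limit(10)
  .cache()                                  // reuse across queries

topMovies.createOrReplaceTempView("solr")
// Then: SELECT m.title, solr.aggCount FROM movies m
//       JOIN solr ON solr.movie_id = m.movie_id ORDER BY aggCount DESC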

On Wednesday, July 27, 2016, Timothy Potter <th...@gmail.com> wrote:

> Take this simple join:
>
> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>
> I would like the ability to push the inner sub-query aliased as "solr"
> down into the data source engine, in this case Solr as it will
> greatly reduce the amount of data that has to be transferred from
> Solr into Spark. I would imagine this issue comes up frequently if the
> underlying engine is a JDBC data source as well ...
>
> Is this possible? Of course, my example is a bit cherry-picked so
> determining if a sub-query can be pushed down into the data source
> engine is probably not a trivial task, but I'm wondering if Spark has
> the hooks to allow me to try ;-)
>
> Cheers,
> Tim
>
>
>

-- 
Ing. Marco Colombo