Posted to dev@spark.apache.org by Ran Bai <li...@icloud.com> on 2016/10/23 12:21:31 UTC

LIMIT issue of SparkSQL

Hi all,

I found the runtime for a query with or without the “LIMIT” keyword is the same. We looked into it and found that there are “GlobalLimit / LocalLimit” nodes in the logical plan, but no corresponding operators in the physical plan. Is this a bug or something else? Attached are the logical and physical plans when running "SELECT * FROM seq LIMIT 1".

Re: LIMIT issue of SparkSQL

Posted by Xiao Li <ga...@gmail.com>.
The rule SpecialLimits converted GlobalLimit / LocalLimit
to CollectLimitExec.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L74-L75

Spark will not scan all the records based on your plan. CollectLimitExec
should behave as you expected.
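For illustration, a minimal sketch (assuming a Spark 2.x local session and a toy stand-in for the `seq` table; the names and sizes here are made up) to confirm that the LIMIT shows up as a CollectLimit in the physical plan:

import org.apache.spark.sql.SparkSession

object CheckLimitPlan {
  def main(args: Array[String]): Unit = {
    // Local session used only for inspecting plans; settings are illustrative.
    val spark = SparkSession.builder()
      .appName("check-limit-plan")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical toy data standing in for the real `seq` table.
    spark.range(1000000L).toDF("value").createOrReplaceTempView("seq")

    // explain(true) prints the parsed, analyzed, optimized and physical plans;
    // the physical plan should contain CollectLimit rather than only a bare scan.
    spark.sql("SELECT * FROM seq LIMIT 1").explain(true)

    spark.stop()
  }
}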

Thanks,

Xiao





2016-10-23 20:40 GMT-07:00 Liz Bai <li...@icloud.com>:

> Hi all,
>
> Let me clarify the problem:
>
> Suppose we have a simple table `A` with 100 000 000 records
>
> Problem:
> When we execute the SQL query `select * from A limit 500`,
> it scans through all 100 000 000 records.
> Normal behaviour should be that the engine stops scanning once 500 records
> are found.
>
> Detailed observation:
> We found that there are “GlobalLimit / LocalLimit” physical operators
> https://github.com/apache/spark/blob/branch-2.0/sql/
> core/src/main/scala/org/apache/spark/sql/execution/limit.scala
> But during query plan generation, GlobalLimit / LocalLimit is not applied
> to the query plan.
>
> Could you please help us to inspect LIMIT problem?
> Thanks.
>
> Best,
> Liz
>
> On 23 Oct 2016, at 10:11 PM, Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Liz,
>
> CollectLimit means `Take the first `limit` elements and collect them to a
> single partition.`
>
> Thanks,
>
> Xiao
>
> 2016-10-23 5:21 GMT-07:00 Ran Bai <li...@icloud.com>:
>
>> Hi all,
>>
>> I found the runtime for a query with or without the “LIMIT” keyword is the
>> same. We looked into it and found that there are “GlobalLimit /
>> LocalLimit” nodes in the logical plan, but no corresponding operators in
>> the physical plan. Is this a bug or something else? Attached are the
>> logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>>
>>
>> More specifically, we expected an early stop upon getting enough results.
>> Thanks so much.
>>
>> Best,
>> Liz
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>
>
>

Re: LIMIT issue of SparkSQL

Posted by Asher Krim <ak...@hubspot.com>.
We have also found LIMIT to take an unacceptable amount of time when
reading Parquet-formatted data from S3.
LIMIT was not strictly needed for our use case, so we worked around it.

-- 
Asher Krim
Senior Software Engineer

On Fri, Oct 28, 2016 at 5:36 AM, Liz Bai <li...@icloud.com> wrote:

> Sorry for the late reply.
> The size of the raw data is 20 GB and it is composed of two columns. We
> generated it with this generator:
> <https://github.com/lizbai0821/DataGenerator/blob/master/src/main/scala/Gen.scala>
> The test queries are very simple:
> 1) select ColA from Table limit 1
> 2) select ColA from Table
> 3) select ColA from Table where ColB=0
> 4) select ColA from Table where ColB=0 limit 1
> We found that if we use `result.collect()`, it stops early upon getting
> enough results for queries 1) and 4).
> However, when we run `result.write.parquet`, there is no early stop and it
> scans much more data than `result.collect()`.
>
> Below is the detailed testing summary:
>
> Query                                        Method of saving results  Run time
> select ColA from Table limit 1               result.write.parquet      1m 56s
> select ColA from Table                       result.write.parquet      1m 40s
> select ColA from Table where ColB=0 limit 1  result.write.parquet      1m 32s
> select ColA from Table where ColB=0          result.write.parquet      1m 21s
> select ColA from Table limit 1               result.collect()          18s
> select ColA from Table where ColB=0 limit 1  result.collect()          18s
>
> Thanks.
>
> Best,
> Liz
>
> On 27 Oct 2016, at 2:16 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> That is surprising then, you may have found a bug.  What timings are you
> seeing?  Can you reproduce it with data you can share? I'd open a JIRA if
> so.
>
> On Tue, Oct 25, 2016 at 4:32 AM, Liz Bai <li...@icloud.com> wrote:
>
>> We used Parquet as the data source. The query is “select ColA from table
>> limit 1”. Attached is its query plan. (However, its run time is just the
>> same as “select ColA from table”.)
>> We expected an early stop upon getting 1 result, rather than scanning all
>> records and only applying the limit in the final collect phase.
>> Btw, I agree with Mich’s concern. `Limit push down` is impossible when
>> table joins are involved, but cases such as “Filter + Projection + Limit”
>> will benefit from it.
>> May I know if there are any detailed solutions for this?
>>
>> Thanks so much.
>>
>> Best,
>> Liz
>>
>> <queryplan.png>
>>
>> On 25 Oct 2016, at 5:54 AM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>> It is not about limits on specific tables.  We do support that.  The case
>> I'm describing involves pushing limits across system boundaries.  It is
>> certainly possible to do this, but the current datasource API does not provide
>> this information (other than the implicit limit that is pushed down to the
>> consumed iterator of the data source).
>>
>> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> This is an interesting point.
>>>
>>> As far as I know, in any database (practically all RDBMSs: Oracle, SAP,
>>> etc.), LIMIT affects the collection part of the result set.
>>>
>>> The result set is computed fully by the query, which may involve
>>> multiple joins on multiple underlying tables.
>>>
>>> Limiting the actual query by applying LIMIT to each underlying table does
>>> not make sense and would not be industry standard AFAIK.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 October 2016 at 06:48, Michael Armbrust <mi...@databricks.com>
>>> wrote:
>>>
>>>> - dev + user
>>>>
>>>> Can you give more info about the query?  Maybe a full explain()?  Are
>>>> you using a datasource like JDBC?  The API does not currently push down
>>>> limits, but the documentation talks about how you can use a query instead
>>>> of a table if that is what you are looking to do.
>>>>
>>>> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <li...@icloud.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Let me clarify the problem:
>>>>>
>>>>> Suppose we have a simple table `A` with 100 000 000 records
>>>>>
>>>>> Problem:
>>>>> When we execute the SQL query `select * from A limit 500`,
>>>>> it scans through all 100 000 000 records.
>>>>> Normal behaviour should be that the engine stops scanning once 500
>>>>> records are found.
>>>>>
>>>>> Detailed observation:
>>>>> We found that there are “GlobalLimit / LocalLimit” physical operators
>>>>> https://github.com/apache/spark/blob/branch-2.0/sql/core/src
>>>>> /main/scala/org/apache/spark/sql/execution/limit.scala
>>>>> But during query plan generation, GlobalLimit / LocalLimit is not
>>>>> applied to the query plan.
>>>>>
>>>>> Could you please help us to inspect LIMIT problem?
>>>>> Thanks.
>>>>>
>>>>> Best,
>>>>> Liz
>>>>>
>>>>> On 23 Oct 2016, at 10:11 PM, Xiao Li <ga...@gmail.com> wrote:
>>>>>
>>>>> Hi, Liz,
>>>>>
>>>>> CollectLimit means `Take the first `limit` elements and collect them
>>>>> to a single partition.`
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Xiao
>>>>>
>>>>> 2016-10-23 5:21 GMT-07:00 Ran Bai <li...@icloud.com>:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I found the runtime for a query with or without the “LIMIT” keyword is
>>>>>> the same. We looked into it and found that there are “GlobalLimit /
>>>>>> LocalLimit” nodes in the logical plan, but no corresponding operators in
>>>>>> the physical plan. Is this a bug or something else? Attached are the
>>>>>> logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>>>>>>
>>>>>>
>>>>>> More specifically, we expected an early stop upon getting enough
>>>>>> results.
>>>>>> Thanks so much.
>>>>>>
>>>>>> Best,
>>>>>> Liz
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Re: LIMIT issue of SparkSQL

Posted by Liz Bai <li...@icloud.com>.
Sorry for the late reply.
The size of the raw data is 20 GB and it is composed of two columns. We generated it with this generator: <https://github.com/lizbai0821/DataGenerator/blob/master/src/main/scala/Gen.scala>.
The test queries are very simple:
1) select ColA from Table limit 1
2) select ColA from Table
3) select ColA from Table where ColB=0
4) select ColA from Table where ColB=0 limit 1
We found that if we use `result.collect()`, it stops early upon getting enough results for queries 1) and 4).
However, when we run `result.write.parquet`, there is no early stop and it scans much more data than `result.collect()`.
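For reference, a minimal sketch of how the two variants were run (the SparkSession, the registered view name `Table`, and the output path below are assumptions, not our exact test harness):

val result = spark.sql("select ColA from Table limit 1")

// Early stop observed: collect() only scans enough partitions to return 1 row.
val rows = result.collect()

// No early stop observed: the write path scans far more data before finishing.
result.write.parquet("/tmp/limit_test_output")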

Below is the detailed testing summary:

Query                                        Method of saving results  Run time
select ColA from Table limit 1               result.write.parquet      1m 56s
select ColA from Table                       result.write.parquet      1m 40s
select ColA from Table where ColB=0 limit 1  result.write.parquet      1m 32s
select ColA from Table where ColB=0          result.write.parquet      1m 21s
select ColA from Table limit 1               result.collect()          18s
select ColA from Table where ColB=0 limit 1  result.collect()          18s

Thanks.

Best,
Liz
> On 27 Oct 2016, at 2:16 AM, Michael Armbrust <mi...@databricks.com> wrote:
> 
> That is surprising then, you may have found a bug.  What timings are you seeing?  Can you reproduce it with data you can share? I'd open a JIRA if so.
> 
> On Tue, Oct 25, 2016 at 4:32 AM, Liz Bai <lizbai@icloud.com <ma...@icloud.com>> wrote:
> We used Parquet as the data source. The query is “select ColA from table limit 1”. Attached is its query plan. (However, its run time is just the same as “select ColA from table”.)
> We expected an early stop upon getting 1 result, rather than scanning all records and only applying the limit in the final collect phase.
> Btw, I agree with Mich’s concern. `Limit push down` is impossible when table joins are involved, but cases such as “Filter + Projection + Limit” will benefit from it.
> May I know if there are any detailed solutions for this?
> 
> Thanks so much.
> 
> Best,
> Liz
> 
> <queryplan.png>
>> On 25 Oct 2016, at 5:54 AM, Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
>> 
> It is not about limits on specific tables.  We do support that.  The case I'm describing involves pushing limits across system boundaries.  It is certainly possible to do this, but the current datasource API does not provide this information (other than the implicit limit that is pushed down to the consumed iterator of the data source).
>> 
>> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> This is an interesting point.
>> 
>> As far as I know, in any database (practically all RDBMSs: Oracle, SAP, etc.), LIMIT affects the collection part of the result set.
>> 
>> The result set is computed fully by the query, which may involve multiple joins on multiple underlying tables.
>> 
>> Limiting the actual query by applying LIMIT to each underlying table does not make sense and would not be industry standard AFAIK.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>  
>> 
>> On 24 October 2016 at 06:48, Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
>> - dev + user
>> 
>> Can you give more info about the query?  Maybe a full explain()?  Are you using a datasource like JDBC?  The API does not currently push down limits, but the documentation talks about how you can use a query instead of a table if that is what you are looking to do.
>> 
>> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <lizbai@icloud.com <ma...@icloud.com>> wrote:
>> Hi all,
>> 
>> Let me clarify the problem: 
>> 
>> Suppose we have a simple table `A` with 100 000 000 records
>> 
>> Problem:
>> When we execute the SQL query `select * from A limit 500`,
>> it scans through all 100 000 000 records.
>> Normal behaviour should be that the engine stops scanning once 500 records are found.
>> 
>> Detailed observation:
>> We found that there are “GlobalLimit / LocalLimit” physical operators
>> https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala <https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala>
>> But during query plan generation, GlobalLimit / LocalLimit is not applied to the query plan.
>> 
>> Could you please help us to inspect LIMIT problem? 
>> Thanks.
>> 
>> Best,
>> Liz
>>> On 23 Oct 2016, at 10:11 PM, Xiao Li <gatorsmile@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Hi, Liz,
>>> 
>>> CollectLimit means `Take the first `limit` elements and collect them to a single partition.`
>>> 
>>> Thanks,
>>> 
>>> Xiao 
>>> 
>>> 2016-10-23 5:21 GMT-07:00 Ran Bai <lizbai@icloud.com <ma...@icloud.com>>:
>>> Hi all,
>>> 
>>> I found the runtime for a query with or without the “LIMIT” keyword is the same. We looked into it and found that there are “GlobalLimit / LocalLimit” nodes in the logical plan, but no corresponding operators in the physical plan. Is this a bug or something else? Attached are the logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>>> 
>>> 
>>> More specifically, we expected an early stop upon getting enough results.
>>> Thanks so much.
>>> 
>>> Best,
>>> Liz
>>> 
>>> 
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>> 
>> 
>> 
>> 
>> 
> 
> 


Re: LIMIT issue of SparkSQL

Posted by Michael Armbrust <mi...@databricks.com>.
That is surprising then; you may have found a bug.  What timings are you
seeing?  Can you reproduce it with data you can share? I'd open a JIRA if
so.

On Tue, Oct 25, 2016 at 4:32 AM, Liz Bai <li...@icloud.com> wrote:

> We used Parquet as the data source. The query is “select ColA from table
> limit 1”. Attached is its query plan. (However, its run time is just the
> same as “select ColA from table”.)
> We expected an early stop upon getting 1 result, rather than scanning all
> records and only applying the limit in the final collect phase.
> Btw, I agree with Mich’s concern. `Limit push down` is impossible when
> table joins are involved, but cases such as “Filter + Projection + Limit”
> will benefit from it.
> May I know if there are any detailed solutions for this?
>
> Thanks so much.
>
> Best,
> Liz
>
> On 25 Oct 2016, at 5:54 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> It is not about limits on specific tables.  We do support that.  The case
> I'm describing involves pushing limits across system boundaries.  It is
> certainly possible to do this, but the current datasource API does not provide
> this information (other than the implicit limit that is pushed down to the
> consumed iterator of the data source).
>
> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> This is an interesting point.
>>
>> As far as I know, in any database (practically all RDBMSs: Oracle, SAP,
>> etc.), LIMIT affects the collection part of the result set.
>>
>> The result set is computed fully by the query, which may involve
>> multiple joins on multiple underlying tables.
>>
>> Limiting the actual query by applying LIMIT to each underlying table does
>> not make sense and would not be industry standard AFAIK.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 October 2016 at 06:48, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> - dev + user
>>>
>>> Can you give more info about the query?  Maybe a full explain()?  Are
>>> you using a datasource like JDBC?  The API does not currently push down
>>> limits, but the documentation talks about how you can use a query instead
>>> of a table if that is what you are looking to do.
>>>
>>> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <li...@icloud.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Let me clarify the problem:
>>>>
>>>> Suppose we have a simple table `A` with 100 000 000 records
>>>>
>>>> Problem:
>>>> When we execute the SQL query `select * from A limit 500`,
>>>> it scans through all 100 000 000 records.
>>>> Normal behaviour should be that the engine stops scanning once 500
>>>> records are found.
>>>>
>>>> Detailed observation:
>>>> We found that there are “GlobalLimit / LocalLimit” physical operators
>>>> https://github.com/apache/spark/blob/branch-2.0/sql/core/src
>>>> /main/scala/org/apache/spark/sql/execution/limit.scala
>>>> But during query plan generation, GlobalLimit / LocalLimit is not
>>>> applied to the query plan.
>>>>
>>>> Could you please help us to inspect LIMIT problem?
>>>> Thanks.
>>>>
>>>> Best,
>>>> Liz
>>>>
>>>> On 23 Oct 2016, at 10:11 PM, Xiao Li <ga...@gmail.com> wrote:
>>>>
>>>> Hi, Liz,
>>>>
>>>> CollectLimit means `Take the first `limit` elements and collect them to
>>>> a single partition.`
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>> 2016-10-23 5:21 GMT-07:00 Ran Bai <li...@icloud.com>:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I found the runtime for a query with or without the “LIMIT” keyword is
>>>>> the same. We looked into it and found that there are “GlobalLimit /
>>>>> LocalLimit” nodes in the logical plan, but no corresponding operators in
>>>>> the physical plan. Is this a bug or something else? Attached are the
>>>>> logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>>>>>
>>>>>
>>>>> More specifically, we expected an early stop upon getting enough
>>>>> results.
>>>>> Thanks so much.
>>>>>
>>>>> Best,
>>>>> Liz
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>

Re: LIMIT issue of SparkSQL

Posted by Liz Bai <li...@icloud.com>.
We used Parquet as the data source. The query is “select ColA from table limit 1”. Attached is its query plan. (However, its run time is just the same as “select ColA from table”.)
We expected an early stop upon getting 1 result, rather than scanning all records and only applying the limit in the final collect phase.
Btw, I agree with Mich’s concern. `Limit push down` is impossible when table joins are involved, but cases such as “Filter + Projection + Limit” will benefit from it.
May I know if there are any detailed solutions for this?

Thanks so much.

Best,
Liz


> On 25 Oct 2016, at 5:54 AM, Michael Armbrust <mi...@databricks.com> wrote:
> 
> It is not about limits on specific tables.  We do support that.  The case I'm describing involves pushing limits across system boundaries.  It is certainly possible to do this, but the current datasource API does not provide this information (other than the implicit limit that is pushed down to the consumed iterator of the data source).
> 
> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> This is an interesting point.
> 
> As far as I know, in any database (practically all RDBMSs: Oracle, SAP, etc.), LIMIT affects the collection part of the result set.
> 
> The result set is computed fully by the query, which may involve multiple joins on multiple underlying tables.
> 
> Limiting the actual query by applying LIMIT to each underlying table does not make sense and would not be industry standard AFAIK.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 24 October 2016 at 06:48, Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
> - dev + user
> 
> Can you give more info about the query?  Maybe a full explain()?  Are you using a datasource like JDBC?  The API does not currently push down limits, but the documentation talks about how you can use a query instead of a table if that is what you are looking to do.
> 
> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <lizbai@icloud.com <ma...@icloud.com>> wrote:
> Hi all,
> 
> Let me clarify the problem: 
> 
> Suppose we have a simple table `A` with 100 000 000 records
> 
> Problem:
> When we execute the SQL query `select * from A limit 500`,
> it scans through all 100 000 000 records.
> Normal behaviour should be that the engine stops scanning once 500 records are found.
> 
> Detailed observation:
> We found that there are “GlobalLimit / LocalLimit” physical operators
> https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala <https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala>
> But during query plan generation, GlobalLimit / LocalLimit is not applied to the query plan.
> 
> Could you please help us to inspect LIMIT problem? 
> Thanks.
> 
> Best,
> Liz
>> On 23 Oct 2016, at 10:11 PM, Xiao Li <gatorsmile@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi, Liz,
>> 
>> CollectLimit means `Take the first `limit` elements and collect them to a single partition.`
>> 
>> Thanks,
>> 
>> Xiao 
>> 
>> 2016-10-23 5:21 GMT-07:00 Ran Bai <lizbai@icloud.com <ma...@icloud.com>>:
>> Hi all,
>> 
>> I found the runtime for a query with or without the “LIMIT” keyword is the same. We looked into it and found that there are “GlobalLimit / LocalLimit” nodes in the logical plan, but no corresponding operators in the physical plan. Is this a bug or something else? Attached are the logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>> 
>> 
>> More specifically, we expected an early stop upon getting enough results.
>> Thanks so much.
>> 
>> Best,
>> Liz
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>> 
> 
> 
> 
> 


Re: LIMIT issue of SparkSQL

Posted by Michael Armbrust <mi...@databricks.com>.
It is not about limits on specific tables.  We do support that.  The case
I'm describing involves pushing limits across system boundaries.  It is
certainly possible to do this, but the current datasource API does not provide
this information (other than the implicit limit that is pushed down to the
consumed iterator of the data source).
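To make that parenthetical concrete, a conceptual sketch only (not the actual Spark code path): the "implicit" limit amounts to the consumer taking a bounded number of rows from each partition's iterator, so the source itself never learns that only a few rows are needed.

// Conceptual sketch: a local limit applied on the consumer side.
def localLimit[T](rows: Iterator[T], limit: Int): Iterator[T] =
  rows.take(limit)   // stops pulling rows once `limit` rows have been produced

// Because the limit lives outside the data source, the source cannot skip
// opening or decoding its input (files, S3 objects, JDBC result sets) on its own.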

On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> This is an interesting point.
>
> As far as I know, in any database (practically all RDBMSs: Oracle, SAP, etc.),
> LIMIT affects the collection part of the result set.
>
> The result set is computed fully by the query, which may involve multiple
> joins on multiple underlying tables.
>
> Limiting the actual query by applying LIMIT to each underlying table does not
> make sense and would not be industry standard AFAIK.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 October 2016 at 06:48, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> - dev + user
>>
>> Can you give more info about the query?  Maybe a full explain()?  Are you
>> using a datasource like JDBC?  The API does not currently push down limits,
>> but the documentation talks about how you can use a query instead of a
>> table if that is what you are looking to do.
>>
>> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <li...@icloud.com> wrote:
>>
>>> Hi all,
>>>
>>> Let me clarify the problem:
>>>
>>> Suppose we have a simple table `A` with 100 000 000 records
>>>
>>> Problem:
>>> When we execute the SQL query `select * from A limit 500`,
>>> it scans through all 100 000 000 records.
>>> Normal behaviour should be that the engine stops scanning once 500
>>> records are found.
>>>
>>> Detailed observation:
>>> We found that there are “GlobalLimit / LocalLimit” physical operators
>>> https://github.com/apache/spark/blob/branch-2.0/sql/core/src
>>> /main/scala/org/apache/spark/sql/execution/limit.scala
>>> But during query plan generation, GlobalLimit / LocalLimit is not
>>> applied to the query plan.
>>>
>>> Could you please help us to inspect LIMIT problem?
>>> Thanks.
>>>
>>> Best,
>>> Liz
>>>
>>> On 23 Oct 2016, at 10:11 PM, Xiao Li <ga...@gmail.com> wrote:
>>>
>>> Hi, Liz,
>>>
>>> CollectLimit means `Take the first `limit` elements and collect them to
>>> a single partition.`
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> 2016-10-23 5:21 GMT-07:00 Ran Bai <li...@icloud.com>:
>>>
>>>> Hi all,
>>>>
>>>> I found the runtime for a query with or without the “LIMIT” keyword is
>>>> the same. We looked into it and found that there are “GlobalLimit /
>>>> LocalLimit” nodes in the logical plan, but no corresponding operators in
>>>> the physical plan. Is this a bug or something else? Attached are the
>>>> logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>>>>
>>>>
>>>> More specifically, we expected an early stop upon getting enough
>>>> results.
>>>> Thanks so much.
>>>>
>>>> Best,
>>>> Liz
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>
>>>
>>>
>>
>

Re: LIMIT issue of SparkSQL

Posted by Mich Talebzadeh <mi...@gmail.com>.
This is an interesting point.

As far as I know, in any database (practically all RDBMSs: Oracle, SAP,
etc.), LIMIT affects the collection part of the result set.

The result set is computed fully by the query, which may involve multiple
joins on multiple underlying tables.

Limiting the actual query by applying LIMIT to each underlying table does
not make sense and would not be industry standard AFAIK.

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 October 2016 at 06:48, Michael Armbrust <mi...@databricks.com>
wrote:

> - dev + user
>
> Can you give more info about the query?  Maybe a full explain()?  Are you
> using a datasource like JDBC?  The API does not currently push down limits,
> but the documentation talks about how you can use a query instead of a
> table if that is what you are looking to do.
>
> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <li...@icloud.com> wrote:
>
>> Hi all,
>>
>> Let me clarify the problem:
>>
>> Suppose we have a simple table `A` with 100 000 000 records
>>
>> Problem:
>> When we execute the SQL query `select * from A limit 500`,
>> it scans through all 100 000 000 records.
>> Normal behaviour should be that the engine stops scanning once 500
>> records are found.
>>
>> Detailed observation:
>> We found that there are “GlobalLimit / LocalLimit” physical operators
>> https://github.com/apache/spark/blob/branch-2.0/sql/core/
>> src/main/scala/org/apache/spark/sql/execution/limit.scala
>> But during query plan generation, GlobalLimit / LocalLimit is not applied
>> to the query plan.
>>
>> Could you please help us to inspect LIMIT problem?
>> Thanks.
>>
>> Best,
>> Liz
>>
>> On 23 Oct 2016, at 10:11 PM, Xiao Li <ga...@gmail.com> wrote:
>>
>> Hi, Liz,
>>
>> CollectLimit means `Take the first `limit` elements and collect them to a
>> single partition.`
>>
>> Thanks,
>>
>> Xiao
>>
>> 2016-10-23 5:21 GMT-07:00 Ran Bai <li...@icloud.com>:
>>
>>> Hi all,
>>>
>>> I found the runtime for a query with or without the “LIMIT” keyword is
>>> the same. We looked into it and found that there are “GlobalLimit /
>>> LocalLimit” nodes in the logical plan, but no corresponding operators in
>>> the physical plan. Is this a bug or something else? Attached are the
>>> logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>>>
>>>
>>> More specifically, we expected an early stop upon getting enough
>>> results.
>>> Thanks so much.
>>>
>>> Best,
>>> Liz
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>
>>
>>
>

Re: LIMIT issue of SparkSQL

Posted by Michael Armbrust <mi...@databricks.com>.
- dev + user

Can you give more info about the query?  Maybe a full explain()?  Are you
using a datasource like JDBC?  The API does not currently push down limits,
but the documentation talks about how you can use a query instead of a
table if that is what you are looking to do.
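For example, a sketch of that "query instead of a table" workaround for a JDBC source, so the limit is evaluated by the database itself; the URL, credentials, and subquery alias below are placeholders:

// Pass a parenthesized query (with the limit inside it) as the "dbtable" option.
val limited = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")    // placeholder URL
  .option("dbtable", "(select * from A limit 500) AS t")  // query instead of a table
  .option("user", "dbuser")                               // placeholder credentials
  .option("password", "dbpass")
  .load()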

On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <li...@icloud.com> wrote:

> Hi all,
>
> Let me clarify the problem:
>
> Suppose we have a simple table `A` with 100 000 000 records
>
> Problem:
> When we execute the SQL query `select * from A limit 500`,
> it scans through all 100 000 000 records.
> Normal behaviour should be that the engine stops scanning once 500
> records are found.
>
> Detailed observation:
> We found that there are “GlobalLimit / LocalLimit” physical operators
> https://github.com/apache/spark/blob/branch-2.0/sql/
> core/src/main/scala/org/apache/spark/sql/execution/limit.scala
> But during query plan generation, GlobalLimit / LocalLimit is not applied
> to the query plan.
>
> Could you please help us to inspect LIMIT problem?
> Thanks.
>
> Best,
> Liz
>
> On 23 Oct 2016, at 10:11 PM, Xiao Li <ga...@gmail.com> wrote:
>
> Hi, Liz,
>
> CollectLimit means `Take the first `limit` elements and collect them to a
> single partition.`
>
> Thanks,
>
> Xiao
>
> 2016-10-23 5:21 GMT-07:00 Ran Bai <li...@icloud.com>:
>
>> Hi all,
>>
>> I found the runtime for a query with or without the “LIMIT” keyword is the
>> same. We looked into it and found that there are “GlobalLimit /
>> LocalLimit” nodes in the logical plan, but no corresponding operators in
>> the physical plan. Is this a bug or something else? Attached are the
>> logical and physical plans when running "SELECT * FROM seq LIMIT 1".
>>
>>
>> More specifically, we expected an early stop upon getting enough results.
>> Thanks so much.
>>
>> Best,
>> Liz
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>
>
>

Re: LIMIT issue of SparkSQL

Posted by Liz Bai <li...@icloud.com>.
Hi all,

Let me clarify the problem: 

Suppose we have a simple table `A` with 100 000 000 records

Problem:
When we execute the SQL query `select * from A limit 500`,
it scans through all 100 000 000 records.
Normal behaviour should be that the engine stops scanning once 500 records are found.
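A minimal sketch of the scenario (the table contents below are a stand-in for our real data, assuming an existing SparkSession `spark`):

// Stand-in for the 100 000 000-record table `A`.
val A = spark.range(100000000L).toDF("id")
A.createOrReplaceTempView("A")

val limited = spark.sql("select * from A limit 500")
limited.explain()   // the physical plan should show a CollectLimit over the scan
limited.collect()   // expected: stop once enough partitions have yielded 500 rows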

Detailed observation:
We found that there are “GlobalLimit / LocalLimit” physical operators
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala <https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala>
But during query plan generation, GlobalLimit / LocalLimit is not applied to the query plan.

Could you please help us to inspect the LIMIT problem?
Thanks.

Best,
Liz
> On 23 Oct 2016, at 10:11 PM, Xiao Li <ga...@gmail.com> wrote:
> 
> Hi, Liz,
> 
> CollectLimit means `Take the first `limit` elements and collect them to a single partition.`
> 
> Thanks,
> 
> Xiao 
> 
> 2016-10-23 5:21 GMT-07:00 Ran Bai <lizbai@icloud.com <ma...@icloud.com>>:
> Hi all,
> 
> I found the runtime for a query with or without the “LIMIT” keyword is the same. We looked into it and found that there are “GlobalLimit / LocalLimit” nodes in the logical plan, but no corresponding operators in the physical plan. Is this a bug or something else? Attached are the logical and physical plans when running "SELECT * FROM seq LIMIT 1".
> 
> 
> More specifically, we expected an early stop upon getting enough results.
> Thanks so much.
> 
> Best,
> Liz
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> 


Re: LIMIT issue of SparkSQL

Posted by Xiao Li <ga...@gmail.com>.
Hi, Liz,

CollectLimit means `Take the first `limit` elements and collect them to a
single partition.`
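For instance, a small sketch (assuming an existing DataFrame `df`) showing where CollectLimit appears:

val limited = df.limit(1)
// The executed physical plan should contain a CollectLimit(1, ...) node.
println(limited.queryExecution.executedPlan)
// take()/collect() return the first element(s) to the driver as a single collection.
limited.take(1)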

Thanks,

Xiao

2016-10-23 5:21 GMT-07:00 Ran Bai <li...@icloud.com>:

> Hi all,
>
> I found the runtime for a query with or without the “LIMIT” keyword is the same.
> We looked into it and found that there are “GlobalLimit / LocalLimit” nodes in
> the logical plan, but no corresponding operators in the physical plan. Is this a
> bug or something else? Attached are the logical and physical plans when running
> "SELECT * FROM seq LIMIT 1".
>
>
> More specifically, we expected an early stop upon getting enough results.
> Thanks so much.
>
> Best,
> Liz
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>