You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ignite.apache.org by Alexander Paschenko <al...@gmail.com> on 2018/02/20 17:45:52 UTC

Re: TX SQL: SELECT FOR UPDATE implementation

Hello again, and here I go with some more thoughts on SELECT FOR UPDATE.
After having talking to Vlad, we returned again to question about how
SELECT FOR UPDATE should work with JOINs.

It seems that distributed joins currently are better to be left aside
as long as rework of distributed queries and joins is a separate big
issue.
However, probably we still should think about what we could make work
after all with what we have in our hands now. And it seems that some
scenarios are pretty doable - say, when there's an ordinary JOIN of
few tables, we can simply append keys of those tables to list of
selected columns, and that would work in case of collocated data.

* Alas, major vendors (Postgres, MySQL, Oracle) all also honor
subqueries in JOINs - say, like this:

select p.id , c.id from person p inner join (select * from company) c
on p.company_id = c.id where p.id > 3 and p.id < 10 for update;

Obviously, this is a bit corner case as such JOIN is not much
different from an ordinary multi-table one, but it gives overall idea
of what I'm talking about - major databases lock what's been selected
in subquery as well.

That said, aforementioned case probably could also be processed as
suggested above - that is, we take innermost query, append key to list
of its selected columns, so that we can refer to that new column from
outer query, etc, etc. In principle we could process even
multiple-level nesting of queries with such approach.
Also no major vendors, except probably MySQL, support GROUP BY with
SELECT FOR UPDATE, and this simplifies our task a bit. :)

If someone can think of an example of a query that cannot be processed
in the way described above, please write about it.

* Also it looks like no major vendors out of three mentioned above
lock record from tables mentioned in subqueries in SELECT columns or
in WHERE - say, this query

select p.id, (select count(*) from company) from person p

will lock only records in Person table.

Thoughts? Should we attempt to implement JOINs support right away? Or
would it be better to leave it be for (near) future and start with
simple implementation that would ban JOINs altogether?

Regards,
Alex


2018-01-25 10:58 GMT+03:00 Vladimir Ozerov <vo...@gridgain.com>:
> Alex,
>
> Thank you for detailed analysis. My 50 cents:
>
> 1) I would not allow GROUP BYs in the first place. These are good
> candidates for further iterations IMO
> 2) Same as p.1
> 3) In final TX SQL solution we woll lock rows, not keys
> (GridCacheMapEntry). Can we simply lock every row returned from the query?
> 4) Same as p.1
> 5) Yes, it should work the same way we would implement it for normal queries
>
> Also I am not quite understand why should we keep results on map node until
> all keys are locked. This increases memory pressure on the server. Instead,
> I would start sending batches to reducer immediately, but do not return the
> very first result to the user until all results are collected. This way
> pressure is moved to the client what increases cluster stability. Will that
> work?
>
> On Wed, Jan 24, 2018 at 6:23 PM, Alexander Paschenko <
> alexander.a.paschenko@gmail.com> wrote:
>
>> Hello Igniters,
>>
>> I'd like to bring up the discussion about implementation details of
>> https://issues.apache.org/jira/browse/IGNITE-6937 about
>> support of SELECT FOR UPDATE statements as a part of overall activity
>> on transactional SQL.
>>
>> That kind of statements allows the user proactively obtain row level
>> locks for the records that they may (or may not) want to
>> update in further statements of the same transaction.
>>
>> Suggested general approach to implementation is as follows:
>>
>> - Perform two-step SELECT as usual (e.g. split, map, reduce stages), BUT
>> - Map nodes start giving away their first result pages for such query
>> only after they have traversed whole their result set
>> and have obtained locks for all keys corresponding to selected rows
>> (e.g. rows matching WHERE clause).
>>
>> I've made some research on how some RDBMSs behave (MySQL, Oracle, Pg)
>> and found that they all also consider SELECT FOR UPDATE
>> finished and give away its result set only after all locks have been
>> acquired, so in general suggested strategy is similar to
>> what other databases do. Alas, everyone concerned is welcome to share
>> their views on how this feature could work in Ignite - as well as on
>> some stuff stated below. These are some caveats that I'd like to
>> discuss in addition to overall approach.
>>
>> First: should we allow GROUP BY clause in SELECT FOR UPDATE? Oracle
>> and Pg do not support it in any way, while MySQL does.
>>
>> Second: should we allow JOIN clause in SELECT FOR UPDATE? All
>> mentioned major vendors support it, but with Ignite implementation
>> likely
>> will not be straightforward - this has to do with pt. 3.
>>
>> Third: in general case, how do we detect *keys* that we should lock?
>> Consider following statement: SELECT name FROM person FOR UPDATE.
>> This query does not mention table key at all. Moreover, if we have
>> JOIN clause in the query, it is concerned with more than one table
>> (=cache), thus making it even harder to detect which keys we should
>> lock. Suggested solution is as follows: on split stage,
>> append to MAP query key columns for all concerned tables (i.e. SELECT
>> a.name, b.surname from a join b will become
>> SELECT a.name, b.surname, a._key, b._key) - with that done, we'll
>> definitely be able to lock everything we need on MAP nodes.
>>
>> Fourth: should we allow LIMIT clause in SELECT FOR UPDATE? In general,
>> RDBMS vendors do not recommend combining LIMIT with
>> SELECT FOR UPDATE. First, this is because LIMIT results are subject to
>> ordering, and in this case it's harder to predict what records
>> will be locked after all. Second, some vendors state that they lock
>> not just what LIMIT yields, but also *everything before it* - say,
>> if your SELECT returns 100 records and you apply LIMIT 50,60, then all
>> first 60 records will be locked even if you don't need first 50.
>> Moreover, in case of Ignite, any LIMIT will make query need merge
>> table, all data will have to be copied to reduce node first,
>> yadda yadda yadda, and we won't be able to actually lock anything
>> until all data is on reduce node. See pt. 5 for more on this.
>>
>> Fifth: in fact, above question stands for all cases when we need merge
>> table: how things should work when we have to pull all data to
>> reduce node prior to actually filtering anything? Should we aim to
>> support such cases in SELECT FOR UPDATE?
>>
>> Regards,
>> Alex
>>

Re: TX SQL: SELECT FOR UPDATE implementation

Posted by Vladimir Ozerov <vo...@gridgain.com>.
Agree. SELECT FOR UPDATE appears to be a command with the most loosly
defined locking semantics in SQL world. So I would start with something
safe, simple and straightforward. Let's add support for simple statements
(no joins, not subqueries) first. Then add support for co-located complex
queries. And at last add support for distributed joins*.

* Though, I think it doesn't make sense with current implementation of
distributed joins, as it is pretty limited and require significant rethink
from architectural perspective.

Vladimir.

On Thu, Feb 22, 2018 at 5:00 AM, Dmitriy Setrakyan <ds...@apache.org>
wrote:

> Alexander, thanks for the detailed explanation. I would start simple,
> without JOINs. However, we should throw proper exceptions if unsupported
> SQL is used.
>
> D.
>
> On Tue, Feb 20, 2018 at 9:45 AM, Alexander Paschenko <
> alexander.a.paschenko@gmail.com> wrote:
>
> > Hello again, and here I go with some more thoughts on SELECT FOR UPDATE.
> > After having talking to Vlad, we returned again to question about how
> > SELECT FOR UPDATE should work with JOINs.
> >
> > It seems that distributed joins currently are better to be left aside
> > as long as rework of distributed queries and joins is a separate big
> > issue.
> > However, probably we still should think about what we could make work
> > after all with what we have in our hands now. And it seems that some
> > scenarios are pretty doable - say, when there's an ordinary JOIN of
> > few tables, we can simply append keys of those tables to list of
> > selected columns, and that would work in case of collocated data.
> >
> > * Alas, major vendors (Postgres, MySQL, Oracle) all also honor
> > subqueries in JOINs - say, like this:
> >
> > select p.id , c.id from person p inner join (select * from company) c
> > on p.company_id = c.id where p.id > 3 and p.id < 10 for update;
> >
> > Obviously, this is a bit corner case as such JOIN is not much
> > different from an ordinary multi-table one, but it gives overall idea
> > of what I'm talking about - major databases lock what's been selected
> > in subquery as well.
> >
> > That said, aforementioned case probably could also be processed as
> > suggested above - that is, we take innermost query, append key to list
> > of its selected columns, so that we can refer to that new column from
> > outer query, etc, etc. In principle we could process even
> > multiple-level nesting of queries with such approach.
> > Also no major vendors, except probably MySQL, support GROUP BY with
> > SELECT FOR UPDATE, and this simplifies our task a bit. :)
> >
> > If someone can think of an example of a query that cannot be processed
> > in the way described above, please write about it.
> >
> > * Also it looks like no major vendors out of three mentioned above
> > lock record from tables mentioned in subqueries in SELECT columns or
> > in WHERE - say, this query
> >
> > select p.id, (select count(*) from company) from person p
> >
> > will lock only records in Person table.
> >
> > Thoughts? Should we attempt to implement JOINs support right away? Or
> > would it be better to leave it be for (near) future and start with
> > simple implementation that would ban JOINs altogether?
> >
> > Regards,
> > Alex
> >
> >
> > 2018-01-25 10:58 GMT+03:00 Vladimir Ozerov <vo...@gridgain.com>:
> > > Alex,
> > >
> > > Thank you for detailed analysis. My 50 cents:
> > >
> > > 1) I would not allow GROUP BYs in the first place. These are good
> > > candidates for further iterations IMO
> > > 2) Same as p.1
> > > 3) In final TX SQL solution we woll lock rows, not keys
> > > (GridCacheMapEntry). Can we simply lock every row returned from the
> > query?
> > > 4) Same as p.1
> > > 5) Yes, it should work the same way we would implement it for normal
> > queries
> > >
> > > Also I am not quite understand why should we keep results on map node
> > until
> > > all keys are locked. This increases memory pressure on the server.
> > Instead,
> > > I would start sending batches to reducer immediately, but do not return
> > the
> > > very first result to the user until all results are collected. This way
> > > pressure is moved to the client what increases cluster stability. Will
> > that
> > > work?
> > >
> > > On Wed, Jan 24, 2018 at 6:23 PM, Alexander Paschenko <
> > > alexander.a.paschenko@gmail.com> wrote:
> > >
> > >> Hello Igniters,
> > >>
> > >> I'd like to bring up the discussion about implementation details of
> > >> https://issues.apache.org/jira/browse/IGNITE-6937 about
> > >> support of SELECT FOR UPDATE statements as a part of overall activity
> > >> on transactional SQL.
> > >>
> > >> That kind of statements allows the user proactively obtain row level
> > >> locks for the records that they may (or may not) want to
> > >> update in further statements of the same transaction.
> > >>
> > >> Suggested general approach to implementation is as follows:
> > >>
> > >> - Perform two-step SELECT as usual (e.g. split, map, reduce stages),
> BUT
> > >> - Map nodes start giving away their first result pages for such query
> > >> only after they have traversed whole their result set
> > >> and have obtained locks for all keys corresponding to selected rows
> > >> (e.g. rows matching WHERE clause).
> > >>
> > >> I've made some research on how some RDBMSs behave (MySQL, Oracle, Pg)
> > >> and found that they all also consider SELECT FOR UPDATE
> > >> finished and give away its result set only after all locks have been
> > >> acquired, so in general suggested strategy is similar to
> > >> what other databases do. Alas, everyone concerned is welcome to share
> > >> their views on how this feature could work in Ignite - as well as on
> > >> some stuff stated below. These are some caveats that I'd like to
> > >> discuss in addition to overall approach.
> > >>
> > >> First: should we allow GROUP BY clause in SELECT FOR UPDATE? Oracle
> > >> and Pg do not support it in any way, while MySQL does.
> > >>
> > >> Second: should we allow JOIN clause in SELECT FOR UPDATE? All
> > >> mentioned major vendors support it, but with Ignite implementation
> > >> likely
> > >> will not be straightforward - this has to do with pt. 3.
> > >>
> > >> Third: in general case, how do we detect *keys* that we should lock?
> > >> Consider following statement: SELECT name FROM person FOR UPDATE.
> > >> This query does not mention table key at all. Moreover, if we have
> > >> JOIN clause in the query, it is concerned with more than one table
> > >> (=cache), thus making it even harder to detect which keys we should
> > >> lock. Suggested solution is as follows: on split stage,
> > >> append to MAP query key columns for all concerned tables (i.e. SELECT
> > >> a.name, b.surname from a join b will become
> > >> SELECT a.name, b.surname, a._key, b._key) - with that done, we'll
> > >> definitely be able to lock everything we need on MAP nodes.
> > >>
> > >> Fourth: should we allow LIMIT clause in SELECT FOR UPDATE? In general,
> > >> RDBMS vendors do not recommend combining LIMIT with
> > >> SELECT FOR UPDATE. First, this is because LIMIT results are subject to
> > >> ordering, and in this case it's harder to predict what records
> > >> will be locked after all. Second, some vendors state that they lock
> > >> not just what LIMIT yields, but also *everything before it* - say,
> > >> if your SELECT returns 100 records and you apply LIMIT 50,60, then all
> > >> first 60 records will be locked even if you don't need first 50.
> > >> Moreover, in case of Ignite, any LIMIT will make query need merge
> > >> table, all data will have to be copied to reduce node first,
> > >> yadda yadda yadda, and we won't be able to actually lock anything
> > >> until all data is on reduce node. See pt. 5 for more on this.
> > >>
> > >> Fifth: in fact, above question stands for all cases when we need merge
> > >> table: how things should work when we have to pull all data to
> > >> reduce node prior to actually filtering anything? Should we aim to
> > >> support such cases in SELECT FOR UPDATE?
> > >>
> > >> Regards,
> > >> Alex
> > >>
> >
>

Re: TX SQL: SELECT FOR UPDATE implementation

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Alexander, thanks for the detailed explanation. I would start simple,
without JOINs. However, we should throw proper exceptions if unsupported
SQL is used.

D.

On Tue, Feb 20, 2018 at 9:45 AM, Alexander Paschenko <
alexander.a.paschenko@gmail.com> wrote:

> Hello again, and here I go with some more thoughts on SELECT FOR UPDATE.
> After having talking to Vlad, we returned again to question about how
> SELECT FOR UPDATE should work with JOINs.
>
> It seems that distributed joins currently are better to be left aside
> as long as rework of distributed queries and joins is a separate big
> issue.
> However, probably we still should think about what we could make work
> after all with what we have in our hands now. And it seems that some
> scenarios are pretty doable - say, when there's an ordinary JOIN of
> few tables, we can simply append keys of those tables to list of
> selected columns, and that would work in case of collocated data.
>
> * Alas, major vendors (Postgres, MySQL, Oracle) all also honor
> subqueries in JOINs - say, like this:
>
> select p.id , c.id from person p inner join (select * from company) c
> on p.company_id = c.id where p.id > 3 and p.id < 10 for update;
>
> Obviously, this is a bit corner case as such JOIN is not much
> different from an ordinary multi-table one, but it gives overall idea
> of what I'm talking about - major databases lock what's been selected
> in subquery as well.
>
> That said, aforementioned case probably could also be processed as
> suggested above - that is, we take innermost query, append key to list
> of its selected columns, so that we can refer to that new column from
> outer query, etc, etc. In principle we could process even
> multiple-level nesting of queries with such approach.
> Also no major vendors, except probably MySQL, support GROUP BY with
> SELECT FOR UPDATE, and this simplifies our task a bit. :)
>
> If someone can think of an example of a query that cannot be processed
> in the way described above, please write about it.
>
> * Also it looks like no major vendors out of three mentioned above
> lock record from tables mentioned in subqueries in SELECT columns or
> in WHERE - say, this query
>
> select p.id, (select count(*) from company) from person p
>
> will lock only records in Person table.
>
> Thoughts? Should we attempt to implement JOINs support right away? Or
> would it be better to leave it be for (near) future and start with
> simple implementation that would ban JOINs altogether?
>
> Regards,
> Alex
>
>
> 2018-01-25 10:58 GMT+03:00 Vladimir Ozerov <vo...@gridgain.com>:
> > Alex,
> >
> > Thank you for detailed analysis. My 50 cents:
> >
> > 1) I would not allow GROUP BYs in the first place. These are good
> > candidates for further iterations IMO
> > 2) Same as p.1
> > 3) In final TX SQL solution we woll lock rows, not keys
> > (GridCacheMapEntry). Can we simply lock every row returned from the
> query?
> > 4) Same as p.1
> > 5) Yes, it should work the same way we would implement it for normal
> queries
> >
> > Also I am not quite understand why should we keep results on map node
> until
> > all keys are locked. This increases memory pressure on the server.
> Instead,
> > I would start sending batches to reducer immediately, but do not return
> the
> > very first result to the user until all results are collected. This way
> > pressure is moved to the client what increases cluster stability. Will
> that
> > work?
> >
> > On Wed, Jan 24, 2018 at 6:23 PM, Alexander Paschenko <
> > alexander.a.paschenko@gmail.com> wrote:
> >
> >> Hello Igniters,
> >>
> >> I'd like to bring up the discussion about implementation details of
> >> https://issues.apache.org/jira/browse/IGNITE-6937 about
> >> support of SELECT FOR UPDATE statements as a part of overall activity
> >> on transactional SQL.
> >>
> >> That kind of statements allows the user proactively obtain row level
> >> locks for the records that they may (or may not) want to
> >> update in further statements of the same transaction.
> >>
> >> Suggested general approach to implementation is as follows:
> >>
> >> - Perform two-step SELECT as usual (e.g. split, map, reduce stages), BUT
> >> - Map nodes start giving away their first result pages for such query
> >> only after they have traversed whole their result set
> >> and have obtained locks for all keys corresponding to selected rows
> >> (e.g. rows matching WHERE clause).
> >>
> >> I've made some research on how some RDBMSs behave (MySQL, Oracle, Pg)
> >> and found that they all also consider SELECT FOR UPDATE
> >> finished and give away its result set only after all locks have been
> >> acquired, so in general suggested strategy is similar to
> >> what other databases do. Alas, everyone concerned is welcome to share
> >> their views on how this feature could work in Ignite - as well as on
> >> some stuff stated below. These are some caveats that I'd like to
> >> discuss in addition to overall approach.
> >>
> >> First: should we allow GROUP BY clause in SELECT FOR UPDATE? Oracle
> >> and Pg do not support it in any way, while MySQL does.
> >>
> >> Second: should we allow JOIN clause in SELECT FOR UPDATE? All
> >> mentioned major vendors support it, but with Ignite implementation
> >> likely
> >> will not be straightforward - this has to do with pt. 3.
> >>
> >> Third: in general case, how do we detect *keys* that we should lock?
> >> Consider following statement: SELECT name FROM person FOR UPDATE.
> >> This query does not mention table key at all. Moreover, if we have
> >> JOIN clause in the query, it is concerned with more than one table
> >> (=cache), thus making it even harder to detect which keys we should
> >> lock. Suggested solution is as follows: on split stage,
> >> append to MAP query key columns for all concerned tables (i.e. SELECT
> >> a.name, b.surname from a join b will become
> >> SELECT a.name, b.surname, a._key, b._key) - with that done, we'll
> >> definitely be able to lock everything we need on MAP nodes.
> >>
> >> Fourth: should we allow LIMIT clause in SELECT FOR UPDATE? In general,
> >> RDBMS vendors do not recommend combining LIMIT with
> >> SELECT FOR UPDATE. First, this is because LIMIT results are subject to
> >> ordering, and in this case it's harder to predict what records
> >> will be locked after all. Second, some vendors state that they lock
> >> not just what LIMIT yields, but also *everything before it* - say,
> >> if your SELECT returns 100 records and you apply LIMIT 50,60, then all
> >> first 60 records will be locked even if you don't need first 50.
> >> Moreover, in case of Ignite, any LIMIT will make query need merge
> >> table, all data will have to be copied to reduce node first,
> >> yadda yadda yadda, and we won't be able to actually lock anything
> >> until all data is on reduce node. See pt. 5 for more on this.
> >>
> >> Fifth: in fact, above question stands for all cases when we need merge
> >> table: how things should work when we have to pull all data to
> >> reduce node prior to actually filtering anything? Should we aim to
> >> support such cases in SELECT FOR UPDATE?
> >>
> >> Regards,
> >> Alex
> >>
>