Posted to dev@hive.apache.org by Granville Barnett <gr...@gmail.com> on 2018/11/20 13:50:52 UTC

Why are TXN IDs not partitioned per database?

Hi,

I have been reading the source code of Hive 3.x and have a question regarding
the transaction IDs that form the span of a transaction: its begin ID (TXN ID)
and its commit ID (NEXT_TXN_ID at the time of commit).

Why is it that we have a global timeline for transactions rather than a
timeline partitioned at the granularity of a database, kind of similar to
how write IDs are partitioned per table but at the database scope?

E.g.,

NEXT_TXN_ID
+-------+-----------+
| DB    | NTXN_NEXT |
+-------+-----------+
| test1 | 23        |
| test2 | 4         |
+-------+-----------+

The same question could also be applied to NEXT_LOCK_ID.

I am just curious because it seems like partitioning the transaction (and
lock) IDs would allow finer-grained locking in the various transactional
methods. For example, openTxn invocations are mutexed with all other openTxn
invocations even if they are for transactions running in distinct databases.
Similarly, openTxn is mutexed with respect to commitTxn if there is a
write-write conflict, which I would have thought would only be the case if
both apply to the same database. I'm sure that this would have the side
effect of increasing the complexity of other subsystems, but I had to ask
what the rationale was behind this.

(I'm new to Hive, so please forgive me if the answer is obvious.)

Regards,

Granville
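
To make the proposal in the message above concrete, here is a minimal
in-memory sketch contrasting a single global ID counter with a hypothetical
per-database one. The class names (GlobalTxnIds, PerDatabaseTxnIds) and the
use of Python locks are assumptions for illustration only; this is an
analogy for the mutexing being discussed, not how Hive stores or locks
NEXT_TXN_ID.

```python
import threading


class GlobalTxnIds:
    """One counter and one lock: every open_txn call serializes on the same mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 1

    def open_txn(self):
        with self._lock:
            txn_id = self._next
            self._next += 1
            return txn_id


class PerDatabaseTxnIds:
    """Hypothetical variant: one counter and one lock per database name."""

    def __init__(self):
        # Guards only the creation of per-database state, not ID allocation.
        self._registry_lock = threading.Lock()
        self._locks = {}
        self._next = {}

    def open_txn(self, db):
        with self._registry_lock:
            lock = self._locks.setdefault(db, threading.Lock())
            self._next.setdefault(db, 1)
        # Callers for different databases hold different locks here.
        with lock:
            txn_id = self._next[db]
            self._next[db] += 1
            return txn_id


global_ids = GlobalTxnIds()
per_db_ids = PerDatabaseTxnIds()

# Global timeline: IDs are unique and totally ordered across all databases.
global_sequence = [global_ids.open_txn() for _ in range(3)]

# Per-database timelines: IDs are only meaningful together with the database name.
per_db_sequence = [per_db_ids.open_txn(db) for db in ("test1", "test2", "test1")]
```

In this sketch global_sequence is [1, 2, 3], while per_db_sequence is
[1, 1, 2]: the per-database variant reduces contention between databases,
but an ID on its own no longer identifies a transaction.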

Re: Why are TXN IDs not partitioned per database?

Posted by Granville Barnett <gr...@gmail.com>.

Thanks Alan.

On Tue, 20 Nov 2018, 17:23 Alan Gates <alanfgates@gmail.com> wrote:

> History.  Originally there were only transaction ids, which were global.
> Write ids for tables came later as a way to limit the amount of information
> each transaction needed to track and to make it easier to replicate table
> changes between Hive instances.
>
> But even if we had put them in from the start, we'd have them span
> databases, otherwise transactions couldn't span databases.  Hive has no
> restrictions on queries spanning databases so we wouldn't want to restrict
> transactions from doing so.
>
> Alan.

Re: Why are TXN IDs not partitioned per database?

Posted by Alan Gates <al...@gmail.com>.

History.  Originally there were only transaction ids, which were global.
Write ids for tables came later as a way to limit the amount of information
each transaction needed to track and to make it easier to replicate table
changes between Hive instances.

But even if we had put them in from the start, we'd have them span
databases, otherwise transactions couldn't span databases.  Hive has no
restrictions on queries spanning databases so we wouldn't want to restrict
transactions from doing so.

Alan.
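
One way to read the point about transactions spanning databases is sketched
below. It assumes, purely for illustration, that a per-database scheme would
hand a transaction one ID per database it touches; the helper names
(open_global, open_per_db, compare) are hypothetical and do not correspond
to Hive APIs. With such a scheme, two transactions that touch the same two
databases can end up with no consistent order, whereas a single global ID
always orders them.

```python
from collections import defaultdict
from itertools import count

global_counter = count(1)
# Hypothetical per-database timelines: one independent counter per database.
per_db_counters = defaultdict(lambda: count(1))


def open_global():
    """Allocate one ID from the single global timeline."""
    return next(global_counter)


def open_per_db(dbs):
    """Allocate one ID per database touched, each from that database's timeline."""
    return {db: next(per_db_counters[db]) for db in dbs}


def compare(a, b):
    """Order two per-database ID maps using the databases they share.

    Returns "before" or "after" when every shared database agrees,
    and "undefined" when they disagree or nothing is shared.
    """
    shared = a.keys() & b.keys()
    verdicts = {a[db] < b[db] for db in shared}
    if verdicts == {True}:
        return "before"
    if verdicts == {False}:
        return "after"
    return "undefined"


# Global timeline: two cross-database transactions are trivially ordered.
g1 = open_global()
g2 = open_global()

# Per-database timelines with interleaved access:
# t1 touches test1 first, t2 touches test1 and then test2, t1 touches test2 last.
t1 = open_per_db(["test1"])
t2 = open_per_db(["test1", "test2"])
t1.update(open_per_db(["test2"]))

ordering = compare(t1, t2)
```

Here g1 < g2 holds, but ordering is "undefined": t1 precedes t2 in test1 and
follows it in test2. A real design would need extra machinery to resolve
that, which is the kind of added complexity the original question
anticipated.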
