You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Slava Markeyev <sl...@upsight.com> on 2015/06/12 05:15:20 UTC

nested join issue

I'm running into a peculiar issue with nested joins and outer select. I see
this error on 1.1.0 and 1.2.0 but not 0.13 which seems like a regression.

The following query produces no results:

select sfrom (
  select last.*, action.st2, action.n
  from (
    select purchase.s, purchase.timestamp, max (mevt.timestamp) as
last_stage_timestamp
    from (select * from purchase_history) purchase
    join (select * from cart_history) mevt
    on purchase.s = mevt.s
    where purchase.timestamp > mevt.timestamp
    group by purchase.s, purchase.timestamp
  ) last
  join (select * from events) action
  on last.s = action.s and last.last_stage_timestamp = action.timestamp
) list;

While this one does produce results

select *from (
  select last.*, action.st2, action.n
  from (
    select purchase.s, purchase.timestamp, max (mevt.timestamp) as
last_stage_timestamp
    from (select * from purchase_history) purchase
    join (select * from cart_history) mevt
    on purchase.s = mevt.s
    where purchase.timestamp > mevt.timestamp
    group by purchase.s, purchase.timestamp
  ) last
  join (select * from events) action
  on last.s = action.s and last.last_stage_timestamp = action.timestamp
) list;

1 21 20 Bob 1234
1 31 30 Bob 1234
3 51 50 Jeff 1234

The setup to test this is:

create table purchase_history (s string, product string, price double,
timestamp int);
insert into purchase_history values ('1', 'Belt', 20.00, 21);
insert into purchase_history values ('1', 'Socks', 3.50, 31);
insert into purchase_history values ('3', 'Belt', 20.00, 51);
insert into purchase_history values ('4', 'Shirt', 15.50, 59);

create table cart_history (s string, cart_id int, timestamp int);
insert into cart_history values ('1', 1, 10);
insert into cart_history values ('1', 2, 20);
insert into cart_history values ('1', 3, 30);
insert into cart_history values ('1', 4, 40);
insert into cart_history values ('3', 5, 50);
insert into cart_history values ('4', 6, 60);

create table events (s string, st2 string, n int, timestamp int);
insert into events values ('1', 'Bob', 1234, 20);
insert into events values ('1', 'Bob', 1234, 30);
insert into events values ('1', 'Bob', 1234, 25);
insert into events values ('2', 'Sam', 1234, 30);
insert into events values ('3', 'Jeff', 1234, 50);
insert into events values ('4', 'Ted', 1234, 60);

I realize select * and select s are not all that interesting in this
context but what lead me to this issue was select count(distinct s) was not
returning results. The above queries are the simplified queries that
produce the issue. I will note that if I convert the inner join to a table
and select from that the issue does not appear.

-- 

Slava Markeyev | Engineering | Upsight

Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
<http://www.linkedin.com/in/slavamarkeyev>

Re: nested join issue

Posted by Gautam <ga...@gmail.com>.
To clarify, HIVE-8435 introduced the regression. Turning that feature off
fixes the issue. So we still need to fix that optimization to not produce
this incorrect result.

On Fri, Jun 12, 2015 at 11:31 PM, Gautam <ga...@gmail.com> wrote:

> Found that turning off hive.optimize.remove.identity.project ( ref:
> HIVE-8435 <https://issues.apache.org/jira/browse/HIVE-8435> ) fixes the
> issue.
>
> This gives us a workaround, but dunno the performance degradation this
> impacts yet.
>
> Thanks!
> -Gautam.
>
>
> On Fri, Jun 12, 2015 at 6:02 PM, Gautam <ga...@gmail.com> wrote:
>
>> Done. https://issues.apache.org/jira/browse/HIVE-10996
>>
>> On Fri, Jun 12, 2015 at 1:47 PM, Gopal Vijayaraghavan <go...@apache.org>
>> wrote:
>>
>>> Hi
>>>
>>> > Thanks for investigating..  Trying to locate the patch that fixes this
>>> >between 1.1 and 2.0.0-SNAPSHOT. Any leads on what Jira this fix was part
>>> >of? Or what part of the code the patch is likely to be on?
>>>
>>> git bisect is the only way usually to identify these things.
>>>
>>> But before you hunt into the patches I suggest trying combinations of
>>> constant propogation, null-scan and identity projection remover
>>> optimizations to see if there¹s a workaround in there.
>>>
>>> An explain of the query added to a new JIRA would be good, to continue
>>> the
>>> analysis.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>>
>> --
>> "If you really want something in this life, you have to work for it. Now,
>> quiet! They're about to announce the lottery numbers..."
>>
>
>
>
> --
> "If you really want something in this life, you have to work for it. Now,
> quiet! They're about to announce the lottery numbers..."
>



-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."

Re: nested join issue

Posted by Gautam <ga...@gmail.com>.
Found that turning off hive.optimize.remove.identity.project ( ref:
HIVE-8435 <https://issues.apache.org/jira/browse/HIVE-8435> ) fixes the
issue.

This gives us a workaround, but dunno the performance degradation this
impacts yet.

Thanks!
-Gautam.


On Fri, Jun 12, 2015 at 6:02 PM, Gautam <ga...@gmail.com> wrote:

> Done. https://issues.apache.org/jira/browse/HIVE-10996
>
> On Fri, Jun 12, 2015 at 1:47 PM, Gopal Vijayaraghavan <go...@apache.org>
> wrote:
>
>> Hi
>>
>> > Thanks for investigating..  Trying to locate the patch that fixes this
>> >between 1.1 and 2.0.0-SNAPSHOT. Any leads on what Jira this fix was part
>> >of? Or what part of the code the patch is likely to be on?
>>
>> git bisect is the only way usually to identify these things.
>>
>> But before you hunt into the patches I suggest trying combinations of
>> constant propogation, null-scan and identity projection remover
>> optimizations to see if there¹s a workaround in there.
>>
>> An explain of the query added to a new JIRA would be good, to continue the
>> analysis.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
>
> --
> "If you really want something in this life, you have to work for it. Now,
> quiet! They're about to announce the lottery numbers..."
>



-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."

Re: nested join issue

Posted by Gautam <ga...@gmail.com>.
Done. https://issues.apache.org/jira/browse/HIVE-10996

On Fri, Jun 12, 2015 at 1:47 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> Hi
>
> > Thanks for investigating..  Trying to locate the patch that fixes this
> >between 1.1 and 2.0.0-SNAPSHOT. Any leads on what Jira this fix was part
> >of? Or what part of the code the patch is likely to be on?
>
> git bisect is the only way usually to identify these things.
>
> But before you hunt into the patches I suggest trying combinations of
> constant propogation, null-scan and identity projection remover
> optimizations to see if there¹s a workaround in there.
>
> An explain of the query added to a new JIRA would be good, to continue the
> analysis.
>
> Cheers,
> Gopal
>
>
>


-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."

Re: nested join issue

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi

> Thanks for investigating..  Trying to locate the patch that fixes this
>between 1.1 and 2.0.0-SNAPSHOT. Any leads on what Jira this fix was part
>of? Or what part of the code the patch is likely to be on?

git bisect is the only way usually to identify these things.

But before you hunt into the patches I suggest trying combinations of
constant propogation, null-scan and identity projection remover
optimizations to see if there¹s a workaround in there.

An explain of the query added to a new JIRA would be good, to continue the
analysis.

Cheers,
Gopal



Re: nested join issue

Posted by Gautam <ga...@gmail.com>.
Thanks for investigating..  Trying to locate the patch that fixes this
between 1.1 and 2.0.0-SNAPSHOT. Any leads on what Jira this fix was part
of? Or what part of the code the patch is likely to be on?

-Gautam.

On Thu, Jun 11, 2015 at 8:35 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> Hi,
>
> > I'm running into a peculiar issue with nested joins and outer select. I
> >see this error on 1.1.0 and 1.2.0 but not 0.13 which seems like a
> >regression.
> ...
> > create table events (s string, st2 string, n int, timestamp int);
>
>
> The issue does not seem to be happening in hive-2.0.0-SNAPSHOT, which
> means it has already been fixed & possibly can be backported easily to
> 1.2.1.
>
> Your test-cases threw a parse-exception when run as-is - naming a column
> ³timestamp² will kill you when you upgrade to the next version.
>
> Cheers,
> Gopal
>
>
>


-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."

Re: nested join issue

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,

> I'm running into a peculiar issue with nested joins and outer select. I
>see this error on 1.1.0 and 1.2.0 but not 0.13 which seems like a
>regression.
...
> create table events (s string, st2 string, n int, timestamp int);


The issue does not seem to be happening in hive-2.0.0-SNAPSHOT, which
means it has already been fixed & possibly can be backported easily to
1.2.1.

Your test-cases threw a parse-exception when run as-is - naming a column
³timestamp² will kill you when you upgrade to the next version.

Cheers,
Gopal