You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Himanish Kushary <hi...@gmail.com> on 2012/08/16 20:07:24 UTC

Help with Join involving Non-Equality condition

Hi,

We have two tables in the following structure :

Table1 :

|  id   |        packcreatetime         |   packid |
----------------------------------------------------------------------
| 505  |	2012-07-16 11:51:12	| 111024   |
| 505  |	2012-07-18 11:52:13	| 111025   |
| 505  |	2012-07-19 11:53:14   | 111026   |
| 504  |      2012-07-17  23:50:13  |  101020  |
------------------------------------------------------------------------

Table-2

| id   |   requesttime
----------------------------------------
| 505 | 2012-07-18 12:09:47
| 505 | 2012-07-19 12:09:59
| 505 | 2012-07-19 12:09:56
| 505 | 2012-07-17 12:06:40
| 505 | 2012-07-17 12:06:40
| 505 | 2012-07-17 12:09:15
| 504 | 2012-07-18 00:03:18
| 504 | 2012-07-18 00:15:41

We want to find out the packid from Table1 where the  is corresponding
in Table2 and the requesttime(in Table2) is between the
packcreatetime of two relevant records(in Table1)

So for the above example the final output will be:

| id   |   requesttime            |   packid
-------------------------------------------------------
| 505 | 2012-07-18 12:09:47 |  111025
| 505 | 2012-07-19 12:09:59 |  111026
| 505 | 2012-07-19 12:09:56 |  111026
| 505 | 2012-07-17 12:06:40 |  111024
| 505 | 2012-07-17 12:06:40 |  111024
| 505 | 2012-07-17 12:09:15 |  111024
| 504 | 2012-07-18 00:03:18 |  101020
| 504 | 2012-07-18 00:15:41 |  101020


As we cannot use >= , <= in Hive joins the between logic cannot be
implemented in joins, is there any way to accomplish this or do we
need to write custom M/R code for this.Looking forward for any
suggestions to accomplish this.

--
Thanks & Regards
Himanish

Re: Help with Join involving Non-Equality condition

Posted by Edward Capriolo <ed...@gmail.com>.
You should check out the work being done on. non-equi map joins
http://mail-archives.apache.org/mod_mbox/hive-dev/201206.mbox/%3C1948451998.13482.1339612423225.JavaMail.jiratomcat@issues-vm%3E

https://issues.apache.org/jira/browse/HIVE-3133

On Fri, Aug 17, 2012 at 1:52 AM, Bertrand Dechoux <de...@gmail.com> wrote:
> What are the data volume? And what are the meaning of those data?
>
> From what I can see, you have a 'pack' per day. If that's true, a map join
> could be used because you should not have that many pack creation (But I am
> not sure how to enforce that.)
> I so filtering could happen right after. You would indeed generate lots of
> tuple but they wouldn't be transported over the network nor written to disk.
>
> Even better if you really have (at least) a pack per day then you only need
> to group each request with three pack creation : the day before, the current
> day and the day after.
>
> Regards
>
> Bertrand
>
> On Fri, Aug 17, 2012 at 1:27 AM, Navis류승우 <na...@nexr.com> wrote:
>>
>> If you don't specify join condition, hive performs cross join.
>>
>> What is added to hive 0.10.0 is just a clarifying grammar.
>>
>>
>> 2012/8/17 Himanish Kushary <hi...@gmail.com>
>>>
>>> We are on Hive 0.8 , I think cross join is available only since 0.10.0
>>>
>>> Do we have any other options ?
>>>
>>> On Thu, Aug 16, 2012 at 2:28 PM, Ablimit Aji <ab...@gmail.com> wrote:
>>> > You can do a CROSS JOIN, then filter with the original inequality join
>>> > condition.
>>> > This would generate a lot of redundant tuples and may not work if you
>>> > have
>>> > large amounts of data.
>>> >
>>> > On Thu, Aug 16, 2012 at 2:07 PM, Himanish Kushary <hi...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> We have two tables in the following structure :
>>> >>
>>> >> Table1 :
>>> >>
>>> >> |  id   |        packcreatetime         |   packid |
>>> >> ----------------------------------------------------------------------
>>> >> | 505  |        2012-07-16 11:51:12     | 111024   |
>>> >> | 505  |        2012-07-18 11:52:13     | 111025   |
>>> >> | 505  |        2012-07-19 11:53:14   | 111026   |
>>> >> | 504  |      2012-07-17  23:50:13  |  101020  |
>>> >>
>>> >> ------------------------------------------------------------------------
>>> >>
>>> >> Table-2
>>> >>
>>> >> | id   |   requesttime
>>> >> ----------------------------------------
>>> >> | 505 | 2012-07-18 12:09:47
>>> >> | 505 | 2012-07-19 12:09:59
>>> >> | 505 | 2012-07-19 12:09:56
>>> >> | 505 | 2012-07-17 12:06:40
>>> >> | 505 | 2012-07-17 12:06:40
>>> >> | 505 | 2012-07-17 12:09:15
>>> >> | 504 | 2012-07-18 00:03:18
>>> >> | 504 | 2012-07-18 00:15:41
>>> >>
>>> >> We want to find out the packid from Table1 where the  is corresponding
>>> >> in Table2 and the requesttime(in Table2) is between the
>>> >> packcreatetime of two relevant records(in Table1)
>>> >>
>>> >> So for the above example the final output will be:
>>> >>
>>> >> | id   |   requesttime            |   packid
>>> >> -------------------------------------------------------
>>> >> | 505 | 2012-07-18 12:09:47 |  111025
>>> >> | 505 | 2012-07-19 12:09:59 |  111026
>>> >> | 505 | 2012-07-19 12:09:56 |  111026
>>> >> | 505 | 2012-07-17 12:06:40 |  111024
>>> >> | 505 | 2012-07-17 12:06:40 |  111024
>>> >> | 505 | 2012-07-17 12:09:15 |  111024
>>> >> | 504 | 2012-07-18 00:03:18 |  101020
>>> >> | 504 | 2012-07-18 00:15:41 |  101020
>>> >>
>>> >>
>>> >> As we cannot use >= , <= in Hive joins the between logic cannot be
>>> >> implemented in joins, is there any way to accomplish this or do we
>>> >> need to write custom M/R code for this.Looking forward for any
>>> >> suggestions to accomplish this.
>>> >>
>>> >> --
>>> >> Thanks & Regards
>>> >> Himanish
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Himanish
>>
>>
>
>
>
> --
> Bertrand Dechoux

Re: Help with Join involving Non-Equality condition

Posted by Bertrand Dechoux <de...@gmail.com>.
What are the data volume? And what are the meaning of those data?

>From what I can see, you have a 'pack' per day. If that's true, a map join
could be used because you should not have that many pack creation (But I am
not sure how to enforce that.)
I so filtering could happen right after. You would indeed generate lots of
tuple but they wouldn't be transported over the network nor written to disk.

Even better if you really have (at least) a pack per day then you only need
to group each request with three pack creation : the day before, the
current day and the day after.

Regards

Bertrand

On Fri, Aug 17, 2012 at 1:27 AM, Navis류승우 <na...@nexr.com> wrote:

> If you don't specify join condition, hive performs cross join.
>
> What is added to hive 0.10.0 is just a clarifying grammar.
>
>
> 2012/8/17 Himanish Kushary <hi...@gmail.com>
>
>> We are on Hive 0.8 , I think cross join is available only since 0.10.0
>>
>> Do we have any other options ?
>>
>> On Thu, Aug 16, 2012 at 2:28 PM, Ablimit Aji <ab...@gmail.com> wrote:
>> > You can do a CROSS JOIN, then filter with the original inequality join
>> > condition.
>> > This would generate a lot of redundant tuples and may not work if you
>> have
>> > large amounts of data.
>> >
>> > On Thu, Aug 16, 2012 at 2:07 PM, Himanish Kushary <hi...@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> We have two tables in the following structure :
>> >>
>> >> Table1 :
>> >>
>> >> |  id   |        packcreatetime         |   packid |
>> >> ----------------------------------------------------------------------
>> >> | 505  |        2012-07-16 11:51:12     | 111024   |
>> >> | 505  |        2012-07-18 11:52:13     | 111025   |
>> >> | 505  |        2012-07-19 11:53:14   | 111026   |
>> >> | 504  |      2012-07-17  23:50:13  |  101020  |
>> >>
>> ------------------------------------------------------------------------
>> >>
>> >> Table-2
>> >>
>> >> | id   |   requesttime
>> >> ----------------------------------------
>> >> | 505 | 2012-07-18 12:09:47
>> >> | 505 | 2012-07-19 12:09:59
>> >> | 505 | 2012-07-19 12:09:56
>> >> | 505 | 2012-07-17 12:06:40
>> >> | 505 | 2012-07-17 12:06:40
>> >> | 505 | 2012-07-17 12:09:15
>> >> | 504 | 2012-07-18 00:03:18
>> >> | 504 | 2012-07-18 00:15:41
>> >>
>> >> We want to find out the packid from Table1 where the  is corresponding
>> >> in Table2 and the requesttime(in Table2) is between the
>> >> packcreatetime of two relevant records(in Table1)
>> >>
>> >> So for the above example the final output will be:
>> >>
>> >> | id   |   requesttime            |   packid
>> >> -------------------------------------------------------
>> >> | 505 | 2012-07-18 12:09:47 |  111025
>> >> | 505 | 2012-07-19 12:09:59 |  111026
>> >> | 505 | 2012-07-19 12:09:56 |  111026
>> >> | 505 | 2012-07-17 12:06:40 |  111024
>> >> | 505 | 2012-07-17 12:06:40 |  111024
>> >> | 505 | 2012-07-17 12:09:15 |  111024
>> >> | 504 | 2012-07-18 00:03:18 |  101020
>> >> | 504 | 2012-07-18 00:15:41 |  101020
>> >>
>> >>
>> >> As we cannot use >= , <= in Hive joins the between logic cannot be
>> >> implemented in joins, is there any way to accomplish this or do we
>> >> need to write custom M/R code for this.Looking forward for any
>> >> suggestions to accomplish this.
>> >>
>> >> --
>> >> Thanks & Regards
>> >> Himanish
>> >
>> >
>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Bertrand Dechoux

Re: Help with Join involving Non-Equality condition

Posted by Navis류승우 <na...@nexr.com>.
If you don't specify join condition, hive performs cross join.

What is added to hive 0.10.0 is just a clarifying grammar.

2012/8/17 Himanish Kushary <hi...@gmail.com>

> We are on Hive 0.8 , I think cross join is available only since 0.10.0
>
> Do we have any other options ?
>
> On Thu, Aug 16, 2012 at 2:28 PM, Ablimit Aji <ab...@gmail.com> wrote:
> > You can do a CROSS JOIN, then filter with the original inequality join
> > condition.
> > This would generate a lot of redundant tuples and may not work if you
> have
> > large amounts of data.
> >
> > On Thu, Aug 16, 2012 at 2:07 PM, Himanish Kushary <hi...@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> We have two tables in the following structure :
> >>
> >> Table1 :
> >>
> >> |  id   |        packcreatetime         |   packid |
> >> ----------------------------------------------------------------------
> >> | 505  |        2012-07-16 11:51:12     | 111024   |
> >> | 505  |        2012-07-18 11:52:13     | 111025   |
> >> | 505  |        2012-07-19 11:53:14   | 111026   |
> >> | 504  |      2012-07-17  23:50:13  |  101020  |
> >> ------------------------------------------------------------------------
> >>
> >> Table-2
> >>
> >> | id   |   requesttime
> >> ----------------------------------------
> >> | 505 | 2012-07-18 12:09:47
> >> | 505 | 2012-07-19 12:09:59
> >> | 505 | 2012-07-19 12:09:56
> >> | 505 | 2012-07-17 12:06:40
> >> | 505 | 2012-07-17 12:06:40
> >> | 505 | 2012-07-17 12:09:15
> >> | 504 | 2012-07-18 00:03:18
> >> | 504 | 2012-07-18 00:15:41
> >>
> >> We want to find out the packid from Table1 where the  is corresponding
> >> in Table2 and the requesttime(in Table2) is between the
> >> packcreatetime of two relevant records(in Table1)
> >>
> >> So for the above example the final output will be:
> >>
> >> | id   |   requesttime            |   packid
> >> -------------------------------------------------------
> >> | 505 | 2012-07-18 12:09:47 |  111025
> >> | 505 | 2012-07-19 12:09:59 |  111026
> >> | 505 | 2012-07-19 12:09:56 |  111026
> >> | 505 | 2012-07-17 12:06:40 |  111024
> >> | 505 | 2012-07-17 12:06:40 |  111024
> >> | 505 | 2012-07-17 12:09:15 |  111024
> >> | 504 | 2012-07-18 00:03:18 |  101020
> >> | 504 | 2012-07-18 00:15:41 |  101020
> >>
> >>
> >> As we cannot use >= , <= in Hive joins the between logic cannot be
> >> implemented in joins, is there any way to accomplish this or do we
> >> need to write custom M/R code for this.Looking forward for any
> >> suggestions to accomplish this.
> >>
> >> --
> >> Thanks & Regards
> >> Himanish
> >
> >
>
>
>
> --
> Thanks & Regards
> Himanish
>

Re: Help with Join involving Non-Equality condition

Posted by Himanish Kushary <hi...@gmail.com>.
We are on Hive 0.8 , I think cross join is available only since 0.10.0

Do we have any other options ?

On Thu, Aug 16, 2012 at 2:28 PM, Ablimit Aji <ab...@gmail.com> wrote:
> You can do a CROSS JOIN, then filter with the original inequality join
> condition.
> This would generate a lot of redundant tuples and may not work if you have
> large amounts of data.
>
> On Thu, Aug 16, 2012 at 2:07 PM, Himanish Kushary <hi...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> We have two tables in the following structure :
>>
>> Table1 :
>>
>> |  id   |        packcreatetime         |   packid |
>> ----------------------------------------------------------------------
>> | 505  |        2012-07-16 11:51:12     | 111024   |
>> | 505  |        2012-07-18 11:52:13     | 111025   |
>> | 505  |        2012-07-19 11:53:14   | 111026   |
>> | 504  |      2012-07-17  23:50:13  |  101020  |
>> ------------------------------------------------------------------------
>>
>> Table-2
>>
>> | id   |   requesttime
>> ----------------------------------------
>> | 505 | 2012-07-18 12:09:47
>> | 505 | 2012-07-19 12:09:59
>> | 505 | 2012-07-19 12:09:56
>> | 505 | 2012-07-17 12:06:40
>> | 505 | 2012-07-17 12:06:40
>> | 505 | 2012-07-17 12:09:15
>> | 504 | 2012-07-18 00:03:18
>> | 504 | 2012-07-18 00:15:41
>>
>> We want to find out the packid from Table1 where the  is corresponding
>> in Table2 and the requesttime(in Table2) is between the
>> packcreatetime of two relevant records(in Table1)
>>
>> So for the above example the final output will be:
>>
>> | id   |   requesttime            |   packid
>> -------------------------------------------------------
>> | 505 | 2012-07-18 12:09:47 |  111025
>> | 505 | 2012-07-19 12:09:59 |  111026
>> | 505 | 2012-07-19 12:09:56 |  111026
>> | 505 | 2012-07-17 12:06:40 |  111024
>> | 505 | 2012-07-17 12:06:40 |  111024
>> | 505 | 2012-07-17 12:09:15 |  111024
>> | 504 | 2012-07-18 00:03:18 |  101020
>> | 504 | 2012-07-18 00:15:41 |  101020
>>
>>
>> As we cannot use >= , <= in Hive joins the between logic cannot be
>> implemented in joins, is there any way to accomplish this or do we
>> need to write custom M/R code for this.Looking forward for any
>> suggestions to accomplish this.
>>
>> --
>> Thanks & Regards
>> Himanish
>
>



-- 
Thanks & Regards
Himanish

Re: Help with Join involving Non-Equality condition

Posted by Ablimit Aji <ab...@gmail.com>.
You can do a CROSS JOIN, then filter with the original inequality join
condition.
This would generate a lot of redundant tuples and may not work if you have
large amounts of data.

On Thu, Aug 16, 2012 at 2:07 PM, Himanish Kushary <hi...@gmail.com>wrote:

> Hi,
>
> We have two tables in the following structure :
>
> Table1 :
>
> |  id   |        packcreatetime         |   packid |
> ----------------------------------------------------------------------
> | 505  |        2012-07-16 11:51:12     | 111024   |
> | 505  |        2012-07-18 11:52:13     | 111025   |
> | 505  |        2012-07-19 11:53:14   | 111026   |
> | 504  |      2012-07-17  23:50:13  |  101020  |
> ------------------------------------------------------------------------
>
> Table-2
>
> | id   |   requesttime
> ----------------------------------------
> | 505 | 2012-07-18 12:09:47
> | 505 | 2012-07-19 12:09:59
> | 505 | 2012-07-19 12:09:56
> | 505 | 2012-07-17 12:06:40
> | 505 | 2012-07-17 12:06:40
> | 505 | 2012-07-17 12:09:15
> | 504 | 2012-07-18 00:03:18
> | 504 | 2012-07-18 00:15:41
>
> We want to find out the packid from Table1 where the  is corresponding
> in Table2 and the requesttime(in Table2) is between the
> packcreatetime of two relevant records(in Table1)
>
> So for the above example the final output will be:
>
> | id   |   requesttime            |   packid
> -------------------------------------------------------
> | 505 | 2012-07-18 12:09:47 |  111025
> | 505 | 2012-07-19 12:09:59 |  111026
> | 505 | 2012-07-19 12:09:56 |  111026
> | 505 | 2012-07-17 12:06:40 |  111024
> | 505 | 2012-07-17 12:06:40 |  111024
> | 505 | 2012-07-17 12:09:15 |  111024
> | 504 | 2012-07-18 00:03:18 |  101020
> | 504 | 2012-07-18 00:15:41 |  101020
>
>
> As we cannot use >= , <= in Hive joins the between logic cannot be
> implemented in joins, is there any way to accomplish this or do we
> need to write custom M/R code for this.Looking forward for any
> suggestions to accomplish this.
>
> --
> Thanks & Regards
> Himanish
>