You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2010/02/19 00:21:17 UTC

Putting the big table rightmost in the join

I have worked through this issue.

* When doing Join, please put the table with big number of rows
containing the same join key to
the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.

This advice does work, but should we open up a jira to create a simple
optimizer that does this?

Edward

Re: Putting the big table rightmost in the join

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Feb 19, 2010 at 2:42 PM, Amr Awadallah <aa...@cloudera.com> wrote:
> Ed,
>
>  You apply the hint like this:
>
> SELECT /*+ STREAMTABLE(A) */ ......
>
> that will force table A to be streamed through whether you mention it as
> rightmost or not.
>
> What you are asking for is also reasonable, you are asking for a simple
> cost-based optimizer that looks at the sizes of the tables in bytes then
> automatically apply the STREAMTABLE hint for that table (the only problem
> here is a large table in terms of bytes doesn't mean it is large in terms of
> rows, that is why table stat collection would be useful).
>
> Yongqiang, besides the STREAMTABLE  and MAPJOIN hints, what other hints are
> allowed?
>
> -- amr
>
> On Fri, Feb 19, 2010 at 11:25 AM, Edward Capriolo <ed...@gmail.com>wrote:
>
>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>> <he...@software.ict.ac.cn> wrote:
>> > Hi Edward,
>> > You can do it with streamtable hint. Hive will put the table in that hint
>> in
>> > the rightmost.
>> >
>> > -yongqiang
>> > On 2/18/10 3:21 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
>> >
>> >> I have worked through this issue.
>> >>
>> >> * When doing Join, please put the table with big number of rows
>> >> containing the same join key to
>> >> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory
>> errors.
>> >>
>> >> This advice does work, but should we open up a jira to create a simple
>> >> optimizer that does this?
>> >>
>> >> Edward
>> >>
>> >>
>> >
>> >
>> >
>>
>> I do not understand the hint. A user can re-write the query can't they?
>>
>> select a join b
>> select b join a
>>
>> What I am asking, should we add an optimizer that uses does heuristics
>> on the tables and automatically streams the smaller/larger?
>>
>

I figured looking at table size may be misleading, and we want row
count. I was thinking we can do something similar to what the
TotalOrderPartitioner does, we can sample some rows from some random
blocks. With that information on average row size and table size we
should be able to computer which table has more rows. Not sure how
much overhead this process would add to each query but we could make
that a Conf switch.

Re: Putting the big table rightmost in the join

Posted by Yongqiang He <he...@software.ict.ac.cn>.

>Yongqiang, besides the STREAMTABLE  and MAPJOIN hints, what other hints are
allowed?
I think there are no others now.
There are few config setting for skew join and bucket map join, but not
hints.


Thanks
Yongqiang
On 2/19/10 11:42 AM, "Amr Awadallah" <aa...@cloudera.com> wrote:

> allowed?

Re: Putting the big table rightmost in the join

Posted by Amr Awadallah <aa...@cloudera.com>.

Ed,

  You apply the hint like this:

SELECT /*+ STREAMTABLE(A) */ ......

that will force table A to be streamed through whether you mention it as
rightmost or not.

What you are asking for is also reasonable, you are asking for a simple
cost-based optimizer that looks at the sizes of the tables in bytes then
automatically apply the STREAMTABLE hint for that table (the only problem
here is a large table in terms of bytes doesn't mean it is large in terms of
rows, that is why table stat collection would be useful).

Yongqiang, besides the STREAMTABLE  and MAPJOIN hints, what other hints are
allowed?

-- amr

On Fri, Feb 19, 2010 at 11:25 AM, Edward Capriolo <ed...@gmail.com>wrote:

> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
> <he...@software.ict.ac.cn> wrote:
> > Hi Edward,
> > You can do it with streamtable hint. Hive will put the table in that hint
> in
> > the rightmost.
> >
> > -yongqiang
> > On 2/18/10 3:21 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
> >
> >> I have worked through this issue.
> >>
> >> * When doing Join, please put the table with big number of rows
> >> containing the same join key to
> >> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory
> errors.
> >>
> >> This advice does work, but should we open up a jira to create a simple
> >> optimizer that does this?
> >>
> >> Edward
> >>
> >>
> >
> >
> >
>
> I do not understand the hint. A user can re-write the query can't they?
>
> select a join b
> select b join a
>
> What I am asking, should we add an optimizer that uses does heuristics
> on the tables and automatically streams the smaller/larger?
>

Re: Putting the big table rightmost in the join

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Feb 19, 2010 at 3:40 PM, Yongqiang He
<he...@software.ict.ac.cn> wrote:
> Yes. I agree.
> Moving work should be done by developers to users is always not
> user-friendly. :)
> For star-schemes, people can easily find out which table is the small table.
> Joining with more than one big table is always risky.
>
> But before a cost optimizer, it seems we have no other choice.
>
> Thanks
> Yongqiang
> On 2/19/10 11:35 AM, "Edward Capriolo" <ed...@gmail.com> wrote:
>
>> On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <ed...@gmail.com>
>> wrote:
>>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>>> <he...@software.ict.ac.cn> wrote:
>>>> Hi Edward,
>>>> You can do it with streamtable hint. Hive will put the table in that hint in
>>>> the rightmost.
>>>>
>>>> -yongqiang
>>>> On 2/18/10 3:21 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
>>>>
>>>>> I have worked through this issue.
>>>>>
>>>>> * When doing Join, please put the table with big number of rows
>>>>> containing the same join key to
>>>>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>>>>
>>>>> This advice does work, but should we open up a jira to create a simple
>>>>> optimizer that does this?
>>>>>
>>>>> Edward
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>> I do not understand the hint. A user can re-write the query can't they?
>>>
>>> select a join b
>>> select b join a
>>>
>>> What I am asking, should we add an optimizer that uses does heuristics
>>> on the tables and automatically streams the smaller/larger?
>>>
>>
>> The reason I am mentioning this is I am training hive users right now.
>> You can imagine that the first three table join someone did caused an
>> OOM. I explained to them roughly how a hive join works and how you
>> should move the largest table to one side. They understood but
>> replied, "Sounds like something an optimizer could handle."
>>
>> Even joining two tables it is a pain to ask someone to find which
>> table is larger. Imagine joining 10 or so. Also user perception, image
>> your first join throwing in OOM.
>>
>>
>
>
>

Understood but do you believe sampling tables as I suggested above,
could work as an effective cost optimizer? The other option to compute
stats is a problem Hive does not always produce the data it queries.

Re: Putting the big table rightmost in the join

Posted by Yongqiang He <he...@software.ict.ac.cn>.

Yes. I agree. 
Moving work should be done by developers to users is always not
user-friendly. :)
For star-schemes, people can easily find out which table is the small table.
Joining with more than one big table is always risky.

But before a cost optimizer, it seems we have no other choice.

Thanks
Yongqiang
On 2/19/10 11:35 AM, "Edward Capriolo" <ed...@gmail.com> wrote:

> On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <ed...@gmail.com>
> wrote:
>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>> <he...@software.ict.ac.cn> wrote:
>>> Hi Edward,
>>> You can do it with streamtable hint. Hive will put the table in that hint in
>>> the rightmost.
>>> 
>>> -yongqiang
>>> On 2/18/10 3:21 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
>>> 
>>>> I have worked through this issue.
>>>> 
>>>> * When doing Join, please put the table with big number of rows
>>>> containing the same join key to
>>>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>>> 
>>>> This advice does work, but should we open up a jira to create a simple
>>>> optimizer that does this?
>>>> 
>>>> Edward
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> I do not understand the hint. A user can re-write the query can't they?
>> 
>> select a join b
>> select b join a
>> 
>> What I am asking, should we add an optimizer that uses does heuristics
>> on the tables and automatically streams the smaller/larger?
>> 
> 
> The reason I am mentioning this is I am training hive users right now.
> You can imagine that the first three table join someone did caused an
> OOM. I explained to them roughly how a hive join works and how you
> should move the largest table to one side. They understood but
> replied, "Sounds like something an optimizer could handle."
> 
> Even joining two tables it is a pain to ask someone to find which
> table is larger. Imagine joining 10 or so. Also user perception, image
> your first join throwing in OOM.
> 
>

Re: Putting the big table rightmost in the join

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
> <he...@software.ict.ac.cn> wrote:
>> Hi Edward,
>> You can do it with streamtable hint. Hive will put the table in that hint in
>> the rightmost.
>>
>> -yongqiang
>> On 2/18/10 3:21 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
>>
>>> I have worked through this issue.
>>>
>>> * When doing Join, please put the table with big number of rows
>>> containing the same join key to
>>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>>
>>> This advice does work, but should we open up a jira to create a simple
>>> optimizer that does this?
>>>
>>> Edward
>>>
>>>
>>
>>
>>
>
> I do not understand the hint. A user can re-write the query can't they?
>
> select a join b
> select b join a
>
> What I am asking, should we add an optimizer that uses does heuristics
> on the tables and automatically streams the smaller/larger?
>

The reason I am mentioning this is I am training hive users right now.
You can imagine that the first three table join someone did caused an
OOM. I explained to them roughly how a hive join works and how you
should move the largest table to one side. They understood but
replied, "Sounds like something an optimizer could handle."

Even joining two tables it is a pain to ask someone to find which
table is larger. Imagine joining 10 or so. Also user perception, image
your first join throwing in OOM.

Re: Putting the big table rightmost in the join

Posted by Edward Capriolo <ed...@gmail.com>.

On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
<he...@software.ict.ac.cn> wrote:
> Hi Edward,
> You can do it with streamtable hint. Hive will put the table in that hint in
> the rightmost.
>
> -yongqiang
> On 2/18/10 3:21 PM, "Edward Capriolo" <ed...@gmail.com> wrote:
>
>> I have worked through this issue.
>>
>> * When doing Join, please put the table with big number of rows
>> containing the same join key to
>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>
>> This advice does work, but should we open up a jira to create a simple
>> optimizer that does this?
>>
>> Edward
>>
>>
>
>
>

I do not understand the hint. A user can re-write the query can't they?

select a join b
select b join a

What I am asking, should we add an optimizer that uses does heuristics
on the tables and automatically streams the smaller/larger?

Re: Putting the big table rightmost in the join

Posted by Yongqiang He <he...@software.ict.ac.cn>.

Hi Edward,
You can do it with streamtable hint. Hive will put the table in that hint in
the rightmost.
 
-yongqiang
On 2/18/10 3:21 PM, "Edward Capriolo" <ed...@gmail.com> wrote:

> I have worked through this issue.
> 
> * When doing Join, please put the table with big number of rows
> containing the same join key to
> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
> 
> This advice does work, but should we open up a jira to create a simple
> optimizer that does this?
> 
> Edward
> 
>