Posted to user@hive.apache.org by Anja Gruenheid <an...@gatech.edu> on 2011/01/31 23:04:52 UTC

Query Optimization in Hive

Hi!

I'm a graduate student from Georgia Tech and I'm working with Hive for a 
research project. I am interested in query optimization and the Hive 
MetaStore in that context. Working through the documentation and code, I 
noticed that the implementation right now is using a rule-based 
optimization system. Therefore, I was wondering whether cost-based query 
optimization will be a future task in the development of Hive and if it 
would be possible for me to cooperate with the developers of Hive to 
advance the project in general.

Best regards,
Anja Gruenheid

Re: Query Optimization in Hive

Posted by bharath vissapragada <bh...@gmail.com>.
Hi,

I have updated the JIRA. Kindly share your suggestions so that I can go
ahead and complete the task.

Thanks



Re: Query Optimization in Hive

Posted by bharath vissapragada <bh...@gmail.com>.
Thanks for replying, Namit.

It is motivating to receive mail from the authors of Hive :).

I filed a JIRA based on the discussion:
https://issues.apache.org/jira/browse/HIVE-1938

I will try to update it with my idea ASAP.

Thanks
Bharath,V
4th year Undergrad,IIIT Hyderabad.
w: http://research.iiit.ac.in/~bharath.v




Re: Query Optimization in Hive

Posted by Namit Jain <nj...@fb.com>.
Bharath,

This would be great.

Why don't you write up something about how you are planning to proceed?
File a new JIRA and upload some design notes or a spec there.
We can definitely sync up from there.


This feature would be very useful to the community. We at Facebook
would definitely like to use it.


Thanks,
-namit




Re: Query Optimization in Hive

Posted by bharath vissapragada <bh...@gmail.com>.
Hi Ning, Anja,

I am doing my Master's thesis on this topic. I have implemented all
SQL features such as joins, selects, etc. on top of Hadoop (before I
knew about Hive), and we have derived some basic cost models for join
reordering that seem to work well on some basic scales of the TPC-H
datasets. Later I came to know about Hive, and I am now trying to
implement the same in Hive.

Right now I am working through Hive's source and am almost done with
the "ql" package. It would be great if you could help us in this
regard. I am a bit confused about the implementation of joins; once I
am done with that, I can modify the "joinReorder" part of the optimizer
package to use the cost formulae and metadata. It would be a great
opportunity to work with you at FB and contribute to Hive.
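
To make the idea of cost-based join reordering concrete, here is a minimal Python sketch. It is not the cost model from the thesis and not Hive's optimizer code: the cost function (sum of intermediate result sizes over a left-deep plan), the function names `plan_cost` and `best_join_order`, and the row counts and selectivities are all assumptions chosen for illustration.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Cost of a left-deep join order, taken as the sum of the estimated
    sizes of all intermediate results (a common textbook simplification)."""
    joined = [order[0]]
    size = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        size *= rows[table]
        # Apply the selectivity of every join predicate that links the new
        # table to a table already in the intermediate result. A missing
        # entry means "no predicate", i.e. a cross product (selectivity 1).
        for prev in joined:
            size *= selectivity.get(frozenset((prev, table)), 1.0)
        joined.append(table)
        cost += size
    return cost


def best_join_order(rows, selectivity):
    """Exhaustively try every left-deep order; only viable for few tables."""
    return min(permutations(rows),
               key=lambda order: plan_cost(order, rows, selectivity))


# Hypothetical statistics, loosely TPC-H flavoured (not measured values).
rows = {"customer": 1_500, "orders": 15_000, "lineitem": 60_000}
selectivity = {
    frozenset(("customer", "orders")): 1 / 1_500,
    frozenset(("orders", "lineitem")): 1 / 15_000,
}

best = best_join_order(rows, selectivity)
print(best, plan_cost(best, rows, selectivity))
```

With these made-up numbers, any order that joins customer and lineitem first pays for a cross product, so the search avoids it. Exhaustive enumeration grows factorially with the number of tables; a real optimizer would need dynamic programming or a heuristic instead.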

Thanks
Bharath,V
4th year Undergrad,IIIT Hyderabad.
w: http://research.iiit.ac.in/~bharath.v


Re: Query Optimization in Hive

Posted by Ning Zhang <nz...@fb.com>.
Hi Anja,

As you noticed, Hive has only limited support for cost-based optimization. One of the reasons is that Hive used to have only a very small number of alternative execution plans to choose from. One exception is mapjoin vs. common join. Liying Tang did some work during his last internship to convert common joins to mapjoins in a rule-based fashion. Part of his future work is to convert common joins to mapjoins automatically based on stats. There is also ongoing work on indexes in Hive. Once indexes are supported, CBO will be much needed.
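
The stats-based choice between a mapjoin and a common join can be sketched roughly as follows. This is an illustrative Python sketch, not Hive's implementation: `TableStats`, `choose_join`, and the 25 MB threshold are hypothetical names and values, and the only statistic assumed is a table-level size in bytes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical limit on how large a table may be and still be loaded into
# each mapper's memory. Chosen for illustration, not a Hive setting.
SMALL_TABLE_BYTES = 25 * 1024 * 1024


@dataclass
class TableStats:
    name: str
    total_bytes: int  # table/partition-level size statistic


def choose_join(left: TableStats, right: TableStats,
                threshold: int = SMALL_TABLE_BYTES) -> Tuple[str, Optional[str]]:
    """Return (strategy, name of the table to hold in memory, if any)."""
    small = min(left, right, key=lambda t: t.total_bytes)
    if small.total_bytes <= threshold:
        # The smaller side fits in memory: join on the map side and skip
        # the shuffle/reduce phase that a common join needs.
        return ("mapjoin", small.name)
    return ("common_join", None)


print(choose_join(TableStats("dim_country", 2_000_000),
                  TableStats("fact_clicks", 500_000_000_000)))
```

The point of the sketch is that the decision needs nothing more than table-level size stats, which is why this conversion is a natural first step towards cost-based optimization.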

For a decent CBO to work, we need stats and cost models. There is some work on stats: table- and partition-level stats are already supported, and there is a JIRA open for column-level stats (HIVE-1362). The cost model is much more complex in a Hadoop environment and closely dependent on the mapjoin/index implementations. Once all of these are in place, we can then talk about plan enumeration etc.
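
As an example of why column-level stats matter for a cost model, the following Python sketch uses the classic textbook estimate for an equi-join, |R join S| ≈ |R| · |S| / max(ndv(R.a), ndv(S.b)), where ndv is the number of distinct values in the join column. The function name and the sample numbers are assumptions for illustration; this is not a formula taken from Hive.

```python
def estimate_join_rows(left_rows: int, right_rows: int,
                       left_ndv: int, right_ndv: int) -> float:
    """Estimate the output size of an equi-join from row counts
    (table-level stats) and distinct-value counts (column-level stats).

    Assumes uniformly distributed join keys and that the side with fewer
    distinct values has all of its keys present on the other side.
    """
    if left_rows == 0 or right_rows == 0:
        return 0.0
    # Guard against missing or zero distinct-value counts.
    return left_rows * right_rows / max(left_ndv, right_ndv, 1)


# Hypothetical numbers: 1,500 customers, 15,000 orders, and the join key
# takes 1,500 distinct values on one side and 1,000 on the other.
print(estimate_join_rows(1_500, 15_000, 1_500, 1_000))
```

Without the distinct-value counts, the only safe bound from table-level stats alone is the full cross product, which is far too pessimistic to rank plans.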

So yes, we are interested in CBO, but it is a large area and many missing pieces need to be filled in for Hive. If you have a particular interest in some area, you can propose your ideas on the hive-dev@hive.apache.org mailing list, or even apply for an internship at FB if you would like to work closely with us.

Thanks,
Ning



Re: Query Optimization in Hive

Posted by Ajo Fod <aj...@gmail.com>.
I think there is a developer mailing list; that is probably the best
place for this question.

Also, I think there is a cost-based query optimizer in the works somewhere.

-Ajo
