Posted to dev@hive.apache.org by Vaibhav Shrivastava <va...@gmail.com> on 2010/04/07 02:26:05 UTC

Re: Hive and Hadoop Ideas.

Hi all,
           I am sorry if this is spamming your inbox, but I wanted to
reach out. I would like to apply for the Google Summer of Code project
on implementing indexes in Hive. I have some basic ideas that I would
like to discuss with a mentor, and I would highly appreciate any
direction.

Title:  	 Using Indexes for Improved Performance of Queries
Student: 	Vaibhav Shrivastava
Abstract: 	Hive is used as an SQL-like abstraction for Map Reduce jobs.
The focus of the project shall be to implement indexes on tables to
facilitate the operations performed on them. The proposal discusses
the types of indexes that can be used, how they can be implemented,
and their prospective applications. The content shall be updated as
inputs and reviews are provided.
Content: 	

Title: Using Indexes for Improved Performance of Queries

       Indexes are a common way of speeding up row retrieval in
conventional databases. The idea is to keep an auxiliary structure
that maps the values of a column considered crucial to the query to
pointers to the matching rows. Consider, for example, retrieving the
records whose id falls within some desired range of values. Looking
up the index means that fewer rows have to be fetched.
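
As a rough illustration of the range lookup described above, here is a
minimal Python sketch. The table contents and the names `rows`,
`id_index` and `range_lookup` are assumptions made up for this example;
it shows the general idea of a sorted index, not how Hive would
implement one.

```python
import bisect

# Hypothetical table: (id, payload) rows stored in arbitrary order.
rows = [(42, "d"), (7, "a"), (19, "c"), (13, "b"), (88, "e")]

# The index: (id, row position) pairs kept sorted by id.
id_index = sorted((row_id, pos) for pos, (row_id, _) in enumerate(rows))
index_keys = [key for key, _ in id_index]

def range_lookup(low, high):
    """Return the rows with low <= id <= high, touching only matching rows."""
    start = bisect.bisect_left(index_keys, low)
    stop = bisect.bisect_right(index_keys, high)
    return [rows[pos] for _, pos in id_index[start:stop]]
```

Calling `range_lookup(10, 50)` fetches three rows directly through the
index, whereas a query without the index would have to examine all five.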

      In a Hadoop environment the files are located on HDFS. One can
consider an index entry to point to, say, a particular location
(offset) in a file.
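
A small sketch of such an offset-based index follows, in Python. It
assumes a line-oriented file of "key,value" records, and an in-memory
stream stands in for a file on HDFS; the function names are hypothetical
and no HDFS API is used.

```python
import io

def build_offset_index(stream):
    """Scan a stream of 'key,value' lines once and map key -> byte offsets."""
    index = {}
    offset = stream.tell()
    for line in iter(stream.readline, b""):
        key = line.split(b",", 1)[0].decode("utf-8")
        index.setdefault(key, []).append(offset)
        offset = stream.tell()
    return index

def fetch(stream, index, key):
    """Seek straight to each recorded offset instead of scanning the file."""
    records = []
    for offset in index.get(key, []):
        stream.seek(offset)
        records.append(stream.readline().decode("utf-8").rstrip("\n"))
    return records

# An in-memory stream stands in for a file stored on HDFS.
data = io.BytesIO(b"u1,login\nu2,click\nu1,logout\n")
offsets = build_offset_index(data)
```

With the index built, `fetch(data, offsets, "u1")` reads only the two
lines that belong to that key.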

      Another application could be to use an index for a particular
aggregation operator. For example, when the count for a particular
key is required, we can obtain it without scanning the whole file,
simply by counting the offsets recorded for that key in the index.
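
To make the counting idea concrete, here is a minimal Python sketch. The
index contents and the helper names are invented for illustration; the
only assumption is that the index maps each key to the list of offsets
of its rows.

```python
# Hypothetical index: key -> byte offsets of that key's rows in the data file.
offset_index = {
    "u1": [0, 18, 47],
    "u2": [9],
    "u3": [28, 37],
}

def indexed_count(index, key):
    """COUNT(*) for one key, answered from the index alone."""
    return len(index.get(key, []))

def indexed_group_counts(index):
    """COUNT(*) grouped by key, again without reading the data file."""
    return {key: len(offsets) for key, offsets in index.items()}
```

Neither function opens the data file, which is the saving the paragraph
above has in mind.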

      For building an index, one can think of a Map Reduce job in
which the data fetched is reduced efficiently by using a combiner at
an intermediate stage, and whose output is used to obtain sorted
index values.
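
The shape of such a job can be sketched in plain Python, with ordinary
functions standing in for the map, combine and reduce phases. This is
only a simulation under assumed inputs (two hypothetical splits of
"key,value" lines with their byte offsets); it does not use the Hadoop
API.

```python
from collections import defaultdict

def map_phase(split):
    """Emit (key, offset) for every (offset, line) record in one input split."""
    for offset, line in split:
        yield line.split(",", 1)[0], offset

def combine(pairs):
    """Combiner: group one mapper's output locally, one record per key."""
    grouped = defaultdict(list)
    for key, offset in pairs:
        grouped[key].append(offset)
    return list(grouped.items())

def reduce_phase(combined_outputs):
    """Reducer: merge the partial offset lists and emit keys in sorted order."""
    merged = defaultdict(list)
    for output in combined_outputs:
        for key, offsets in output:
            merged[key].extend(offsets)
    return [(key, sorted(merged[key])) for key in sorted(merged)]

# Two hypothetical input splits of (byte offset, line) records.
splits = [
    [(0, "u1,login"), (9, "u2,click"), (18, "u1,logout")],
    [(28, "u3,login"), (37, "u1,click")],
]
built_index = reduce_phase(combine(map_phase(s)) for s in splits)
```

The combiner collapses the first split's three map outputs into two
records before they would be shuffled, and the reducer emits the index
entries sorted by key.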

      One can also consider developing a kd tree (k-dimensional tree)
for multi-key queries; however, much of the simplicity of the design
may be lost in that case.
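
For reference, a kd tree over two keys can be sketched in a few lines of
Python. The (user_id, day) rows and the function names below are
assumptions for this example, and the tree is built in memory rather
than over files on HDFS.

```python
from collections import namedtuple

Node = namedtuple("Node", "point left right")

def build_kd_tree(points, depth=0):
    """Recursively split the points on alternating key dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        build_kd_tree(ordered[:mid], depth + 1),
        build_kd_tree(ordered[mid + 1:], depth + 1),
    )

def range_query(node, low, high, depth=0):
    """Return the points p with low[i] <= p[i] <= high[i] on every dimension."""
    if node is None:
        return []
    axis = depth % len(low)
    found = []
    if all(l <= v <= h for l, v, h in zip(low, node.point, high)):
        found.append(node.point)
    # Descend only into subtrees that can still overlap the query box.
    if low[axis] <= node.point[axis]:
        found += range_query(node.left, low, high, depth + 1)
    if node.point[axis] <= high[axis]:
        found += range_query(node.right, low, high, depth + 1)
    return found

# Hypothetical two-key rows: (user_id, day).
tree = build_kd_tree([(3, 10), (7, 2), (5, 6), (9, 8), (1, 4)])
```

A query such as `range_query(tree, (2, 3), (8, 9))` constrains both keys
at once. The extra bookkeeping compared with a single sorted index gives
a sense of the simplicity that would be lost.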


Experience:

      I am currently a Masters student in the Computer Science Dept.
at Stony Brook University, NY. I am working on textmap.com, a text
analysis system that uses Map Reduce jobs to extract sentiment
information from the articles it processes. I have experience using
Hadoop and am interested in furthering my knowledge.



Deliverables:

      After a design decision on what the index structure should be
(single dense index, B-tree index or a kd-tree type index), develop
a prototype.

       The emphasis for the mid-term evaluation would be to compare a
query using an index against the same query without one, or some
other application such as a join or a subquery.

       The emphasis for the final evaluation would be to optimize the
application built for the mid-term evaluation, or alternatively to
implement further queries whose performance may improve with indexes.



Mentor: Unknown (Can I get some assistance?)



On Tue, Apr 6, 2010 at 1:21 PM, Scott MacVicar <ma...@facebook.com> wrote:
> You can post it here though there is also an Apache GSoC list as well since Hadoop is an Apache project.
>
> Scott
>
> On Apr 6, 2010, at 9:00 AM, Vaibhav wrote:
>
>> thanks for replying scott.
>> Did you mean this mailing list ..? hive-dev@hadoop.apache.org ... I
>> have joined this group but was hesitant to post it on the list as I
>> didnt want to cause trouble to the developers. Is it ok if I post the
>> proposal there? Else can you direct me to someone whom I can directly
>> contact.
>>
>> Thanking you,
>> Vaibhav.
>>
>> On Apr 6, 4:17 am, Scott MacVicar <ma...@facebook.com> wrote:
>>> You should use the Apache Mailing list too, if we have extra slots we'll be using some of the Facebook ones.
>>>
>>> Scott
>>>
>>> On Apr 6, 2010, at 1:09 AM, Vaibhav wrote:
>>>
>>>
>>>
>>>> Hi Mentors,
>>>>                I have posted an simple proposal of my ideas. I would
>>>> like to have inputs, comments and other views as to whether there
>>>> needs to be more clarifications or updates on the proposal. I hope you
>>>> could mail it to me at vaibhav.s.mnnit [at] gmail.com. I would like to
>>>> thank you in advance for the time you give to my proposal.
>>>
>>>> Vaibhav.
>
>



-- 
Vaibhav Shrivastava,
Graduate Student,
MS Computer Science,
Stony Brook University.

RE: Hive and Hadoop Ideas.

Posted by Namit Jain <nj...@facebook.com>.
Hi Vaibhav,

Please also take a look at:

  https://issues.apache.org/jira/browse/HIVE-417


Thanks,
-namit


-----Original Message-----
From: John Sichi [mailto:jsichi@facebook.com] 
Sent: Wednesday, April 07, 2010 2:39 PM
To: hive-dev@hadoop.apache.org; Scott MacVicar
Cc: vaibhav.s.mnnit@gmail.com
Subject: RE: Hive and Hadoop Ideas.

Greetings Vaibhav,

I'd be happy to help on this and can act as your mentor if your proposal is accepted by Google+Facebook.  One interesting twist might be to look into using HBase for the indexing; this would help carry forward our Hive+HBase integration roadmap, and also allow you to focus on the higher-level aspects of the problem.

JVS

RE: Hive and Hadoop Ideas.

Posted by John Sichi <js...@facebook.com>.
Greetings Vaibhav,

I'd be happy to help on this and can act as your mentor if your proposal is accepted by Google+Facebook.  One interesting twist might be to look into using HBase for the indexing; this would help carry forward our Hive+HBase integration roadmap, and also allow you to focus on the higher-level aspects of the problem.

JVS