Posted to dev@hive.apache.org by Daniel Einspanjer <de...@mozilla.com> on 2010/09/12 04:00:39 UTC

Question regarding region scans in HBase integration

  I was trying to spend a little time this weekend catching up with the 
current state of HBase integration for Hive.  One thing that I haven't 
seen mentioned is how exactly Hive scans an HBase table during a SELECT.

Does Hive have logic that allows it to intelligently scan only the 
participating regions during a SELECT query that uses the rowkey?  If 
not, I recently wrote some code that lets a MapReduce job select only 
the regions covered by a list of start/end rowkey ranges.  If this 
might be useful to the Hive integration, I could create a JIRA and 
take a stab at putting together a patch.

Daniel Einspanjer
Metrics Architect
Mozilla Corporation

Re: Question regarding region scans in HBase integration

Posted by John Sichi <js...@facebook.com>.
I've uploaded a patch for review:

http://review.cloudera.org/r/823/diff

And updated the design doc:

http://wiki.apache.org/hadoop/Hive/FilterPushdownDev

CC'ing a few people who've been working on storage handlers.

JVS

On Sep 12, 2010, at 7:51 PM, John Sichi wrote:

I see.  My changes are starting out super-simple, addressing only the case of an equality predicate and a simple key.  Once I get those committed, we can talk about how to add support for compound keys and range predicates, which is where your code could come in.
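
To make that concrete: for a simple key, the equality case amounts to narrowing the Scan to a single-row range before splits are computed.  A rough sketch only (hypothetical helper, not the actual HIVE-1226 patch):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: turn an equality predicate on the row key into a point-range Scan.
// The stop row is the key plus one trailing zero byte, so the half-open
// [startRow, stopRow) range covers exactly the one matching row.
public class EqualityPushdownSketch {
  public static Scan scanForKey(String key) {
    byte[] start = Bytes.toBytes(key);
    byte[] stop = Bytes.add(start, new byte[] { 0x00 });
    Scan scan = new Scan();
    scan.setStartRow(start);  // inclusive
    scan.setStopRow(stop);    // exclusive
    return scan;
  }
}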

JVS

On Sep 11, 2010, at 7:26 PM, Daniel Einspanjer wrote:

Okay, that getSplits part is specifically where my code was involved.

My use case was one of salted rowkeys.  We are storing documents that have a guid as the id, and the creation date of each document is important for scanning.  When we tested a rowkey format of <creationtimestamp>+<guid>, RegionServer hotspotting became problematic, so we decided to salt the rowkey with the first digit of the guid: <hexchar>+<creationtimestamp>+<guid>.  This gives us nice distribution of inserts throughout the cluster, but of course it makes scanning a contiguous date range much more complicated.
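
For illustration only, building a key in that layout might look roughly like the sketch below (class and method names here are made up, not our actual code):

import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the salted rowkey layout described above:
//   <hexchar> + <creationtimestamp> + <guid>
// The salt is the first hex digit of the guid, so inserts spread across
// 16 buckets while rows within a bucket stay ordered by creation time.
public class SaltedKeySketch {
  public static byte[] rowKey(String guid, long creationTimestamp) {
    byte[] salt = Bytes.toBytes(guid.substring(0, 1));  // one hex char
    byte[] ts = Bytes.toBytes(creationTimestamp);       // 8 bytes, big-endian
    byte[] id = Bytes.toBytes(guid);
    return Bytes.add(salt, ts, id);
  }
}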

The code I have allows us to write a MapReduce job that takes a list of prefixes (e.g. the hexchar) and a list of ranges (e.g. the desired timestamps), constructs a master Scan object carrying any shared configuration such as filters or cache settings, and then derives a series of Scan objects that constitute the Cartesian product of the prefixes and ranges.  It passes those into a custom getSplits that ensures only the needed regions participate in the map phase.
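
Stripped down to a sketch (hypothetical names, not the real code), the scan expansion looks something like this:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: expand (prefixes x timestamp ranges) into one Scan per combination,
// copying shared settings (filter, caching) from the master/template Scan.
public class ScanRangeSketch {
  public static List<Scan> buildScans(Scan template,
                                      List<String> prefixes,
                                      List<long[]> tsRanges) {  // each range = {startTs, stopTs}
    List<Scan> scans = new ArrayList<Scan>();
    for (String prefix : prefixes) {
      for (long[] range : tsRanges) {
        Scan s = new Scan();
        s.setFilter(template.getFilter());
        s.setCaching(template.getCaching());
        s.setStartRow(Bytes.add(Bytes.toBytes(prefix), Bytes.toBytes(range[0])));
        s.setStopRow(Bytes.add(Bytes.toBytes(prefix), Bytes.toBytes(range[1])));
        scans.add(s);
      }
    }
    // A custom getSplits can then emit splits only for regions that overlap
    // one of these [startRow, stopRow) ranges.
    return scans;
  }
}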

If this sounds like it might be useful, I'll work on getting it cleaned up and posted somewhere so you can review it and maybe mine it for ideas.  If you are already past that point, I apologize for not checking into this sooner. :)

-Daniel

On 9/11/10 7:09 PM, John Sichi wrote:
Hi Daniel,

I'm almost done with this for HIVE-1226; the remaining step I need to finish is to get the filter passed down during getSplits, since the HBase getSplits implementation takes care of figuring out which regions contain the row in question.
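
The region pruning mentioned here boils down to an overlap test between each region's [startKey, endKey) range and the scan's [startRow, stopRow) range.  A rough sketch of that test (not the actual HBase source):

import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a region participates in the scan only if its key range
// intersects the scan's range.  An empty byte[] means "unbounded".
public class RegionOverlapSketch {
  public static boolean regionOverlapsScan(byte[] regionStart, byte[] regionEnd,
                                           byte[] scanStart, byte[] scanStop) {
    boolean scanStartsBeforeRegionEnds =
        scanStart.length == 0 || regionEnd.length == 0
            || Bytes.compareTo(scanStart, regionEnd) < 0;
    boolean scanStopsAfterRegionStarts =
        scanStop.length == 0 || Bytes.compareTo(scanStop, regionStart) > 0;
    return scanStartsBeforeRegionEnds && scanStopsAfterRegionStarts;
  }
}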

JVS



