Posted to issues@hbase.apache.org by "Alex Baranau (JIRA)" <ji...@apache.org> on 2012/08/20 21:45:37 UTC

[jira] [Created] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Alex Baranau created HBASE-6618:
-----------------------------------

             Summary: Implement FuzzyRowFilter with ranges support
                 Key: HBASE-6618
                 URL: https://issues.apache.org/jira/browse/HBASE-6618
             Project: HBase
          Issue Type: New Feature
          Components: filters
            Reporter: Alex Baranau
            Priority: Minor


Apart from the current ability to specify a fuzzy row filter, e.g. ????_0004 for the <userId_actionId> format (where 0004 is the actionId), it would be great to also be able to specify a "fuzzy range", e.g. ????_0004, ..., ????_0099.

See initial discussion here: http://search-hadoop.com/m/WVLJdX0Z65

Note: currently it is possible to provide multiple fuzzy row rules to the existing FuzzyRowFilter, but when the range is big (contains thousands of values) this is not efficient.
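
To illustrate the note above, here is a minimal sketch of the "one rule per value" workaround, written in plain Java without any HBase classes. The class and method names (EnumeratedFuzzyRules, expand, matchesAny) are hypothetical, and the mask convention (1 = any byte, 0 = byte must match) is an assumption made for the example. It only shows why the cost grows with the size of the range: every row is checked against every enumerated rule.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class EnumeratedFuzzyRules {
    /** Builds one rule key per actionId in [from, to], e.g. "????_0004". */
    static List<byte[]> expand(int from, int to) {
        List<byte[]> keys = new ArrayList<>();
        for (int id = from; id <= to; id++) {
            String key = String.format(Locale.ROOT, "????_%04d", id);
            keys.add(key.getBytes(StandardCharsets.US_ASCII));
        }
        return keys;
    }

    /** Assumed mask convention: 1 = any byte is accepted, 0 = byte must equal the key byte. */
    static boolean matches(byte[] row, byte[] key, byte[] mask) {
        if (row.length < key.length) {
            return false;
        }
        for (int i = 0; i < key.length; i++) {
            if (mask[i] == 0 && row[i] != key[i]) {
                return false;
            }
        }
        return true;
    }

    /** Tests the row against each rule in turn, so the work per row is O(number of rules). */
    static boolean matchesAny(byte[] row, List<byte[]> keys, byte[] mask) {
        for (byte[] key : keys) {
            if (matches(row, key, mask)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] mask = {1, 1, 1, 1, 0, 0, 0, 0, 0}; // "????" is fuzzy, "_0004" is fixed
        List<byte[]> keys = expand(4, 99);
        byte[] row = "abcd_0042".getBytes(StandardCharsets.US_ASCII);
        System.out.println(keys.size() + " rules, match: " + matchesAny(row, keys, mask));
    }
}
```

With a range of thousands of values the rule list, and the per-row work, grows accordingly, which is the inefficiency a single ranged rule would avoid.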

The filter should perform efficient fast-forwarding during the scan (this is what distinguishes it from a regex row filter).

While such functionality may seem like a proper fit for a custom filter (i.e. one not included in the standard filter set), it looks like the filter may be very reusable. We may judge based on the implementation that will hopefully be added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Zhihong Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440039#comment-13440039 ] 

Zhihong Ted Yu commented on HBASE-6618:
---------------------------------------

Thanks for the update, Alex.
I get your idea, though a few arrows seem to be missing (e.g. CCF is ?) in the diagram for toInc.
                

        

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Anil Gupta (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442276#comment-13442276 ] 

Anil Gupta commented on HBASE-6618:
-----------------------------------

Hi Alex,

I am still unable to access the png file for the algorithm. Is there a problem with the JIRA system, or can you re-upload the image?

Thanks,
Anil Gupta
Software Engineer II, Intuit, Inc 
                

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Alex Baranau (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440242#comment-13440242 ] 

Alex Baranau commented on HBASE-6618:
-------------------------------------

Ah, sorry, I haven't said anything about that. For toInc: we may not change it at every step, so if there's a missing arrow, that means nothing should be changed.

Thanks for checking it out!

One thing that I'm not 100% sure about: is it better to adjust the current FuzzyRowFilter and add this functionality to it, or to add a new filter? I'm leaning towards adjusting FuzzyRowFilter, as this new feature fits naturally in it. Thoughts?
                

        

[jira] [Updated] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Alex Baranau (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Baranau updated HBASE-6618:
--------------------------------

    Attachment: HBASE-6618-algo-desc-bits.png
                HBASE-6618-algo.patch

Anil,

Now that I have thought about it, I realize that finding the row key to fast-forward to, given any number of "range groups" in the fuzzy rules, is quite easy. It can also be done in just *one pass*, by going through the bytes of the given row.

I didn't have much time to add this functionality to the filter itself, but I implemented the algorithm that seems to find the row key to fast-forward to (if you are interested in looking at it). I added a static method for that with a small (not full) unit test, and also attached a brief description of the algorithm. I hope I'm not missing anything.
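
The attached patch is not reproduced in this thread, so the following is only a sketch of one way such a one-pass search could work, not the algorithm from HBASE-6618-algo.patch. It assumes a rule given as start key, end key and mask (1 = fuzzy byte, 0 = fixed byte), treats each run of equal mask values as one segment, compares raw bytes as unsigned values, and requires the row to have the same length as the rule. All names (RangedFuzzyNextKey, nextMatching, incFrom, incTo) are hypothetical; incFrom/incTo remember the last segment that can still be increased, which may or may not correspond to the toInc mentioned in the discussion.

```java
import java.nio.charset.StandardCharsets;

final class RangedFuzzyNextKey {
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /**
     * Returns the smallest key >= row (unsigned byte order) that satisfies the rule,
     * or null if no such key exists. Assumes start <= end in every fixed segment.
     */
    static byte[] nextMatching(byte[] row, byte[] start, byte[] end, byte[] mask) {
        int n = mask.length;
        if (row.length != n || start.length != n || end.length != n) {
            throw new IllegalArgumentException("all arrays must have the same length");
        }
        // Per-byte bounds: fuzzy bytes may be anything, fixed bytes come from start/end.
        byte[] lo = new byte[n];
        byte[] hi = new byte[n];
        for (int i = 0; i < n; i++) {
            lo[i] = mask[i] == 1 ? (byte) 0x00 : start[i];
            hi[i] = mask[i] == 1 ? (byte) 0xFF : end[i];
        }
        byte[] key = row.clone();
        int incFrom = -1; // last segment [incFrom, incTo) strictly below its upper bound
        int incTo = -1;
        int i = 0;
        while (i < n) {
            int j = i;
            while (j < n && mask[j] == mask[i]) {
                j++; // segment is [i, j)
            }
            if (compare(key, lo, i, j) < 0) {
                // Below the range: raise this segment and everything after it to the minimum.
                System.arraycopy(lo, i, key, i, n - i);
                return key;
            }
            int cmpHi = compare(key, hi, i, j);
            if (cmpHi > 0) {
                // Above the range: bump the last increasable segment, reset the rest.
                if (incFrom < 0) {
                    return null;
                }
                increment(key, incFrom, incTo);
                System.arraycopy(lo, incTo, key, incTo, n - incTo);
                return key;
            }
            if (cmpHi < 0) {
                incFrom = i;
                incTo = j;
            }
            i = j;
        }
        return key; // the row itself satisfies the rule
    }

    /** Unsigned lexicographic comparison of a[from, to) and b[from, to). */
    private static int compare(byte[] a, byte[] b, int from, int to) {
        for (int k = from; k < to; k++) {
            int d = (a[k] & 0xFF) - (b[k] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }

    /** Adds one to key[from, to) read as a big-endian unsigned number. */
    private static void increment(byte[] key, int from, int to) {
        for (int k = to - 1; k >= from; k--) {
            if (++key[k] != 0) {
                return; // no carry into the next byte
            }
        }
    }
}
```

Because every segment has a fixed length, the byte order of whole keys coincides with the order of their segment tuples, which is what lets a single left-to-right pass decide where to jump. Note that the ranges here are over raw bytes, so "0009" is followed by "000:" rather than "0010".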

I will implement the new feature of the filter as a next step.
                

        

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Zhihong Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440268#comment-13440268 ] 

Zhihong Ted Yu commented on HBASE-6618:
---------------------------------------

Enhancing the existing class is fine.
                

        

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Alex Baranau (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438147#comment-13438147 ] 

Alex Baranau commented on HBASE-6618:
-------------------------------------

Sorry for the spam; for some reason I cannot edit the comment, and JIRA broke the formatting of the text pieces of my previous comment (I should have checked that first, sorry). This is how it is supposed to look:

Just an idea. Maybe we should try to improve the existing FuzzyRowFilter by allowing each fuzzy rule to be specified with:
* fuzzy key start
* fuzzy key end << this is currently missing in FuzzyRowFilter
* mask

This looks flexible enough to me. E.g. one could specify the rule ????(0001 - 0999)???(001 - 099), i.e. <any 4 bytes><any 4 bytes value between "0001" and "0999"><any 3 bytes><any 3 bytes value between "001" and "099">, with this definition:
* ????0001???001
* ????0999???099 << currently missing
* 11110000111000

In this case any sequence of "fixed" positions is treated as one n-byte value.

Alternatively, such a fuzzy rule can be specified as a list of parts, each part being one of:
* n "fuzzy" bytes
* start/stop key part range (of the same length)

This might be closer to a "human-readable" definition, though the former one could be easier to deal with.
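
As a rough illustration of the start/end/mask form described above, here is a minimal sketch of how a row could be checked against one ranged rule. It is plain Java and not the FuzzyRowFilter API; the class name RangedFuzzyRule and its methods are hypothetical. It assumes the mask convention from the example (1 = fuzzy byte, 0 = fixed byte), compares bytes as unsigned values, treats each run of fixed positions as one n-byte value, and accepts rows that are at least as long as the rule.

```java
import java.nio.charset.StandardCharsets;

final class RangedFuzzyRule {
    private final byte[] start;
    private final byte[] end;
    private final byte[] mask; // 1 = fuzzy position, 0 = fixed position

    RangedFuzzyRule(String start, String end, String mask) {
        if (start.length() != end.length() || start.length() != mask.length()) {
            throw new IllegalArgumentException("start, end and mask must have equal length");
        }
        this.start = start.getBytes(StandardCharsets.US_ASCII);
        this.end = end.getBytes(StandardCharsets.US_ASCII);
        this.mask = new byte[mask.length()];
        for (int i = 0; i < mask.length(); i++) {
            this.mask[i] = (byte) (mask.charAt(i) - '0');
        }
    }

    /** True if every run of fixed positions in the row lies within [start, end] for that run. */
    boolean matches(byte[] row) {
        if (row.length < mask.length) {
            return false;
        }
        int i = 0;
        while (i < mask.length) {
            if (mask[i] == 1) {
                i++; // fuzzy byte: anything is accepted
                continue;
            }
            int j = i;
            while (j < mask.length && mask[j] == 0) {
                j++; // extend the run of fixed positions to [i, j)
            }
            if (compare(row, start, i, j) < 0 || compare(row, end, i, j) > 0) {
                return false;
            }
            i = j;
        }
        return true;
    }

    /** Unsigned lexicographic comparison of a[from, to) and b[from, to). */
    private static int compare(byte[] a, byte[] b, int from, int to) {
        for (int k = from; k < to; k++) {
            int d = (a[k] & 0xFF) - (b[k] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        RangedFuzzyRule rule =
            new RangedFuzzyRule("????0001???001", "????0999???099", "11110000111000");
        byte[] row = "abcd0500xyz050".getBytes(StandardCharsets.US_ASCII);
        System.out.println(rule.matches(row));
    }
}
```

One consequence of this representation is that two adjacent fixed ranges cannot be told apart: they merge into a single run and are compared as one value.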

Anil, as you expressed willingness to work on this, what are your thoughts? Maybe you have something different in mind?
                

        

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Anil Gupta (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439311#comment-13439311 ] 

Anil Gupta commented on HBASE-6618:
-----------------------------------

Hi Alex,

I agree with your idea of a range-based fuzzy filter. However, I would like to take a phased approach in developing this:
In your proposal, the user can provide multiple fuzzy ranges in a single scan, i.e. <any 4 bytes><any 6 bytes value between "_0001_" and "_0099_"><any 3 bytes><any 4 bytes value between "_001" and "_099">.
Instead of the above, IMO let's first try to make a filter for "<any 4 bytes><any 6 bytes value between "_0001_" and "_0099_"><any 3 bytes>" or "<any 4 bytes><any 6 bytes value between "_0001_" and "_0099_">". Once we have developed this, we can enhance it to use multiple fuzzy ranges. This is just my thought on how to approach developing this. Let me know your opinion.

From this week, I have had to shift focus at work from HBase to Hive and HCatalog for another POC, so I'll be squeezing time for this JIRA out of my work schedule. I'll start looking into the current implementation of FuzzyRowFilter to get an idea about the implementation.

Thanks,
Anil Gupta
Software Engineer II, Intuit, Inc 
                

        

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Alex Baranau (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438134#comment-13438134 ] 

Alex Baranau commented on HBASE-6618:
-------------------------------------

Just an idea. Maybe we should try to improve the existing FuzzyRowFilter by allowing each fuzzy rule to be specified with:
* fuzzy key start
* fuzzy key end << this is currently missing in FuzzyRowFilter
* mask

This looks flexible enough to me. E.g. one could specify the rule ????(_0001_-_0099_)???(_001-_099), i.e. <any 4 bytes><any 6 bytes value between "_0001_" and "_0099_"><any 3 bytes><any 4 bytes value between "_001" and "_099">, with this definition:
* ????_0001_???_001
* ????_0099_???_099 << currently missing
* 11110000001110000

In this case any sequence of "fixed" positions is treated as one n-byte value.

--
Alternatively, such a fuzzy rule can be specified as a list of parts, each part being one of:
* n "fuzzy" bytes
* start/stop key part range (of the same length)

This might be closer to a "human-readable" definition, though the former one could be easier to deal with.

Anil, as you expressed willingness to work on this, what are your thoughts? Maybe you have something different in mind?
                

        

[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support

Posted by "Alex Baranau (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442944#comment-13442944 ] 

Alex Baranau commented on HBASE-6618:
-------------------------------------

Weird. I can open it. Anyhow, sent it to your email.
                