Posted to dev@pig.apache.org by "Bill Graham (JIRA)" <ji...@apache.org> on 2012/09/26 23:59:07 UTC

[jira] [Created] (PIG-2934) HBaseStorage filter optimizations

Bill Graham created PIG-2934:
--------------------------------

             Summary: HBaseStorage filter optimizations
                 Key: PIG-2934
                 URL: https://issues.apache.org/jira/browse/PIG-2934
             Project: Pig
          Issue Type: Improvement
            Reporter: Bill Graham
            Assignee: Bill Graham


Our HBase pal/guru Gary Helmling was kind enough to do a code review of HBaseStorage. He suggested some good filter optimizations:

* when using the "lt*" and "gt*" options, set the start/stop rows on the Scan instance, at least in addition to the RowFilters. Without this you're doing a full table scan, regardless of the RowFilters.
* when selecting specific columns or entire families to return, it would be more efficient to set the family + columns on the Scan object (addFamily(), addColumn()), instead of using a FilterList. I'm not familiar with the family:prefix handling you mention, but that would still seem to require filters. But if that's not being used, it would be better to avoid the FilterList for columns. At minimum, we should probably call Scan.addFamily() with the distinct families, so we can skip entire column families that are not being used. In the case of a table with 4 CFs, if, say, only 1 is being used, this could be a big gain.
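
To illustrate the first suggestion, here is a toy model in plain Java. It is not HBase code: a TreeMap stands in for a table sorted by row key, and the names RangeScanSketch, filterOnly and bounded are made up for this sketch. It compares testing every row against a range predicate with seeking straight to the range, which is roughly the difference between relying on RowFilters alone and also setting start/stop rows on the Scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: a sorted in-memory map plays the role of a table
// ordered by row key. All names here are hypothetical.
public class RangeScanSketch {

    /** Matching row keys plus the number of rows that had to be visited. */
    static final class ScanResult {
        final List<String> rows = new ArrayList<>();
        int visited = 0;
    }

    // Filter-only approach: every row is visited and tested against the range.
    static ScanResult filterOnly(NavigableMap<String, String> table, String gte, String lt) {
        ScanResult result = new ScanResult();
        for (String key : table.keySet()) {
            result.visited++;
            if (key.compareTo(gte) >= 0 && key.compareTo(lt) < 0) {
                result.rows.add(key);
            }
        }
        return result;
    }

    // Bounded approach: seek to the start key and stop before the stop key,
    // so rows outside the range are never visited.
    static ScanResult bounded(NavigableMap<String, String> table, String gte, String lt) {
        ScanResult result = new ScanResult();
        for (String key : table.subMap(gte, true, lt, false).keySet()) {
            result.visited++;
            result.rows.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("row%04d", i), "v");
        }
        ScanResult a = filterOnly(table, "row0100", "row0110");
        ScanResult b = bounded(table, "row0100", "row0110");
        System.out.println("same rows: " + a.rows.equals(b.rows));
        System.out.println("visited: " + a.visited + " vs " + b.visited);
    }
}
```

With 1,000 rows and a 10-row range, both methods return the same rows, but filterOnly visits all 1,000 rows while bounded visits only 10.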


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-2934) HBaseStorage filter optimizations

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-2934:
-----------------------------

    Status: Patch Available  (was: In Progress)

We uncovered a significant performance problem with HBaseStorage in Pig versions after 0.9 when it is used with a long list of columns on a tall table. The previous use of filters hits HBase too hard and pegs the HBase cluster's CPU. We should consider including this patch in Pig 0.11.
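
As a rough illustration of why filtering costs more than selecting families up front, here is a toy model in plain Java. It is not HBase code and not the attached patch: each column family is modelled as its own sorted map, and the names FamilySelectionSketch, filterStyle and selectionStyle are made up for this sketch. The filter-style method has to look at every cell before discarding the ones it does not want, while the selection-style method only opens the families that were requested, which is the effect Scan.addFamily() is meant to achieve.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: one sorted map per column family. All names are hypothetical.
public class FamilySelectionSketch {

    // Builds family -> (row key -> value), with one cell per row per family.
    static Map<String, TreeMap<String, String>> buildTable(int rows, List<String> families) {
        Map<String, TreeMap<String, String>> table = new LinkedHashMap<>();
        for (String family : families) {
            TreeMap<String, String> store = new TreeMap<>();
            for (int i = 0; i < rows; i++) {
                store.put(String.format("row%04d", i), family + "-value");
            }
            table.put(family, store);
        }
        return table;
    }

    // Filter style: visit every cell in every family, then test each one.
    // Returns {cells visited, cells returned}.
    static int[] filterStyle(Map<String, TreeMap<String, String>> table, Set<String> wanted) {
        int visited = 0;
        int returned = 0;
        for (Map.Entry<String, TreeMap<String, String>> store : table.entrySet()) {
            for (String row : store.getValue().keySet()) {
                visited++;
                if (wanted.contains(store.getKey())) {
                    returned++;
                }
            }
        }
        return new int[] {visited, returned};
    }

    // Selection style: only open the families that were asked for.
    // Returns {cells visited, cells returned}.
    static int[] selectionStyle(Map<String, TreeMap<String, String>> table, Set<String> wanted) {
        int visited = 0;
        int returned = 0;
        for (String family : wanted) {
            TreeMap<String, String> store = table.get(family);
            if (store == null) {
                continue; // unknown family: nothing to read
            }
            visited += store.size();
            returned += store.size();
        }
        return new int[] {visited, returned};
    }

    public static void main(String[] args) {
        Map<String, TreeMap<String, String>> table =
                buildTable(100, Arrays.asList("cf1", "cf2", "cf3", "cf4"));
        Set<String> wanted = Collections.singleton("cf1");
        System.out.println(Arrays.toString(filterStyle(table, wanted)));
        System.out.println(Arrays.toString(selectionStyle(table, wanted)));
    }
}
```

For a table with 4 families of 100 rows each and a single wanted family, the filter-style pass visits 400 cells to return 100, while the selection-style pass visits only the 100 it returns.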
                
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>         Attachments: PIG-2934.1.patch
>
>


[jira] [Updated] (PIG-2934) HBaseStorage filter optimizations

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-2934:
-----------------------------

      Resolution: Fixed
    Release Note: HBaseStorage filter performance improvements
          Status: Resolved  (was: Patch Available)

Committed to both trunk and Pig 0.11 branch
                
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>             Fix For: 0.11
>
>         Attachments: PIG-2934.1.patch
>
>


[jira] [Work started] (PIG-2934) HBaseStorage filter optimizations

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on PIG-2934 started by Bill Graham.

> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>


[jira] [Updated] (PIG-2934) HBaseStorage filter optimizations

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-2934:
-----------------------------

    Affects Version/s: 0.10.0
    
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>         Attachments: PIG-2934.1.patch
>
>


[jira] [Updated] (PIG-2934) HBaseStorage filter optimizations

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-2934:
-----------------------------

    Fix Version/s: 0.11
    
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>             Fix For: 0.11
>
>         Attachments: PIG-2934.1.patch
>
>


[jira] [Commented] (PIG-2934) HBaseStorage filter optimizations

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493334#comment-13493334 ] 

Bill Graham commented on PIG-2934:
----------------------------------

Initialization often happens in the setLocation method, since that's the first point at which the class has a conf object on the cluster. It could happen elsewhere in the initialization process, though, if that works.
                
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>


[jira] [Commented] (PIG-2934) HBaseStorage filter optimizations

Posted by "Christoph Bauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493232#comment-13493232 ] 

Christoph Bauer commented on PIG-2934:
--------------------------------------

I'm starting on a patch for HBaseStorage here at my company.

Regarding your first issue, you're totally right. It seems weird that it was implemented with filters at all.
The second issue is different. In HBaseStorage.setLocation, those families are added to the Scan object. I don't understand why it's done there, though.
                
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>


[jira] [Updated] (PIG-2934) HBaseStorage filter optimizations

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-2934:
-----------------------------

    Attachment: PIG-2934.1.patch

Attaching patch that reduces the number of filters used and improves how range scans are done.
                
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>         Attachments: PIG-2934.1.patch
>
>
