Posted to dev@pig.apache.org by "Harsh J (JIRA)" <ji...@apache.org> on 2012/09/14 10:11:07 UTC

[jira] [Created] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Harsh J created PIG-2921:
----------------------------

             Summary: Provide a bulkloadable option in HBaseStorage
                 Key: PIG-2921
                 URL: https://issues.apache.org/jira/browse/PIG-2921
             Project: Pig
          Issue Type: New Feature
          Components: data
    Affects Versions: 0.9.2
            Reporter: Harsh J


Right now, Pig's HBaseStorage writes Puts directly into HBase. This is slow for bulk operations (exactly the kind of work Pig does). Puts/Deletes are meant more for realtime operations, so it would be nice if Pig had an automatic mechanism to prepare bulk-loadable HFiles for the target table and bulk load them right at the end of the job.

For compatibility reasons, this can be optional and turned off by default until it is agreed that it should be the default (while still providing an option to turn it off).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Harsh J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478671#comment-13478671 ] 

Harsh J commented on PIG-2921:
------------------------------

> HBaseStorage could always just write HFiles which the caller then needs to bulk import. This puts the burden on the caller to know what to do. Not the greatest solution.

But this is a good thing to start with. I think we can go with this as an optional step, and document it. We have "secure" bulkload support coming up in HBase soon, which we can switch over to in order to avoid this.

I'll follow up shortly.
                

[jira] [Commented] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478686#comment-13478686 ] 

Bill Graham commented on PIG-2921:
----------------------------------

+1

Note that the HBaseStorage class is already > 1000 lines long and is getting unwieldy. If possible, we should implement this using composition, or possibly inheritance. This would help slow the growth and complexity of this class.
                

[jira] [Commented] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458909#comment-13458909 ] 

Bill Graham commented on PIG-2921:
----------------------------------

The other tricky part is how to have just one Mapper or Reducer execute the bulk load call after all the HFiles have been created. PIG-1891 might be able to help here.
                

[jira] [Commented] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Harsh J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458970#comment-13458970 ] 

Harsh J commented on PIG-2921:
------------------------------

Thanks for chiming in, Bill! Is it a no-no to run a single command in the frontend (on the client side) instead?
                

[jira] [Commented] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459177#comment-13459177 ] 

Bill Graham commented on PIG-2921:
----------------------------------

I don't think we'd want to do anything too crazy on the client side here, since the proper solution would be to insert a marker into the logical/physical plan that gets executed on the client. Pig isn't set up yet to support custom use cases like this (PIG-2906 could help).

HBaseStorage could always just write HFiles which the caller then needs to bulk import. This puts the burden on the caller to know what to do. Not the greatest solution.

I think it would be worth exploring whether we can do this with PIG-1891 though. This kind of use case (or similarly doing a table swap in SQL, for example) is what I was hoping PIG-1891 would handle. The tricky bit though is that you only want *one* of the mappers or reducers to take the success action.
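
A minimal sketch of the "only one task takes the success action" idea, using an atomically created marker file as the claim. This is an illustration only: it runs on a local filesystem with threads standing in for mappers/reducers, the names `try_claim`, `run_tasks` and `_BULKLOAD_CLAIMED` are hypothetical, and it is not how Pig, PIG-1891 or HBase implement anything. It also ignores task retries and speculative execution, which a real job would have to handle.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def try_claim(marker_path):
    """Return True for exactly one caller; O_EXCL makes the create atomic."""
    try:
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another task already created the marker, so it owns the action.
        return False
    os.close(fd)
    return True


def run_tasks(num_tasks, marker_path, on_success):
    """Simulate num_tasks finishing; only the claim winner calls on_success."""
    def task(task_id):
        if try_claim(marker_path):
            on_success(task_id)  # stand-in for the single bulk load call
            return True
        return False

    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        return list(pool.map(task, range(num_tasks)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        winners = []
        marker = os.path.join(tmp, "_BULKLOAD_CLAIMED")
        results = run_tasks(8, marker, winners.append)
        print(sum(results), "task(s) ran the success action")
```

Which task wins the claim is not deterministic; the sketch only guarantees that a single one does.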

                

[jira] [Commented] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Harsh J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458726#comment-13458726 ] 

Harsh J commented on PIG-2921:
------------------------------

We could document that preparing the bulk load and bulk loading it in via HBaseStorage requires running the Pig jobs as the user HBase runs as?
                

[jira] [Assigned] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park reassigned PIG-2921:
----------------------------------

    Assignee: Cheolsoo Park
    

[jira] [Commented] (PIG-2921) Provide a bulkloadable option in HBaseStorage

Posted by "Harsh J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455665#comment-13455665 ] 

Harsh J commented on PIG-2921:
------------------------------

One problem though: the bulk load needs to be done as the user HBase runs as, AFAICT.
                