Posted to issues@hbase.apache.org by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org> on 2011/06/09 02:54:58 UTC

[jira] [Created] (HBASE-3967) Add support to HFileOutputFormat based bulk imports to add Delete mutations

Add support to HFileOutputFormat based bulk imports to add Delete mutations
---------------------------------------------------------------------------

                 Key: HBASE-3967
                 URL: https://issues.apache.org/jira/browse/HBASE-3967
             Project: HBase
          Issue Type: Improvement
            Reporter: Kannan Muthukkaruppan


During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 

For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Bogdan-Alexandru Matican (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149386#comment-13149386 ] 

Bogdan-Alexandru Matican commented on HBASE-3967:
-------------------------------------------------

@Kannan:
You are right, this diff does not include the follow-up fix for tie-breaking - I just checked through it.
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Assigned] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Nicolas Spiegelberg (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Spiegelberg reassigned HBASE-3967:
------------------------------------------

    Assignee:     (was: Bogdan-Alexandru Matican)
    
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Updated] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-3967:
---------------------------------

    Priority: Critical  (was: Blocker)
    
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Updated] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3967:
-------------------------

         Priority: Blocker  (was: Major)
    Fix Version/s: 0.96.0

Making blocker on 0.96 at Nicolas's suggestion
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Blocker
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050050#comment-13050050 ] 

stack commented on HBASE-3967:
------------------------------

@Bogdan and @Kannan, pardon me, I misread Bogdan's pasting of code to be a check on KV over in hbase; I missed that he was talking about a framework check of the passed Key class.  Scratch my suggested removal of the class check.

@Bogdan, shouldn't you be working in the mapreduce package, rather than the mapred package?

> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160417#comment-13160417 ] 

Nicolas Spiegelberg commented on HBASE-3967:
--------------------------------------------

Patch available in 89-fb, need to port to trunk (SVN: 15334 & 16195)
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235648#comment-13235648 ] 

Lars Hofhansl commented on HBASE-3967:
--------------------------------------

Please comment today on why this is a blocker for 0.94. Otherwise I'll move this out of 0.94.
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217957#comment-13217957 ] 

Lars Hofhansl commented on HBASE-3967:
--------------------------------------

In HBASE-5440 I solved this by having the mapper output KVs and using KeyValueSortReducer.

                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Updated] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Bogdan-Alexandru Matican (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bogdan-Alexandru Matican updated HBASE-3967:
--------------------------------------------

    Attachment: diff.patch

The Row modifications are made for our internal branch since we didn't have the latest ones...

> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Bogdan-Alexandru Matican (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049645#comment-13049645 ] 

Bogdan-Alexandru Matican commented on HBASE-3967:
-------------------------------------------------

Hello and thank you!

From what I gathered by looking through the Hadoop code, the MapTask class will try to get serializers for the respective classes based on their actual .class field. This basically means that even for classes that would fail the check, the serialization process should still be OK afterwards if we take the check out.

I back-traced it through the code:

* SerializationFactory gets configuration from job 
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/MapTask.java#794

* SerializationFactory gathers the proper Serialization from the conf (currently only for the WritableSerialization) 
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/io/serializer/SerializationFactory.java#50

* WritableSerialization delivers proper Serializer and Deserializer that will work well on the underlying objects as it calls readFields and write on the respective Writable object that they hold 
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/io/serializer/WritableSerialization.java#89
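
The lookup chain traced above can be pictured with a toy model. This is only a sketch and not Hadoop code: every `Toy*` name is hypothetical, and the real Hadoop interfaces carry more methods than shown here. It illustrates the one point that matters for the argument, namely that the factory picks a serialization from the runtime class of the object, so two different concrete classes resolve to the same serializer as long as both implement the common interface.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a Writable-style marker interface (hypothetical name).
interface ToyWritable {}

// Toy stand-in for a serialization plug-in that can accept or reject a class.
interface ToySerialization {
    boolean accept(Class<?> c);
    String name();
}

// Accepts any class that implements ToyWritable, whatever its concrete type.
final class ToyWritableSerialization implements ToySerialization {
    public boolean accept(Class<?> c) { return ToyWritable.class.isAssignableFrom(c); }
    public String name() { return "writable"; }
}

final class ToySerializationFactory {
    private final List<ToySerialization> serializations = new ArrayList<>();

    void register(ToySerialization s) { serializations.add(s); }

    // Returns the first registered serialization that accepts the class, or null.
    ToySerialization getSerialization(Class<?> c) {
        for (ToySerialization s : serializations) {
            if (s.accept(c)) return s;
        }
        return null;
    }
}

// Two unrelated concrete value classes sharing only the marker interface.
final class ToyPut implements ToyWritable {}
final class ToyDelete implements ToyWritable {}

final class SerializationLookupSketch {
    // Looks up a serialization by the value's runtime class, not a declared class.
    static String lookup(Object value) {
        ToySerializationFactory factory = new ToySerializationFactory();
        factory.register(new ToyWritableSerialization());
        ToySerialization s = factory.getSerialization(value.getClass());
        return s == null ? "none" : s.name();
    }

    public static void main(String[] args) {
        System.out.println(lookup(new ToyPut()));
        System.out.println(lookup(new ToyDelete()));
        System.out.println(lookup("plain string"));
    }
}
```

In this model a declared-class check in front of the lookup would be the only thing that rejects a mixed stream of `ToyPut` and `ToyDelete` values; the lookup itself handles both.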

I am posting this to make sure I have it right, as I haven't tested it by actually modifying the Hadoop code myself and trying it. (I should probably check out, modify, and build the code to test it...) 

This will also ensure backwards compatibility, as anything previously written cannot possibly break. Also, as with your note of Postel's Law, this will increase the number of use cases that get accepted, while not causing any potential problems.

Currently, for this respective Put+Delete case, I finished the initial implementation with the union thing and it works, but if this works and removing those two checks would make the MR code more effective in general, then probably that should change too :)


> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234504#comment-13234504 ] 

Lars Hofhansl commented on HBASE-3967:
--------------------------------------

Is anybody working on this? I don't think this should hold up the 0.94 release.
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Updated] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-3967:
---------------------------------

    Fix Version/s:     (was: 0.94.0)
                   0.96.0

Moving out of 0.94. Pull back if you disagree.
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Blocker
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270572#comment-13270572 ] 

Kannan Muthukkaruppan commented on HBASE-3967:
----------------------------------------------

@Lars: Good point. Not sure if KeyValueSortReducer is new. I'll look more into this, and see if there are any differences between the two approaches.


                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Kannan Muthukkaruppan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149384#comment-13149384 ] 

Kannan Muthukkaruppan commented on HBASE-3967:
----------------------------------------------

Nicolas:
We need to check whether Bogdan updated the patch with the latest fixes. On the internal branch, I recall it needed one follow-up commit to fix an issue with resolving tie-breaks when multiple entries had the same "rowkey"/"columnkey"/"ts".
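
To make the tie-break problem concrete, here is a minimal sketch of one possible ordering rule. The thread does not describe what the follow-up commit actually did, so the rule below is an assumption: when row, column, and timestamp are all equal, the delete sorts ahead of the put. `ToyCell` and `TieBreakSketch` are hypothetical names, not HBase classes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy cell; the fields and the isDelete flag are illustrative assumptions.
final class ToyCell {
    final String row;
    final String column;
    final long ts;
    final boolean isDelete;

    ToyCell(String row, String column, long ts, boolean isDelete) {
        this.row = row;
        this.column = column;
        this.ts = ts;
        this.isDelete = isDelete;
    }
}

final class TieBreakSketch {
    // Order: row ascending, column ascending, timestamp descending (newest first),
    // and, as the assumed tie-break, deletes before puts.
    static int compare(ToyCell a, ToyCell b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.column.compareTo(b.column);
        if (c != 0) return c;
        c = Long.compare(b.ts, a.ts);
        if (c != 0) return c;
        return Boolean.compare(b.isDelete, a.isDelete);
    }

    // Returns a sorted copy so the caller's list is left untouched.
    static List<ToyCell> sorted(List<ToyCell> in) {
        List<ToyCell> copy = new ArrayList<>(in);
        copy.sort(TieBreakSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<ToyCell> cells = new ArrayList<>();
        cells.add(new ToyCell("r1", "c1", 5L, false));
        cells.add(new ToyCell("r1", "c1", 5L, true));
        cells.add(new ToyCell("r1", "c1", 9L, false));
        for (ToyCell c : sorted(cells)) {
            System.out.println(c.row + "/" + c.column + "/" + c.ts + (c.isDelete ? "/Delete" : "/Put"));
        }
    }
}
```

Without the final comparison, a put and a delete with identical coordinates compare as equal, and their relative order in the output file depends on arrival order.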
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264447#comment-13264447 ] 

Lars Hofhansl commented on HBASE-3967:
--------------------------------------

In that case I am a bit confused about the point of this jira.
A mapper can already produce KeyValues and there already is KeyValueSortReducer.

Also 0.92+ has Mutation (as parent to both Put and Delete), which could be used for the same (with a new reducer of course).
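
A rough sketch of what the Mutation-based variant might look like, under stated assumptions: the `Toy*` classes below are hypothetical stand-ins rather than the HBase `Mutation`, `Put`, and `Delete` classes, and cells are modelled as plain `column/type` strings so that natural string order stands in for the real cell ordering. The point is only that a shared parent type lets one mapper output value class carry both kinds of change to a single reducer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical shared parent type for both kinds of change to a row.
abstract class ToyMutation {
    final String row;

    ToyMutation(String row) { this.row = row; }

    // Each mutation expands to sortable cell strings of the form "column/type".
    abstract List<String> cells();
}

final class ToyPutMutation extends ToyMutation {
    private final String column;

    ToyPutMutation(String row, String column) { super(row); this.column = column; }

    List<String> cells() { return Collections.singletonList(column + "/Put"); }
}

final class ToyDeleteMutation extends ToyMutation {
    private final String column;

    ToyDeleteMutation(String row, String column) { super(row); this.column = column; }

    List<String> cells() { return Collections.singletonList(column + "/Delete"); }
}

final class MutationReducerSketch {
    // Reduce step for one row: flatten every mutation into cells and emit them sorted.
    static List<String> reduce(List<ToyMutation> mutationsForRow) {
        TreeSet<String> sorted = new TreeSet<>();
        for (ToyMutation m : mutationsForRow) {
            sorted.addAll(m.cells());
        }
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        List<ToyMutation> forRow = new ArrayList<>();
        forRow.add(new ToyPutMutation("r1", "b"));
        forRow.add(new ToyDeleteMutation("r1", "a"));
        forRow.add(new ToyPutMutation("r1", "a"));
        System.out.println(reduce(forRow));
    }
}
```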

                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Updated] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3967:
-------------------------

    Fix Version/s:     (was: 0.96.0)
                   0.94.0

Misread.  N said 0.94 blocker.
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Blocker
>             Fix For: 0.94.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Updated] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kannan Muthukkaruppan updated HBASE-3967:
-----------------------------------------

    Summary: Support deletes in HFileOutputFormat based bulk import mechanism  (was: Add support to HFileOutputFormat based bulk imports to add Delete mutations)

> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258599#comment-13258599 ] 

Lars Hofhansl commented on HBASE-3967:
--------------------------------------

HFileOutputFormat just stores KeyValues so it already handles Deletes. The question is: How do you get Deletes output from a Mapper?

One solution (the one I took in HBASE-5440) is to have Mapper emit KeyValues and then use KeyValueSortReducer in the reduce phase.
Another approach is this jira, which is what Facebook has internally (as far as I know).

Yet another approach could be to use Mutation (which is new in 0.92) and write a new SortReducer.

I think the point of this jira is to get the Facebook approach into 0.94/trunk in order to make upgrading more palatable for Facebook.
Kannan, correct me if I am wrong.
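
For readers unfamiliar with the first approach, the following is a simplified model of the map-then-sort-reduce flow, written without any HBase dependency. It is a sketch under assumptions: `SortReducerSketch` is a hypothetical name, rows and cells are plain strings, and string order stands in for the real cell comparator. The mapper emits (row, cell) pairs in any order; the reduce step sorts the cells of each row before they would be written out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class SortReducerSketch {
    // "Shuffle": group mapper output pairs {row, cell} by row, with rows in sorted order.
    static Map<String, List<String>> groupByRow(List<String[]> mapperOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapperOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // "Reduce": sort the cells of one row; a TreeSet also drops exact duplicates.
    static List<String> reduce(List<String> cellsForRow) {
        return new ArrayList<>(new TreeSet<>(cellsForRow));
    }

    // Runs both steps and returns "row:cell" lines in the final output order.
    static List<String> run(List<String[]> mapperOutput) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groupByRow(mapperOutput).entrySet()) {
            for (String cell : reduce(e.getValue())) {
                out.add(e.getKey() + ":" + cell);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        mapped.add(new String[] {"r2", "q1"});
        mapped.add(new String[] {"r1", "q2"});
        mapped.add(new String[] {"r1", "q1"});
        System.out.println(run(mapped));
    }
}
```

Because the reducer only sees generic cells, it does not care whether a given cell represents a put or a delete marker, which is why this route needs no change to the output format itself.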
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149350#comment-13149350 ] 

stack commented on HBASE-3967:
------------------------------

@Nicolas None
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264254#comment-13264254 ] 

Kannan Muthukkaruppan commented on HBASE-3967:
----------------------------------------------

Here are the relevant patches:

1) Original fix:
http://svn.apache.org/viewvc?view=revision&revision=1181568

2) Follow-up fix for tie-breaking when duplicate mutations with the same RowKey/ColKey/TS are put or deleted. We want to make sure the "later" one deterministically wins:
http://svn.apache.org/viewvc?view=revision&revision=1181589

Any volunteers to port this to trunk?
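
A minimal sketch of the tie-breaking idea, not the code from the revisions linked above: when several mutations share the same row, column, and timestamp, only the one that was emitted last is kept. The Mutation class and the assumption that the reducer sees mutations in emission order are hypothetical; how the real fix establishes "later" is not described in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TieBreakSketch {
    // Hypothetical mutation record; isDelete distinguishes a Delete from a Put.
    static final class Mutation {
        final String row;
        final String column;
        final long timestamp;
        final boolean isDelete;
        final String value;

        Mutation(String row, String column, long timestamp, boolean isDelete, String value) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
            this.isDelete = isDelete;
            this.value = value;
        }

        // String key for brevity; assumes "/" does not occur in row or column.
        String coordinate() {
            return row + "/" + column + "/" + timestamp;
        }
    }

    // Input is in emission order. For each (row, column, timestamp)
    // coordinate, a later mutation overwrites an earlier one.
    static List<Mutation> resolve(List<Mutation> inEmissionOrder) {
        Map<String, Mutation> winners = new TreeMap<>();
        for (Mutation m : inEmissionOrder) {
            winners.put(m.coordinate(), m);
        }
        return new ArrayList<>(winners.values());
    }
}
```

A put followed by a delete at the same coordinate resolves to the delete, and a delete followed by a put resolves to the put, regardless of how the two would otherwise sort.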


                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050014#comment-13050014 ] 

Kannan Muthukkaruppan commented on HBASE-3967:
----------------------------------------------

Stack: <<If you removed the check for KV>> -- just making sure: did you mean removing the separate checks for the Key class and the Value class? It might still be a good idea to check that it is at least a subclass of the expected class, no? Or does that check already happen downstream?
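
The difference between an exact-class check and the "at least a subclass" check suggested here can be sketched in plain Java. RowLike, PutLike, and DeleteLike are hypothetical stand-ins for the HBase client types, and neither method is taken from HFileOutputFormat; the sketch only contrasts the two kinds of comparison.

```java
final class TypeCheckSketch {
    // Hypothetical stand-ins for the types discussed in the thread.
    interface RowLike {}
    static final class PutLike implements RowLike {}
    static final class DeleteLike implements RowLike {}

    // Strict check: the runtime class must be identical to the expected one.
    static boolean exactMatch(Class<?> expected, Object value) {
        return value.getClass() == expected;
    }

    // Relaxed check: any subclass or implementation of the expected type passes.
    static boolean subtypeMatch(Class<?> expected, Object value) {
        return expected.isAssignableFrom(value.getClass());
    }
}
```

With RowLike as the expected type, a PutLike instance fails exactMatch but passes subtypeMatch, while a DeleteLike instance still fails both checks when PutLike is the expected type.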

> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148636#comment-13148636 ] 

Nicolas Spiegelberg commented on HBASE-3967:
--------------------------------------------

This jira seems to have been stalled with a working patch in tow.  Any reason?  We're currently using this for our Messaging migrations at scale.  

@stack: Any problems with a rebase, then commit?
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Nicholas Telford (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258369#comment-13258369 ] 

Nicholas Telford commented on HBASE-3967:
-----------------------------------------

Anyone know what the status of this issue is? It looks to have been completed in the latest patch (or perhaps even in HBASE-5440, not sure what Lars meant there) but not reviewed/accepted?
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264252#comment-13264252 ] 

Kannan Muthukkaruppan commented on HBASE-3967:
----------------------------------------------

The point of the JIRA was really just to provide a way to bulk import delete mutations in addition to put mutations. We solved this on the 89-fb branch by introducing a RowMutation (which extends Row) whose constructor can take a "Put" or a "Delete", and by using a RowMutationSortReducer (a variant of PutSortReducer that handles both Deletes and Puts). I will dig up the commit revs on the 89-fb branch and try to post the links shortly for you to take a look. Unless there are any technical objections, we should just port the same approach to trunk.
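
The shape of such a reducer can be sketched as follows. This is not the 89-fb code: RowMutationLike and OutCell are hypothetical stand-ins, and the only idea carried over is that the reducer collects every cell for one row, whether it came from a put or a delete, into a sorted set and then emits the cells in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class MutationSortSketch {
    enum Kind { PUT, DELETE }

    // Hypothetical stand-in for the wrapper: one concrete type that
    // carries either a put or a delete.
    static final class RowMutationLike {
        final Kind kind;
        final String column;
        final long timestamp;

        RowMutationLike(Kind kind, String column, long timestamp) {
            this.kind = kind;
            this.column = column;
            this.timestamp = timestamp;
        }
    }

    // One output cell. Sort key: column ascending, newer timestamp first,
    // and at the same timestamp a delete before a put.
    static final class OutCell implements Comparable<OutCell> {
        final String column;
        final long timestamp;
        final Kind kind;

        OutCell(String column, long timestamp, Kind kind) {
            this.column = column;
            this.timestamp = timestamp;
            this.kind = kind;
        }

        @Override
        public int compareTo(OutCell o) {
            int c = column.compareTo(o.column);
            if (c != 0) return c;
            c = Long.compare(o.timestamp, timestamp);
            if (c != 0) return c;
            return o.kind.compareTo(kind); // DELETE has the higher ordinal, so it sorts first
        }

        @Override
        public String toString() {
            return column + "@" + timestamp + ":" + kind;
        }
    }

    // Collect all cells of one row into a sorted set, then emit them in order.
    static List<String> reduce(String row, List<RowMutationLike> mutations) {
        TreeSet<OutCell> sorted = new TreeSet<>();
        for (RowMutationLike m : mutations) {
            sorted.add(new OutCell(m.column, m.timestamp, m.kind));
        }
        List<String> out = new ArrayList<>();
        for (OutCell c : sorted) {
            out.add(row + "/" + c);
        }
        return out;
    }
}
```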
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Updated] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Nicolas Spiegelberg (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Spiegelberg updated HBASE-3967:
---------------------------------------

    Issue Type: Sub-task  (was: Improvement)
        Parent: HBASE-4907
    
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Assigned] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Nicolas Spiegelberg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Spiegelberg reassigned HBASE-3967:
------------------------------------------

    Assignee: Bogdan-Alexandru Matican

Assigning to Bogdan, one of our interns here.

> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444274#comment-13444274 ] 

Lars Hofhansl commented on HBASE-3967:
--------------------------------------

@Kannan: Are you still working on this? Maybe we can close it.
                
> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: diff.patch
>
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049573#comment-13049573 ] 

stack commented on HBASE-3967:
------------------------------

Hey Bogdan (welcome!)


If you removed the check for KV, would that make your life easier? (http://en.wikipedia.org/wiki/Jon_Postel#Postel.27s_Law)

> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Bogdan-Alexandru Matican
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.


[jira] [Commented] (HBASE-3967) Support deletes in HFileOutputFormat based bulk import mechanism

Posted by "Bogdan-Alexandru Matican (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049511#comment-13049511 ] 

Bogdan-Alexandru Matican commented on HBASE-3967:
-------------------------------------------------

Ok, so I think I've managed to make this work. However, I couldn't simply abstract up and use Row directly as the mapper output due to the following set of lines in "org.apache.hadoop.mapred.MapTask"

    if (key.getClass() != keyClass) {
      throw new IOException("Type mismatch in key from map: expected "
                            + keyClass.getName() + ", recieved "
                            + key.getClass().getName());
    }

and the corresponding check for the value.

This meant that even if I tried to pass a Put or a Delete as Rows when writing to the map context, it would fail at this check. As such, I just created an abstraction that acts as a union for _either_ a Put or a Delete and can be built off of either.
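
A minimal sketch of that workaround, under stated assumptions: PutLike, DeleteLike, RowLike, and PutOrDelete are hypothetical stand-ins, not the classes in the attached patch. Because the wrapper is a single concrete class, declaring it as the map output value class satisfies an exact-class comparison of the kind quoted above, whether it holds a put or a delete.

```java
import java.util.Objects;

final class UnionSketch {
    interface RowLike {}

    static final class PutLike implements RowLike {
        final String row;
        PutLike(String row) { this.row = row; }
    }

    static final class DeleteLike implements RowLike {
        final String row;
        DeleteLike(String row) { this.row = row; }
    }

    // Hypothetical union type: exactly one of put / delete is non-null.
    static final class PutOrDelete {
        final PutLike put;
        final DeleteLike delete;

        PutOrDelete(PutLike put) {
            this.put = Objects.requireNonNull(put);
            this.delete = null;
        }

        PutOrDelete(DeleteLike delete) {
            this.put = null;
            this.delete = Objects.requireNonNull(delete);
        }

        boolean isDelete() { return delete != null; }

        String row() { return isDelete() ? delete.row : put.row; }
    }

    // Same strict comparison as the MapTask snippet, returned as a boolean
    // instead of throwing so the outcome is easy to inspect.
    static boolean typeMatches(Class<?> declaredClass, Object value) {
        return value.getClass() == declaredClass;
    }
}
```

Declaring RowLike as the output class and emitting a PutLike fails the comparison, while declaring PutOrDelete and emitting a wrapped put or delete passes it. The reducer can then branch on isDelete() to decide which kind of cell to write.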

> Support deletes in HFileOutputFormat based bulk import mechanism
> ----------------------------------------------------------------
>
>                 Key: HBASE-3967
>                 URL: https://issues.apache.org/jira/browse/HBASE-3967
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>
> During bulk imports, it'll be useful to be able to do delete mutations (either to delete data that already exists in HBase or was inserted earlier during this run of the import). 
> For example, we have a use case, where we are processing a log of data which may have both inserts and deletes in the mix and we want to upload that into HBase using the bulk import mechanism.
