
Posted to issues@hbase.apache.org by "Amitanand Aiyer (JIRA)" <ji...@apache.org> on 2012/08/22 00:08:38 UTC

[jira] [Created] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Amitanand Aiyer created HBASE-6630:
--------------------------------------

             Summary: Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files
                 Key: HBASE-6630
                 URL: https://issues.apache.org/jira/browse/HBASE-6630
             Project: HBase
          Issue Type: Sub-task
            Reporter: Amitanand Aiyer
            Assignee: Amitanand Aiyer
            Priority: Minor




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446409#comment-13446409 ] 

Amitanand Aiyer commented on HBASE-6630:
----------------------------------------

Posted the diff to: https://reviews.facebook.net/D5097
                
> Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>
> Currently, bulk loaded files are not assigned a sequence number, so they can only be used to import historical data. There are cases where we want to bulk load "current data", but the bulk load mechanism does not support this, because bulk loaded files are always sorted behind the non-bulk-loaded HFiles. Assigning a sequence ID to bulk loaded files should solve this issue.
> StoreFiles within a store are sorted by sequenceId. The sequenceId is a monotonically increasing number that accompanies every edit written to the WAL. For entries that update the same cell, we would like the later edit to win. This comparison is done using memstoreTS at the KV level, and sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
> BulkLoaded files are generated outside of HBase/RegionServer, so they do not have a sequenceId written in the file. As a result, HBase loses track of the point in time at which the BulkLoaded file was imported, and the resulting behavior *only* supports treating BulkLoaded files as files that back-fill data from the beginning of time.
> By assigning a sequence number to the file, we can let the bulk loaded file fit in where we want: either at the "current time" or at the "beginning of time". The latter is the default, to maintain backward compatibility.
> Design approach:
> Store files keep track of the sequenceId in the trailer. Since we do not wish to edit or rewrite the bulk loaded file upon import, we encode the assigned sequenceId into the file name. The filename regex is updated accordingly. If a sequenceId is encoded in the filename, it is used as the sequenceId for the file. If none is found, the sequenceId is taken to be 0 (the default, backward-compatible behavior).
> To let clients request the pre-existing behavior, the command line utility supports two ways of importing BulkLoaded files: with or without assigning a sequence number.
>     If a sequence number is assigned, the file is imported with the "current sequence Id".
>     If a sequence number is not assigned, the file is treated as if it were back-filling old data, from the beginning of time.
> Compaction behavior:
>     With the current compaction algorithm, bulk loaded files that back-fill data to the beginning of time can cause a compaction storm, converting every minor compaction into a major compaction. To address this, these files are excluded from minor compaction, based on a config param (enabled for the messages use case).
>     Since bulk loaded files that are not back-filling data do not cause this issue, the config parameter does not exclude them from minor compactions. This is also required to ensure that there are no holes in the set of files selected for compaction, which is necessary to preserve the order in which KVs compare before and after compaction.


[jira] [Updated] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6630:
--------------------------

    Attachment: 6590-seq-id-bulk-load.txt

Amit's patch.
                

[jira] [Commented] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451398#comment-13451398 ] 

Ted Yu commented on HBASE-6630:
-------------------------------

Integrated to trunk.

Thanks for the patch, Amit.
                

[jira] [Updated] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amitanand Aiyer updated HBASE-6630:
-----------------------------------

       Due Date: 24/Aug/12
    Description: 


Currently, bulk loaded files are not assigned a sequence number, so they can only be used to import historical data. There are cases where we want to bulk load "current data", but the bulk load mechanism does not support this, because bulk loaded files are always sorted behind the non-bulk-loaded HFiles. Assigning a sequence ID to bulk loaded files should solve this issue.

StoreFiles within a store are sorted by sequenceId. The sequenceId is a monotonically increasing number that accompanies every edit written to the WAL. For entries that update the same cell, we would like the later edit to win. This comparison is done using memstoreTS at the KV level, and sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
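
To make the ordering concrete, here is a minimal sketch of a read that prefers the file with the highest sequenceId. It is an illustration only: `FileView` and `read` are hypothetical names, not HBase classes, and the sketch ignores cell timestamps and memstoreTS, which the real read path also compares.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class StoreFileOrderSketch {
    /** Hypothetical stand-in for a store file: a sequence id plus the cells it holds. */
    static final class FileView {
        final long seqId;
        final Map<String, String> cells;

        FileView(long seqId, Map<String, String> cells) {
            this.seqId = seqId;
            this.cells = cells;
        }
    }

    /** Returns the value from the file with the highest sequence id that contains the cell. */
    static String read(List<FileView> files, String cell) {
        List<FileView> newestFirst = new ArrayList<>(files);
        newestFirst.sort(Comparator.comparingLong((FileView f) -> f.seqId).reversed());
        for (FileView f : newestFirst) {
            String value = f.cells.get(cell);
            if (value != null) {
                return value;
            }
        }
        return null; // no file holds this cell
    }

    public static void main(String[] args) {
        FileView flushed = new FileView(100L, Map.of("row1/cf:q", "flushed"));
        FileView bulkNoSeqId = new FileView(0L, Map.of("row1/cf:q", "bulk"));
        FileView bulkWithSeqId = new FileView(101L, Map.of("row1/cf:q", "bulk"));

        // Without a sequence id the bulk loaded value sorts behind the flushed one.
        System.out.println(read(List.of(flushed, bulkNoSeqId), "row1/cf:q"));   // prints "flushed"
        // With a sequence id newer than the flush, the bulk loaded value wins.
        System.out.println(read(List.of(flushed, bulkWithSeqId), "row1/cf:q")); // prints "bulk"
    }
}
```

The first call shows the behavior described above: a file treated as sequenceId 0 can never override data that was written through the WAL.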

BulkLoaded files are generated outside of HBase/RegionServer, so they do not have a sequenceId written in the file. As a result, HBase loses track of the point in time at which the BulkLoaded file was imported, and the resulting behavior *only* supports treating BulkLoaded files as files that back-fill data from the beginning of time.

By assigning a sequence number to the file, we can let the bulk loaded file fit in where we want: either at the "current time" or at the "beginning of time". The latter is the default, to maintain backward compatibility.

Design approach:
Store files keep track of the sequenceId in the trailer. Since we do not wish to edit or rewrite the bulk loaded file upon import, we encode the assigned sequenceId into the file name. The filename regex is updated accordingly. If a sequenceId is encoded in the filename, it is used as the sequenceId for the file. If none is found, the sequenceId is taken to be 0 (the default, backward-compatible behavior).
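
The filename encoding could look roughly like the sketch below. The description does not spell out the naming format, so the `_SeqId_<digits>_` suffix, the regex, and the helper names `withSeqId` and `parseSeqId` are assumptions for illustration, not the patch's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BulkLoadSeqIdSketch {
    // Assumed suffix format "<name>_SeqId_<digits>_"; the real pattern may differ.
    private static final Pattern SEQ_ID_SUFFIX = Pattern.compile("_SeqId_(\\d{1,18})_$");

    /** Builds the name a bulk loaded file would get when a sequence id is assigned. */
    static String withSeqId(String fileName, long seqId) {
        return fileName + "_SeqId_" + seqId + "_";
    }

    /** Returns the sequence id encoded in the name, or 0 when none is present. */
    static long parseSeqId(String fileName) {
        Matcher m = SEQ_ID_SUFFIX.matcher(fileName);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }

    public static void main(String[] args) {
        // No suffix: the file is treated as back-filling from the beginning of time.
        System.out.println(parseSeqId("3f2a9c"));                   // prints 0
        // Suffix present: the encoded value is used as the file's sequence id.
        System.out.println(parseSeqId(withSeqId("3f2a9c", 1234L))); // prints 1234
    }
}
```

Falling back to 0 when the suffix is absent is what keeps files loaded by older tooling on the existing, backward-compatible path.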

To let clients request the pre-existing behavior, the command line utility supports two ways of importing BulkLoaded files: with or without assigning a sequence number.

    If a sequence number is assigned, the file is imported with the "current sequence Id".
    If a sequence number is not assigned, the file is treated as if it were back-filling old data, from the beginning of time.

Compaction behavior:

    With the current compaction algorithm, bulk loaded files that back-fill data to the beginning of time can cause a compaction storm, converting every minor compaction into a major compaction. To address this, these files are excluded from minor compaction, based on a config param (enabled for the messages use case).
    Since bulk loaded files that are not back-filling data do not cause this issue, the config parameter does not exclude them from minor compactions. This is also required to ensure that there are no holes in the set of files selected for compaction, which is necessary to preserve the order in which KVs compare before and after compaction.
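
The selection rule can be sketched as follows. This is a simplified illustration under stated assumptions: `FileInfo`, `isBackfill`, `selectForMinorCompaction`, and the boolean flag standing in for the config param are all hypothetical, and the real compaction policy also weighs file sizes and counts.

```java
import java.util.ArrayList;
import java.util.List;

class MinorCompactionSelectionSketch {
    /** Hypothetical, simplified view of a store file; not the HBase StoreFile class. */
    static final class FileInfo {
        final String name;
        final long seqId;
        final boolean bulkLoaded;

        FileInfo(String name, long seqId, boolean bulkLoaded) {
            this.name = name;
            this.seqId = seqId;
            this.bulkLoaded = bulkLoaded;
        }

        /** A bulk loaded file without an assigned sequence id back-fills old data. */
        boolean isBackfill() {
            return bulkLoaded && seqId == 0L;
        }
    }

    /**
     * Picks candidate names from files already sorted by ascending sequence id.
     * Back-fill files sort first (sequence id 0), so skipping them leaves a
     * contiguous run of the remaining files with no holes.
     */
    static List<String> selectForMinorCompaction(List<FileInfo> sortedBySeqId, boolean excludeBackfill) {
        List<String> selected = new ArrayList<>();
        for (FileInfo f : sortedBySeqId) {
            if (excludeBackfill && f.isBackfill()) {
                continue; // only back-fill bulk loads are ever skipped
            }
            selected.add(f.name);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileInfo> files = List.of(
            new FileInfo("backfill", 0L, true),
            new FileInfo("flush-1", 100L, false),
            new FileInfo("bulk-current", 101L, true),
            new FileInfo("flush-2", 102L, false));

        System.out.println(selectForMinorCompaction(files, true));  // prints [flush-1, bulk-current, flush-2]
        System.out.println(selectForMinorCompaction(files, false)); // prints [backfill, flush-1, bulk-current, flush-2]
    }
}
```

Note that `bulk-current` stays in the selection even with the flag on. Dropping it would leave a gap between `flush-1` and `flush-2`, which is the hole the description warns about.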


    

[jira] [Commented] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451367#comment-13451367 ] 

Hadoop QA commented on HBASE-6630:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12544357/6630-v2.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 15 new or modified tests.

    +1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The patch appears to cause mvn compile goal to fail.

    -1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
     

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2830//testReport/
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2830//console

This message is automatically generated.
                

[jira] [Commented] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448435#comment-13448435 ] 

Hadoop QA commented on HBASE-6630:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12543348/6590-seq-id-bulk-load.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 15 new or modified tests.

    +1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

    -1 javadoc.  The javadoc tool appears to have generated 110 warning messages.

    -1 javac.  The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

    -1 findbugs.  The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
     

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2784//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2784//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2784//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2784//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2784//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2784//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2784//console

This message is automatically generated.
                

[jira] [Commented] (HBASE-6630) Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451419#comment-13451419 ] 

Hudson commented on HBASE-6630:
-------------------------------

Integrated in HBase-TRUNK #3316 (See [https://builds.apache.org/job/HBase-TRUNK/3316/])
    HBASE-6630 Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files (Amitanand) (Revision 1382351)

     Result = SUCCESS
tedyu : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java

                
> Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files
> ----------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.94.1
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 6590-seq-id-bulk-load.txt, 6630-v2.txt
>
>
> Currently, bulk loaded files are not assigned a sequence number. Thus, they can only be used to import historical data. There are cases where we want to bulk load "current data", but the bulk load mechanism does not support this, as the bulk loaded files are always sorted behind the non-bulk-loaded HFiles. Assigning a sequence Id to bulk loaded files should solve this issue.
> StoreFiles within a store are sorted based on the sequenceId. SequenceId is a monotonically increasing number that accompanies every edit written to the WAL. For entries that update the same cell, we would like the later edit to win. This comparison is accomplished using memstoreTS at the KV level, and sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
> BulkLoaded files are generated outside of HBase/RegionServer, so they do not have a sequenceId written in the file. This causes HBase to lose track of the point in time when the BulkLoaded file was imported to HBase, resulting in behavior that *only* supports viewing bulkLoaded files as files back-filling data from the beginning of time.
> By assigning a sequence number to the file, we can allow the bulk loaded file to fit in where we want: either at the "current time" or the "beginning of time". The latter is the default, to maintain backward compatibility.
> Design approach:
> Store files keep track of the sequence Id in the trailer. Since we do not wish to edit/rewrite the bulk loaded file upon import, we will encode the assigned sequenceId into the fileName. The filename RegEx is updated accordingly. If a sequenceId is encoded in the filename, it will be used as the sequenceId for the file. If none is found, the sequenceId will be considered 0 (as per the default, backward-compatible behavior).
> To enable clients to request pre-existing behavior, the command line utility allows for 2 ways to import BulkLoaded files: to assign or not assign a sequence number.
>     If a sequence number is assigned, the imported file will be imported with the "current sequence Id".
>     If the sequence number is not assigned, it will be as if it were backfilling old data, from the beginning of time.
> Compaction behavior:
>     With the current compaction algorithm, bulk loaded files that backfill data to the beginning of time can cause a compaction storm, converting every minor compaction to a major compaction. To address this, these files are excluded from minor compaction, based on a config param (enabled for the messages use case).
>     Since bulk loaded files that are not back-filling data do not cause this issue, they will not be ignored during minor compactions based on the config parameter. This is also required to ensure that there are no holes in the set of files selected for compaction, which is necessary to preserve the order of KV comparison before and after compaction.

[jira] [Work started] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HBASE-6630 started by Amitanand Aiyer.

> Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>

[jira] [Updated] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6630:
--------------------------

    Attachment: 6630-v2.txt

Patch v2 adds @param for the new parameter
                
> Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.94.1
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>         Attachments: 6590-seq-id-bulk-load.txt, 6630-v2.txt
>
>

[jira] [Commented] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446426#comment-13446426 ] 

Ted Yu commented on HBASE-6630:
-------------------------------

Just realized that there is no 'Submit Patch' button for this JIRA.
Maybe this JIRA was created while the system was under maintenance?

@Amit:
If you don't want to create a new JIRA, please run the test suite and let us know the result.

Thanks
                
> Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>         Attachments: 6590-seq-id-bulk-load.txt
>
>

[jira] [Updated] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amitanand Aiyer updated HBASE-6630:
-----------------------------------

    Affects Version/s: 0.94.1
               Status: Patch Available  (was: In Progress)
    
> Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.94.1
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>         Attachments: 6590-seq-id-bulk-load.txt
>
>

[jira] [Updated] (HBASE-6630) Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6630:
--------------------------

      Description: 

Currently, bulk loaded files are not assigned a sequence number. Thus, they can only be used to import historical data. There are cases where we want to bulk load "current data", but the bulk load mechanism does not support this, as the bulk loaded files are always sorted behind the non-bulk-loaded HFiles. Assigning a sequence Id to bulk loaded files should solve this issue.

StoreFiles within a store are sorted based on the sequenceId. SequenceId is a monotonically increasing number that accompanies every edit written to the WAL. For entries that update the same cell, we would like the later edit to win. This comparison is accomplished using memstoreTS at the KV level, and sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
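
The file-level ordering described above can be sketched roughly as follows. This is a minimal illustration, not HBase code: SeqIdOrderingSketch, FileEntry, read and demo are hypothetical names, and the sketch deliberately ignores cell timestamps and memstoreTS, assuming the competing entries carry otherwise identical keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class SeqIdOrderingSketch {
    /** Hypothetical stand-in for a store file: a sequence id plus the cells it holds. */
    static final class FileEntry {
        final String name;
        final long seqId;
        final Map<String, String> cells;

        FileEntry(String name, long seqId, Map<String, String> cells) {
            this.name = name;
            this.seqId = seqId;
            this.cells = cells;
        }
    }

    /** Returns the value held by the file with the highest sequence id that contains the cell, or null. */
    static String read(List<FileEntry> files, String cell) {
        List<FileEntry> newestFirst = new ArrayList<>(files);
        newestFirst.sort(Comparator.comparingLong((FileEntry f) -> f.seqId).reversed());
        for (FileEntry f : newestFirst) {
            String value = f.cells.get(cell);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** Reads one cell that exists both in a flushed file (seqId 10) and in a bulk loaded file. */
    static String demo(long bulkLoadSeqId) {
        FileEntry flushed = new FileEntry("flushed", 10L,
                Collections.singletonMap("row1/cf:q", "from-flush"));
        FileEntry bulk = new FileEntry("bulk", bulkLoadSeqId,
                Collections.singletonMap("row1/cf:q", "from-bulk-load"));
        return read(Arrays.asList(flushed, bulk), "row1/cf:q");
    }
}
```

With demo(0L) the bulk loaded file sorts behind the flushed file and the flushed value is returned; with demo(11L) the bulk loaded value wins.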

BulkLoaded files are generated outside of HBase/RegionServer, so they do not have a sequenceId written in the file. This causes HBase to lose track of the point in time when the BulkLoaded file was imported to HBase, resulting in behavior that *only* supports viewing bulkLoaded files as files back-filling data from the beginning of time.

By assigning a sequence number to the file, we can allow the bulk loaded file to fit in where we want: either at the "current time" or the "beginning of time". The latter is the default, to maintain backward compatibility.

Design approach:
Store files keep track of the sequence Id in the trailer. Since we do not wish to edit/rewrite the bulk loaded file upon import, we will encode the assigned sequenceId into the fileName. The filename RegEx is updated accordingly. If a sequenceId is encoded in the filename, it will be used as the sequenceId for the file. If none is found, the sequenceId will be considered 0 (as per the default, backward-compatible behavior).
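
One way to picture the filename encoding is the sketch below. The description does not spell out the filename pattern, so the "_SeqId_<digits>_" suffix, the class name FileNameSeqIdSketch, and the encode/decode helpers are assumptions made purely for illustration; only the fallback to 0 comes from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FileNameSeqIdSketch {
    // Assumed naming convention for this sketch: "<name>_SeqId_<digits>_".
    private static final Pattern SEQ_ID_SUFFIX = Pattern.compile("_SeqId_(\\d+)_$");

    /** Appends the assigned sequence id to the file name instead of rewriting the file. */
    static String encode(String baseName, long seqId) {
        if (seqId < 0) {
            throw new IllegalArgumentException("seqId must be non-negative: " + seqId);
        }
        return baseName + "_SeqId_" + seqId + "_";
    }

    /** Returns the sequence id found in the file name, or 0 when none is encoded. */
    static long decode(String fileName) {
        Matcher m = SEQ_ID_SUFFIX.matcher(fileName);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }
}
```

A name without the suffix decodes to 0, which keeps files imported before this change sorted as backfilled data.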

To enable clients to request pre-existing behavior, the command line utility allows for 2 ways to import BulkLoaded files: to assign or not assign a sequence number.

    If a sequence number is assigned, the imported file will be imported with the "current sequence Id".
    If the sequence number is not assigned, it will be as if it were backfilling old data, from the beginning of time.
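
The two import modes boil down to a single choice, sketched here under assumptions: BulkLoadModeSketch, seqIdForImport, and the assignSeqNum flag are hypothetical names, and how the region obtains its "current sequence Id" is left out.

```java
final class BulkLoadModeSketch {
    /** Sequence id used when the caller keeps the pre-existing, backfill behavior. */
    static final long BACKFILL_SEQ_ID = 0L;

    /**
     * Picks the sequence id for an imported file.
     *
     * @param assignSeqNum  true to place the file at the "current time"
     * @param currentSeqId  the region's current sequence id at import time
     */
    static long seqIdForImport(boolean assignSeqNum, long currentSeqId) {
        return assignSeqNum ? currentSeqId : BACKFILL_SEQ_ID;
    }
}
```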

Compaction behavior:

    With the current compaction algorithm, bulk loaded files that backfill data to the beginning of time can cause a compaction storm, converting every minor compaction to a major compaction. To address this, these files are excluded from minor compaction, based on a config param (enabled for the messages use case).
    Since bulk loaded files that are not back-filling data do not cause this issue, they will not be ignored during minor compactions based on the config parameter. This is also required to ensure that there are no holes in the set of files selected for compaction, which is necessary to preserve the order of KV comparison before and after compaction.
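
The selection rule above can be sketched as a filter over files already sorted by sequence id. This is only an illustration: MinorCompactionSelectionSketch, Candidate, and the excludeBackfill flag (standing in for the unnamed config param) are hypothetical, and the real compaction policy considers more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MinorCompactionSelectionSketch {
    /** Hypothetical view of a store file as seen by the selection step. */
    static final class Candidate {
        final String name;
        final long seqId;
        final boolean bulkLoaded;

        Candidate(String name, long seqId, boolean bulkLoaded) {
            this.name = name;
            this.seqId = seqId;
            this.bulkLoaded = bulkLoaded;
        }

        /** A bulk loaded file with no assigned sequence id backfills from the beginning of time. */
        boolean isBackfill() {
            return bulkLoaded && seqId == 0L;
        }
    }

    /** Keeps every file except backfill bulk loads when the exclusion flag is set. */
    static List<String> select(List<Candidate> sortedBySeqId, boolean excludeBackfill) {
        List<String> selected = new ArrayList<>();
        for (Candidate c : sortedBySeqId) {
            if (excludeBackfill && c.isBackfill()) {
                continue;
            }
            selected.add(c.name);
        }
        return selected;
    }

    static List<String> demo(boolean excludeBackfill) {
        List<Candidate> files = Arrays.asList(
                new Candidate("bulk-backfill", 0L, true),
                new Candidate("flush-a", 10L, false),
                new Candidate("bulk-current", 15L, true),
                new Candidate("flush-b", 20L, false));
        return select(files, excludeBackfill);
    }
}
```

In demo(true), bulk-current stays between flush-a and flush-b, so the selected run has no hole; only bulk-backfill, which sits at the old end of the ordering, is skipped.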



    Fix Version/s: 0.96.0
          Summary: Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files  (was: Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files)
    
> Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files
> ----------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.94.1
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 6590-seq-id-bulk-load.txt, 6630-v2.txt
>
>

[jira] [Commented] (HBASE-6630) Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451438#comment-13451438 ] 

Hudson commented on HBASE-6630:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #166 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/166/])
    HBASE-6630 Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files (Amitanand) (Revision 1382351)

     Result = FAILURE
tedyu : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java

                
> Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files
> ----------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.94.1
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 6590-seq-id-bulk-load.txt, 6630-v2.txt
>
>
> Currently bulk loaded files are not assigned a sequence number. Thus, they can only be used to import historical data. There are cases where we want to bulk load "current data", but the bulk load mechanism does not support this, as the bulk loaded files are always sorted behind the non-bulk-loaded HFiles. Assigning a sequence Id to bulk loaded files should solve this issue.
> StoreFiles within a store are sorted by sequenceId. SequenceId is a monotonically increasing number that accompanies every edit written to the WAL. For entries that update the same cell, we would like the later edit to win. This comparison is done using memstoreTS at the KV level, and sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
> BulkLoaded files are generated outside of HBase/RegionServer, so they do not have a sequenceId written in the file. HBase therefore loses track of the point in time at which the BulkLoaded file was imported, and the resulting behavior *only* supports viewing bulk loaded files as files back-filling data from the beginning of time.
> By assigning a sequence number to the file, we can allow the bulk loaded file to fit in where we want: either at the "current time" or at the "beginning of time". The latter is the default, to maintain backward compatibility.
> Design approach:
> Store files keep track of the sequence Id in the trailer. Since we do not wish to edit/rewrite the bulk loaded file upon import, we will encode the assigned sequenceId into the file name. The filename regex is updated accordingly. If a sequenceId is encoded in the filename, it will be used as the sequenceId for the file. If none is found, the sequenceId will be considered 0 (as per the default, backward-compatible behavior).
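The filename-based design above can be illustrated with a minimal sketch. It is not the code from the attached patch: the class name, the method names and the `_SeqId_<number>_` suffix are assumptions chosen for illustration, since the description only says that the sequenceId is encoded in the file name and defaults to 0 when absent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BulkLoadSeqIdSketch {
    // Hypothetical naming convention: "<hfile name>_SeqId_<number>_".
    // The real pattern is defined by the patch and is not quoted in this thread.
    private static final Pattern SEQ_ID_SUFFIX = Pattern.compile("_SeqId_(\\d+)_$");

    /** Returns the sequence id encoded in the file name, or 0 if none is present. */
    static long sequenceIdFromFileName(String fileName) {
        Matcher m = SEQ_ID_SUFFIX.matcher(fileName);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }

    /** Returns a copy sorted by ascending sequence id; ties keep their input order. */
    static List<String> sortBySequenceId(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        sorted.sort(Comparator.comparingLong(BulkLoadSeqIdSketch::sequenceIdFromFileName));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("flushed_SeqId_7_");   // stands in for a file with a known sequence id
        names.add("backfill");           // no id in the name, so treated as 0
        names.add("bulk_SeqId_3_");
        System.out.println(sortBySequenceId(names));
    }
}
```

A file without an encoded id sorts first, which corresponds to the "beginning of time" default, while a file with an assigned id takes its place among the other files according to that id.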
> To let clients request the pre-existing behavior, the command line utility supports two ways of importing bulk loaded files: with or without an assigned sequence number.
>     If a sequence number is assigned, the file is imported with the "current sequence Id".
>     If no sequence number is assigned, the file is treated as if it were back-filling old data, from the beginning of time.
> Compaction behavior:
>     With the current compaction algorithm, bulk loaded files that back-fill data to the beginning of time can cause a compaction storm, converting every minor compaction into a major compaction. To address this, these files are excluded from minor compaction, based on a config param (enabled for the messages use case).
>     Bulk loaded files that are not back-filling data do not cause this issue, so the config parameter does not exclude them from minor compactions. Including them is also required to ensure that there are no holes in the set of files selected for compaction, which is necessary to preserve the order of KV comparison before and after compaction.
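The compaction rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the selection logic in the patch: `FileInfo` is a stand-in for a store file, and the `excludeBackfill` flag stands in for the config parameter, which the description mentions but does not name.

```java
import java.util.ArrayList;
import java.util.List;

class MinorCompactionCandidatesSketch {
    /** Minimal stand-in for a store file; not the HBase StoreFile class. */
    static final class FileInfo {
        final String name;
        final long sequenceId;
        final boolean bulkLoaded;

        FileInfo(String name, long sequenceId, boolean bulkLoaded) {
            this.name = name;
            this.sequenceId = sequenceId;
            this.bulkLoaded = bulkLoaded;
        }
    }

    /**
     * Picks minor-compaction candidates from files already sorted by sequence id.
     * Only bulk loaded files without an assigned sequence id (id 0, back-filling
     * data) are skipped, and only when excludeBackfill is set. Bulk loaded files
     * that carry an assigned id stay in, so the remaining list has no holes.
     */
    static List<FileInfo> candidates(List<FileInfo> sortedBySeqId, boolean excludeBackfill) {
        List<FileInfo> out = new ArrayList<>();
        for (FileInfo f : sortedBySeqId) {
            boolean backfill = f.bulkLoaded && f.sequenceId == 0L;
            if (excludeBackfill && backfill) {
                continue; // sorts before every other file, so skipping it leaves no gap
            }
            out.add(f);
        }
        return out;
    }
}
```

Because back-fill files have sequence id 0, they sit at the front of the sorted list, so dropping them still leaves a contiguous run of files. Dropping a bulk loaded file with an assigned id would instead open a gap between its neighbours, which is the case the description rules out.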

[jira] [Updated] (HBASE-6630) Port HBASE-6590 to trunk : Assign sequence number to bulk loaded files

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6630:
--------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)
    
