Posted to issues@hbase.apache.org by "Anoop Sam John (JIRA)" <ji...@apache.org> on 2012/05/17 21:46:07 UTC

[jira] [Created] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Anoop Sam John created HBASE-6040:
-------------------------------------

             Summary: Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat
                 Key: HBASE-6040
                 URL: https://issues.apache.org/jira/browse/HBASE-6040
             Project: HBase
          Issue Type: Improvement
          Components: mapreduce
            Reporter: Anoop Sam John
            Assignee: Anoop Sam John


When data is bulk loaded using HFileOutputFormat, the block encoding and HBase-handled checksum features are not used. When the writer is created for the HFile, I do not see any such information being passed to the WriterBuilder.
In HFileOutputFormat.getNewWriter(byte[] family, Configuration conf), we do not have this information and do not pass it to the writer, so those HFiles will not have these optimizations.

Later, in LoadIncrementalHFiles.copyHFileHalf(), where we physically split one HFile (created by the MR job) if it cannot belong to just one region, I can see that we pass the data block encoding and checksum details to the new HFile writer. But I think this step will not normally happen.
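
The gap described above can be modelled in a few lines. The following is only an illustrative sketch: BulkLoadWriterSketch, its enums, FileInfo and WriterBuilder are hypothetical stand-ins written for this example and are not the HBase classes of the same or similar names. It shows how a builder that is never handed the per-family settings silently falls back to defaults.

```java
// Hypothetical model; none of these types are HBase classes.
final class BulkLoadWriterSketch {
    enum Encoding { NONE, PREFIX }
    enum Checksum { NONE, CRC32 }

    // What a finished file records about how it was written.
    static final class FileInfo {
        final Encoding encoding;
        final Checksum checksum;

        FileInfo(Encoding encoding, Checksum checksum) {
            this.encoding = encoding;
            this.checksum = checksum;
        }
    }

    // Builder whose settings stay at NONE unless a caller supplies them.
    static final class WriterBuilder {
        private Encoding encoding = Encoding.NONE;
        private Checksum checksum = Checksum.NONE;

        WriterBuilder withEncoding(Encoding e) { this.encoding = e; return this; }
        WriterBuilder withChecksum(Checksum c) { this.checksum = c; return this; }
        FileInfo build() { return new FileInfo(encoding, checksum); }
    }

    // The reported behaviour: nothing is passed, so defaults are used.
    static FileInfo writeWithoutFamilySettings() {
        return new WriterBuilder().build();
    }

    // The requested behaviour: the per-family settings reach the builder.
    static FileInfo writeWithFamilySettings(Encoding e, Checksum c) {
        return new WriterBuilder().withEncoding(e).withChecksum(c).build();
    }

    public static void main(String[] args) {
        FileInfo before = writeWithoutFamilySettings();
        FileInfo after = writeWithFamilySettings(Encoding.PREFIX, Checksum.CRC32);
        System.out.println(before.encoding + "/" + before.checksum);
        System.out.println(after.encoding + "/" + after.checksum);
    }
}
```

In this model the fix is purely a matter of plumbing: the settings already exist per column family, and the writer only needs to receive them.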


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anoop Sam John updated HBASE-6040:
----------------------------------

    Attachment: HBASE-6040_Trunk.patch

Patch for trunk
                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289346#comment-13289346 ] 

Anoop Sam John commented on HBASE-6040:
---------------------------------------

HFileDataBlockEncoder is a private interface. Can we change the signature?

HFileDataBlockEncoder#saveMetadata(StoreFile.Writer storeFileWriter)

                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287174#comment-13287174 ] 

Hudson commented on HBASE-6040:
-------------------------------

Integrated in HBase-0.94-security #33 (See [https://builds.apache.org/job/HBase-0.94-security/33/])
    HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344561)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java

                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278162#comment-13278162 ] 

Anoop Sam John commented on HBASE-6040:
---------------------------------------

Will upload a patch tomorrow. I need to test it on a cluster.
                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289633#comment-13289633 ] 

Anoop Sam John commented on HBASE-6040:
---------------------------------------

Thanks Stack. 
Created new bug HBASE-6164
                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286374#comment-13286374 ] 

Hudson commented on HBASE-6040:
-------------------------------

Integrated in HBase-TRUNK #2962 (See [https://builds.apache.org/job/HBase-TRUNK/2962/])
    HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344560)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java

                

[jira] [Updated] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6040:
-------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.96.0)
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Applied to 0.94 branch and to trunk.  Thanks for the patch Anoop.
                

[jira] [Updated] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anoop Sam John updated HBASE-6040:
----------------------------------

    Attachment: HBASE-6040_94.patch

Patch prepared for 0.94
                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286510#comment-13286510 ] 

Hudson commented on HBASE-6040:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #34 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/34/])
    HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344560)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java

                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286352#comment-13286352 ] 

Hudson commented on HBASE-6040:
-------------------------------

Integrated in HBase-0.94 #239 (See [https://builds.apache.org/job/HBase-0.94/239/])
    HBASE-6040 Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat (Revision 1344561)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.java

                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289353#comment-13289353 ] 

Anoop Sam John commented on HBASE-6040:
---------------------------------------

The HFileDataBlockEncoder interface is used mainly at the HFile.Writer level. Why are we calling saveMetadata() from StoreFile.Writer?
I feel it is better to be moved to close() in HFile.Writer.
Either way, the signature change would be needed here as well.

Note: Bloom filter handling is done entirely at the StoreFile level.
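
As a rough illustration of the suggestion above (not the actual HBase code), the sketch below has a low-level writer call the encoder's saveMetadata() from its own close(), so the encoding is recorded whether or not a store-level wrapper is involved. LowLevelWriter, BlockEncoder, PrefixEncoder and the "encoding" key are all hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model only; these are not the HBase interfaces.
final class SaveMetadataOnCloseSketch {
    // The encoder takes the low-level writer, not a store-level wrapper.
    interface BlockEncoder {
        void saveMetadata(LowLevelWriter writer);
    }

    // Low-level file writer: owns the file-info map and the encoder.
    static final class LowLevelWriter {
        private final Map<String, String> fileInfo = new LinkedHashMap<>();
        private final BlockEncoder encoder;
        private boolean closed;

        LowLevelWriter(BlockEncoder encoder) {
            this.encoder = encoder;
        }

        void appendFileInfo(String key, String value) {
            fileInfo.put(key, value);
        }

        // Metadata is saved here, so every caller gets it on close.
        Map<String, String> close() {
            if (!closed) {
                encoder.saveMetadata(this);
                closed = true;
            }
            return fileInfo;
        }
    }

    static final class PrefixEncoder implements BlockEncoder {
        @Override
        public void saveMetadata(LowLevelWriter writer) {
            writer.appendFileInfo("encoding", "PREFIX"); // hypothetical key
        }
    }

    public static void main(String[] args) {
        LowLevelWriter writer = new LowLevelWriter(new PrefixEncoder());
        System.out.println(writer.close()); // prints {encoding=PREFIX}
    }
}
```

The point of the design is that a caller which uses the low-level writer directly, as a bulk load job might, cannot forget the metadata step.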
                

[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Gopinathan A (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289334#comment-13289334 ] 

Gopinathan A commented on HBASE-6040:
-------------------------------------

Some more things need to be taken care of for block encoding in the bulk load case.

We are getting the following exception while scanning the table.
{noformat}
2012-06-05 15:39:24,771 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed openScanner
java.lang.AssertionError: Expected on-disk data block encoding NONE, got PREFIX
	at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.diskToCacheFormat(HFileDataBlockEncoderImpl.java:151)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:329)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:951)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:229)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:145)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:130)
	at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:2044)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3307)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1630)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1622)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1598)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2317)
	at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
{noformat}

It would also be better to support BloomFilter in bulk load.
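
The AssertionError in the stack trace above reports a reader that expected NONE but found PREFIX on disk. The sketch below shows one way such a mismatch could arise, assuming the reader derives its expectation from file metadata that was never written. The thread does not state the root cause, so treat this as a guess. EncodingMismatchSketch, its methods and the "encoding" key are hypothetical and are not HBase APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the reader-side check; not HBase code.
final class EncodingMismatchSketch {
    enum Encoding { NONE, PREFIX }

    // The reader's expectation comes from file metadata; absent means NONE.
    static Encoding expectedFromFileInfo(Map<String, String> fileInfo) {
        String recorded = fileInfo.get("encoding"); // hypothetical key
        return recorded == null ? Encoding.NONE : Encoding.valueOf(recorded);
    }

    // Rejects a block whose on-disk encoding differs from the expectation.
    static void checkBlock(Encoding expectedOnDisk, Encoding actualOnDisk) {
        if (expectedOnDisk != actualOnDisk) {
            throw new IllegalStateException("Expected on-disk data block encoding "
                + expectedOnDisk + ", got " + actualOnDisk);
        }
    }

    public static void main(String[] args) {
        // Blocks were written as PREFIX, but the metadata was never recorded.
        Map<String, String> fileInfo = new HashMap<>();
        try {
            checkBlock(expectedFromFileInfo(fileInfo), Encoding.PREFIX);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Once the metadata is present, the same block passes the check.
        fileInfo.put("encoding", "PREFIX");
        checkBlock(expectedFromFileInfo(fileInfo), Encoding.PREFIX);
    }
}
```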
                

[jira] [Updated] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anoop Sam John updated HBASE-6040:
----------------------------------

        Fix Version/s: 0.94.1
                       0.96.0
    Affects Version/s: 0.96.0
                       0.94.0
               Status: Patch Available  (was: Open)
    
> Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6040
>                 URL: https://issues.apache.org/jira/browse/HBASE-6040
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 0.94.0, 0.96.0
>            Reporter: Anoop Sam John
>            Assignee: Anoop Sam John
>             Fix For: 0.96.0, 0.94.1
>
>         Attachments: HBASE-6040_94.patch, HBASE-6040_Trunk.patch
>
>
> When the data is bulk loaded using HFileOutputFormat, we are not using the block encoding and the HBase handled checksum features..  When the writer is created for making the HFile, I am not seeing any such info passing to the WriterBuilder.
> In HFileOutputFormat.getNewWriter(byte[] family, Configuration conf), we dont have these info and do not pass also to the writer... So those HFiles will not have these optimizations..
> Later in LoadIncrementalHFiles.copyHFileHalf(), where we physically divide one HFile(created by the MR) iff it can not belong to just one region, I can see we pass the datablock encoding details and checksum details to the new HFile writer. But this step wont happen normally I think..


[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289625#comment-13289625 ] 

stack commented on HBASE-6040:
------------------------------

I'd suggest making a new one at this stage, Anoop; this one has been closed for more than a day or so.  Thanks.

bq. I feel it is better to be moved to close() in HFile.Writer

Ok.

StoreFile is a wrapper around hfile to add 'hbase' stuff (we have tried to keep hfile 'pure', unpolluted by hbase-isms... I don't think we succeeded but that was the idea).
                


[jira] [Updated] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anoop Sam John updated HBASE-6040:
----------------------------------

    Release Note: 
Added a new config param "hbase.mapreduce.hfileoutputformat.datablock.encoding" that specifies which encoding scheme to use on disk. During bulk load, data is written into HFiles using this encoding scheme. The value can be NONE, PREFIX, DIFF, or FAST_DIFF, as these are the DataBlockEncoding types currently supported. [When new types are added later, the corresponding names will also become valid.]
The checksum type and the number of bytes per checksum can be configured using the config params hbase.hstore.checksum.algorithm and hbase.hstore.bytes.per.checksum, respectively.
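
As a rough illustration of the release note above, the following sketch gathers the three settings into a plain map of job properties. The property names and the four encoding names come from the release note; the class BulkLoadEncodingConfig, its build() method, and the validation it performs are hypothetical and not part of HBase. It uses only the Java standard library, so it stands in for, rather than reproduces, a real job setup.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HBase: collects the bulk-load settings
// named in the release note into a plain map of job properties.
class BulkLoadEncodingConfig {
    // Property names taken from the release note.
    static final String ENCODING_KEY =
        "hbase.mapreduce.hfileoutputformat.datablock.encoding";
    static final String CHECKSUM_ALGO_KEY = "hbase.hstore.checksum.algorithm";
    static final String BYTES_PER_CHECKSUM_KEY = "hbase.hstore.bytes.per.checksum";

    // Encoding names the release note lists as valid.
    static final List<String> VALID_ENCODINGS =
        Arrays.asList("NONE", "PREFIX", "DIFF", "FAST_DIFF");

    // Validates the inputs and returns the properties in insertion order.
    static Map<String, String> build(String encoding, String checksumAlgo,
                                     int bytesPerChecksum) {
        if (!VALID_ENCODINGS.contains(encoding)) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding);
        }
        if (bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("bytesPerChecksum must be positive");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(ENCODING_KEY, encoding);
        props.put(CHECKSUM_ALGO_KEY, checksumAlgo);
        props.put(BYTES_PER_CHECKSUM_KEY, Integer.toString(bytesPerChecksum));
        return props;
    }
}
```

A call such as BulkLoadEncodingConfig.build("FAST_DIFF", "CRC32", 16384) uses assumed example values for the checksum settings. In an actual MapReduce job, each entry would presumably be copied into the job's Hadoop Configuration with conf.set(key, value) before the job is submitted.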

    


[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Zhihong Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284226#comment-13284226 ] 

Zhihong Yu commented on HBASE-6040:
-----------------------------------

TestHFileOutputFormat passed with the patch.
+1 from me.

Running the patch through the full test suite would be desirable.
                


[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289338#comment-13289338 ] 

ramkrishna.s.vasudevan commented on HBASE-6040:
-----------------------------------------------

HBASE-3776, which covers bloom filter support in bulk load, is still open.
                


[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286300#comment-13286300 ] 

Hadoop QA commented on HBASE-6040:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12530320/HBASE-6040_Trunk.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2069//testReport/
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2069//console

This message is automatically generated.
                


[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286100#comment-13286100 ] 

stack commented on HBASE-6040:
------------------------------

+1

Anoop, can you make a trunk patch and submit it to hadoopqa?
                


[jira] [Closed] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl closed HBASE-6040.
--------------------------------

    


[jira] [Commented] (HBASE-6040) Use block encoding and HBase handled checksum verification in bulk loading using HFileOutputFormat

Posted by "Anoop Sam John (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289341#comment-13289341 ] 

Anoop Sam John commented on HBASE-6040:
---------------------------------------

Oh, sorry, I missed that part.
It concerns HFileDataBlockEncoder#saveMetadata(StoreFile.Writer storeFileWriter).
In bulk load we deal with the HFile writer directly, not through StoreFile.Writer.
saveMetadata() is called only from StoreFile.Writer#close().
saveMetadata() only writes the encoder type into the file info.


We might need to write this file info explicitly from HFileOutputFormat.
I am not sure why HFileDataBlockEncoder#saveMetadata(StoreFile.Writer storeFileWriter) takes a StoreFile.Writer rather than an HFile.Writer.
The other methods in this interface deal with HFile or HFile blocks.

@Stack, shall I reopen this JIRA?

Thanks, Gopi, for noticing and raising this.  We will track the point about bloom filter usage in another ticket (if that is OK).
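
The gap described in this comment can be sketched with a toy model. None of the classes below are HBase classes: ToyHFileWriter and ToyStoreFileWriter are hypothetical stand-ins for HFile.Writer and StoreFile.Writer, and the file-info key name "DATA_BLOCK_ENCODING" is an assumed label chosen for illustration. The model only shows why closing the inner writer directly skips the metadata step, and what writing the file info explicitly would change.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for an HFile-level writer: it only keeps a file-info map.
class ToyHFileWriter {
    final Map<String, String> fileInfo = new LinkedHashMap<>();

    void appendFileInfo(String key, String value) {
        fileInfo.put(key, value);
    }

    void close() {
        // Closing the raw writer adds nothing to fileInfo.
    }
}

// Toy stand-in for the StoreFile-level wrapper: its close() is the only
// place where the encoder type is saved (the saveMetadata() step).
class ToyStoreFileWriter {
    final ToyHFileWriter inner = new ToyHFileWriter();
    final String encoding;

    ToyStoreFileWriter(String encoding) {
        this.encoding = encoding;
    }

    void close() {
        inner.appendFileInfo("DATA_BLOCK_ENCODING", encoding);
        inner.close();
    }
}

class BulkLoadMetadataDemo {
    // Regular path in this model: the wrapper saves the encoder type on close.
    static Map<String, String> viaStoreFileWriter(String encoding) {
        ToyStoreFileWriter w = new ToyStoreFileWriter(encoding);
        w.close();
        return w.inner.fileInfo;
    }

    // Bulk-load path as described in the comment: the raw writer is used
    // directly, so the encoder type never reaches fileInfo.
    static Map<String, String> viaRawWriter() {
        ToyHFileWriter w = new ToyHFileWriter();
        w.close();
        return w.fileInfo;
    }

    // Suggested remedy in this model: write the file info explicitly
    // before closing the raw writer.
    static Map<String, String> viaRawWriterWithExplicitInfo(String encoding) {
        ToyHFileWriter w = new ToyHFileWriter();
        w.appendFileInfo("DATA_BLOCK_ENCODING", encoding);
        w.close();
        return w.fileInfo;
    }
}
```

In this model, viaRawWriter() returns an empty map while the other two paths record the encoding, which mirrors the concern that a bulk-loaded file could carry encoded blocks without the file info that identifies the encoder.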
                
