You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Schubert Zhang (JIRA)" <ji...@apache.org> on 2009/09/07 17:00:57 UTC

[jira] Created: (HBASE-1818) HFile code review and refinement

HFile code review and refinement
--------------------------------

                 Key: HBASE-1818
                 URL: https://issues.apache.org/jira/browse/HBASE-1818
             Project: Hadoop HBase
          Issue Type: Improvement
          Components: io
    Affects Versions: 0.20.0
            Reporter: Schubert Zhang
            Assignee: Schubert Zhang
            Priority: Minor
             Fix For: 0.20.1


HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754354#action_12754354 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

@ryan
1. Your description of "return null" is good, i.e. null means this meta block no exist.
     Accept you comment.  
     Maybe we can throw another Exception here? If you think it is not necessary, I will remove this added code.

2.  In the HFile.java, I found the block index is based on the first key in the block (not last key).
     Not only treading HFile as a part of HBase, in fact, we want HFile can be a common file format which can be used in other applications. And in fact, I like to support duplicate keys in an HFile, since my application use HFile directly to store data. But  when I checked the code, I found the risk to add duplicate keys. e.g. 
      - block 1:  firstKey=A,  lastKey=B,  indexKey=A
      - block 2:  firstKey=B, lastKey=C,  indexKey=B
When seek key=B, we go into block 2, and miss the lastKey=B in block 1.

Yes, you are right, if the index of data block is last key instead of  first key. it seems fine:
      - block 1:  firstKey=A, lastKey=B, indexKey=B 
      - block 2:  firstKey=B, lastKey=C, indexKey=C
When seek key=B, we go into block 1. The Scanner.seekTo() will find key=B in block 1 from the firstKey of block 1, and the Scanner.next() will not miss the key=B in block 2.

But I double checked the HFile code, the block index is really firstKey now.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752898#action_12752898 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

@stack,

Thanks, I removed my modification at this point. and  new patch is attached.

I have done a simple verfication test use HBase PerformanceEvaluation tool. It seem the removal of  buf.compact() does not improve performance.

@Ryan,

could you please share your added code?

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Attachment: HFile-v3.patch

I upload the v3 patch, which only include bug fix.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Reopened: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reopened HBASE-1818:
--------------------------


Reopening.  Tests fail, the TestHFile in particular.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Attachment: HFile-v2.patch

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752953#action_12752953 ] 

ryan rawson commented on HBASE-1818:
------------------------------------

as it turns out, buf.compact() isnt actually copying the buffer, it's just adjusting some of the internal variables and pointers. Lets leave it as is.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754327#action_12754327 ] 

stack commented on HBASE-1818:
------------------------------

Ryan held my hand and explained that while not at memstore flush time, it could happen on a compaction -- makes sense.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Fix Version/s:     (was: 0.20.1)
                   0.20.0
           Status: Patch Available  (was: Open)

The previous comments are not so readable. 
Please try the patch.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.0
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753403#action_12753403 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

@ryan
1. Regards duplicate keys in a HFile. I am following concern:
    If we allow duplicate keys. Consider following scenario:
    A key="abcd" is append in block1's last key/value pair.
    And the the same key="abcd" is append in block2's first key/value pair. Then in the block index, the key="abcd" will point to block2.
    Then, we want to scan from key="abcd", but the first  key="abcd" (in block1's last) will be missed out.
     Can you confirm this scenario is acceptable or required?

2. + if (buf == null)
    + return null;

This check is only added in getMetaBlock(...). In this method, there are three points to return null.
(1)  if (trailer.metaIndexCount == 0) {
         return null; // there are no meta blocks
       }
(2) if (block == -1)
         return null;
(3) if (buf == null)   //new added by me
        return null; 
  If we do not check it, the following buf.get(..) may NPE. because the decompress() method will not throw exception. Do you mean NPE is better than "return null" which same as above two?
  In fact, it is diffcult to make above trade-off for me, maybe I am doing the way as C++.  

3. Regards buf.compact().
    Yes, you may be right. After more test about performance, my patch does not improve the performance (I don't know if it can improve in some other environments.) I agree to remove this modification in my patch to retain the neat of the returned block buffer (position at 0).

@stack and ryan
Thanks for your test. I will change the patch according to you comments. To include only bug fixing.
If the test fail, please just revert to old version.
Please give me comments about my above questions, then I can make active immediately. Thanks.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752885#action_12752885 ] 

ryan rawson commented on HBASE-1818:
------------------------------------

stack, I am nearly 100% sure that code block I added to solve a NPE.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Attachment: HFile-v1.patch

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.0
>
>         Attachments: HFile-v1.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754321#action_12754321 ] 

ryan rawson commented on HBASE-1818:
------------------------------------

So if compress() returns null (which it should not), some kind of error occurred.  Such as DFS out to lunch, etc, etc.  Returning 'null' - which is interpreted as 'no such meta block' is really ruining the API contract here. Just let it NPE.  Better than converting a IO error into a logical error as your code would do.

As for the duplicate key, the block index is based on the last key in the block. So if we have two duplicate keys that straddle a block boundary, the scenario is like so:

- last key of block X is key 'A' and that is entered in the index
- first key of block X+1 is also key 'A', but first key is not part of the index

Scanning will start at block X, and we will see both.

The only scenario where we could potentially have an issue is if we have multiple duplicate keys so they span more than 1 block thus causing 2 index entries to have the same key. The binary search algorithm will need to choose the first entry to maintain correct behaviour on scan, and I am not 100% sure if this is what will happen. But even so, this is very rare, since in HBase keys are distinguished with timestamps, it would require either many keys with the same TS or a few really large ones with the same TS. Something to consider testing for.

More importantly, if we disallow duplicate keys in hfile, that will cause huge problems. Right now there is no other mechanism to prevent duplicate KeyValues from being inserted - its your own lunch if you put multiple values at the same Timestamp.  But this change would cause compactions to throw and prevent them from completing. A far worse scenario.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1818:
-------------------------

       Resolution: Fixed
    Fix Version/s: 0.20.1
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Applied to branch and trunk (v5 is bug fixes and a small optimization).

Thanks for the patch Schubert.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1, 0.21.0
>
>         Attachments: HFile-v3.patch, HFile-v4.patch, HFile-v5.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755701#action_12755701 ] 

ryan rawson commented on HBASE-1818:
------------------------------------

in reference to the block index, yes the scenario with duplicate keys that span a block boundary exists. 

It's possible that we could fix these holes with a different write strategy which doesnt create invalid hfiles like the one you theorized above. Another scenario is when you could have duplicate key entries in the index, which could cause problems with the binary search algorithm.

There is 2 potential fixes here:
- fix binary search algorithm to actually find the _lower bound_ in face of duplicates.
- prevent hfiles like the one indicated above from being created, in this case by extending block 1 larger than the default sizing until we get a different key.

there might be other solutions too.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch, HFile-v4.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755844#action_12755844 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

Thanks stack for create a new issue (hbase-1841)

Regards the  2 potential fixes:
    - fix binary search algorithm to actually find the lower bound in face of duplicates.
      I think maybe we need to change to use lastkey as the block index?

    - prevent hfiles like the one indicated above from being created, in this case by extending block 1 larger than the default sizing until we get a different key.
      In fact, we used this way in one of our old product, i.e. only start new block/index at the boundary of different key. In this case, we should ensure the number of the duplicated keys not too large (that will lead big block).

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch, HFile-v4.patch, HFile-v5.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755465#action_12755465 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

@ Ryan and Stack,
Could you please make time to give me a reply about above two question? Then I will go ahead.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754325#action_12754325 ] 

stack commented on HBASE-1818:
------------------------------

I'm wondering how we'd ever write duplicate keys to an hfile?  Wouldn't the most recent overwrite any older keys in memstore, a TreeMap?

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Attachment:     (was: HFile-v1.patch)

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Attachment: HFile-v5.patch

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch, HFile-v4.patch, HFile-v5.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Attachment: HFile-v4.patch

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch, HFile-v4.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753296#action_12753296 ] 

stack commented on HBASE-1818:
------------------------------

@Schubert Is it possible that in the scheme of things, the compact arraycopy is too miniscule to show in overall PE numbers?

The v2 patch looks good.  It does not remove the section flagged above (that Ryan thinks might protect against a NPE). 

I'm moving this issue out of 0.20.1 and will apply to TRUNK.  Its an improvement.  Lets try and do bug fixes only in the branch.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1818:
-------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to TRUNK.  All tests pass. Thanks for the patch Schubert.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1818:
-------------------------

    Fix Version/s:     (was: 0.20.1)
                   0.21.0

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752966#action_12752966 ] 

ryan rawson commented on HBASE-1818:
------------------------------------

you were advocating removing:

-      if (trailer.metaIndexCount == 0) {
-        return null; // there are no meta blocks
-      }


i seem to recall adding that to solve some npe or something.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Attachment:     (was: HFile-v2.patch)

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755685#action_12755685 ] 

stack commented on HBASE-1818:
------------------------------

Ok on 1. and 2 (I opened new issue to account for the case you discovered in your review where its possible we miss a key if many instances and they span a block -- hbase-1841).

What about this change:

{code}
@@ -860,7 +859,7 @@
       if (trailer.metaIndexCount == 0) {
         return null; // there are no meta blocks
       }
-      if (metaIndex == null) {
+      if ((metaIndex == null) || (metaIndex.count == 0)) {
         throw new IOException("Meta index not loaded");
       }
       byte [] mbname = Bytes.toBytes(metaBlockName);

{code}

Is it possible that metaIndex is loaded but empty?

Thanks Schubert

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch, HFile-v4.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Status: Patch Available  (was: Open)

The previous comments are not so readable.
Please try the patch.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.0
>
>         Attachments: HFile-v1.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1818:
---------------------------------

    Fix Version/s:     (was: 0.20.0)
                   0.20.1

Moving to 0.20.1

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Status: Open  (was: Patch Available)

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.0
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752965#action_12752965 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

public ByteBuffer compact() {
        System.arraycopy(hb, ix(position()), hb, ix(0), remaining());
        position(remaining());
        limit(capacity());
        return this;
    }

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755520#action_12755520 ] 

stack commented on HBASE-1818:
------------------------------

On hfile as a general format, do think that will work Schubert?  It has stuff like KeyValue imports.  Is that OK?

Ryan, what you think of index being first rather than last key in block?

On the exception in decompress, https://issues.apache.org/jira/browse/HBASE-1809 is an example.  Somehow we're using the HFile after a close.   Maybe another exception would be good here... at least debugging.  Could emit state of the HFile.Reader.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753327#action_12753327 ] 

ryan rawson commented on HBASE-1818:
------------------------------------

Here are some comments from the svn commit:

        if (this.comparator.compare(this.lastKeyBuffer, this.lastKeyOffset,
-            this.lastKeyLength, key, offset, length) > 0) {
+            this.lastKeyLength, key, offset, length) >= 0) {
          throw new IOException("Added a key not lexically larger than" +
            " previous key=" + Bytes.toString(key, offset, length) +
            ", lastkey=" + Bytes.toString(this.lastKeyBuffer, this.lastKeyOffset,

This change has the _Exact opposite_ effect from the intent.  The intent (as I construct it) is to allow duplicate keys in a HFile, yet by adding >= you are explicitly requiring the next key to be larger, but NOT equal to the previous key.  This has to stay as is.

+      if (buf == null)
+       return null;

This is pointless, we'll NPE a few lines later anyways. Don't add checking code that doesnt have any benefits - the caller may not be interpret a null return as 'error' in this case, did you check that?

-      // Toss the header. May have to remove later due to performance.
-      buf.compact();
-      buf.limit(buf.limit() - METABLOCKMAGIC.length);
-      buf.rewind();

I am not super happy with this change... it doesnt affect performance, but it does reduce code simplicity, and leaves open room for errors further on, and requires everyone to realize that the buf has this hidden header which isnt part of the public API of HFile.



> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HBASE-1818:
----------------------------------

    Status: Patch Available  (was: Reopened)

@stack
Thanks for you reply. So I will make following decision to close this issue.

1. Do not check the returned buf  of decompress() and let NPE is throwed.
     I have thought to throw IOException, but it seems it is not a RunTimeException.

2. Leave the HFile can accept duplicated keys. But I think that will let the scenario of Scanner undefined.
    The block index uses firstKey of the block.

So, finally, the patch only include some minor fix.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch, HFile-v4.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752963#action_12752963 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

I am puzzled......

In the class HeapByteBuffer:

{quote}
    public ByteBuffer compact() {
	System.arraycopy(hb, ix(position()), hb, ix(0), remaining());
	position(remaining());
	limit(capacity());
	return this;
    }
{quote} 

We can see there is a arraycopy here. But I am puzzled why the performance is no improvement? 


@ryan,

{quote}
I am nearly 100% sure that code block I added to solve a NPE. 
{quote}
Could you please tell me about your added code block to solve the NPE? I am interested in it.

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch, HFile-v2.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752656#action_12752656 ] 

stack commented on HBASE-1818:
------------------------------

This looks good Schubert.

Is removing the below safe?

{code}
-      if (trailer.metaIndexCount == 0) {
-        return null; // there are no meta blocks
-      }
{code}

Or are you removing code that is redundant -- if metaIndex == null, the above is null?

Do all tests pass?

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HFile-v1.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752169#action_12752169 ] 

Schubert Zhang commented on HBASE-1818:
---------------------------------------

Code review comments:

1. HFile.Writer

(1) private int totalBytes = 0;  ==> private long totalBytes = 0;
      int is too small (2GB) to limit the size of uncompressed data.
 
(2) private void finishBlock()
      use int for variable "size".

(3) private void checkKey(final byte [] key, final int offset, final int length)
      to reject to append duplicated key.

    if (this.comparator.compare(this.lastKeyBuffer, this.lastKeyOffset,
            this.lastKeyLength, key, offset, length) > 0)

   should be:

   if (this.comparator.compare(this.lastKeyBuffer, this.lastKeyOffset,
            this.lastKeyLength, key, offset, length) >= 0)

(4) private long writeFileInfo(FSDataOutputStream o)
      
      int avgValueLen = this.entryCount == 0? 0:
        (int)(this.keylength/this.entryCount);

     should be:

           int avgValueLen = this.entryCount == 0? 0:
        (int)(this.valuelength/this.entryCount);

2. HFile.Reader

(1) public ByteBuffer getMetaBlock(String metaBlockName)
     it's better to use:
      if (metaIndex.count <= 0) {
          return null; // there are no meta blocks
      }

     then:

      if (trailer.metaIndexCount == 0) {
        return null; // there are no meta blocks
      }

      and add:
      if (buf == null)
    	return null;
      since decompress() may return null.
      
      remove following to avoud block copy, and improve performance. But to to pay attention, the current position of the buf is at METABLOCKMAGIC.length (not 0):

      // Toss the header. May have to remove later due to performance.
      buf.compact();
      buf.limit(buf.limit() - METABLOCKMAGIC.length);
      buf.rewind();


(2) ByteBuffer readBlock(int block, boolean cacheBlock)
       boundary bug.
       if (block < 0 || block > blockIndex.count) {
        throw new IOException("Requested block is out of range: " + block +
          ", max: " + blockIndex.count);
      }

     should be:

      if (block < 0 || block >= blockIndex.count) {
        throw new IOException("Requested block is out of range: " + block +
          ", max: " + blockIndex.count);
      }

      add:
       if (buf == null) {
          throw new IOException("Decompress block failure " + block);
        }
       since decompress() may return null.
   
      remove following to avoud block copy, and improve performance. But to to pay attention, the current position of the buf is at METABLOCKMAGIC.length (not 0), and the cached block also have this position.

      // Toss the header. May have to remove later due to performance.
      buf.compact();
      buf.limit(buf.limit() - METABLOCKMAGIC.length);
      buf.rewind();
       
(3)   public boolean seekTo()
    
        if (block != null && currBlock == 0) {
          block.rewind();
          block.position(DATABLOCKMAGIC.length);
          currKeyLen = block.getInt();
          currValueLen = block.getInt();
          return true;
        }
        above code add block.position(DATABLOCKMAGIC.length); since we remove buf.compact previously.
        add return ture, I guess the code miss it.

(4)          private void loadBlock(int bloc) throws IOException {
        if (block == null) {
          block = reader.readBlock(bloc, cacheBlocks);
          currBlock = bloc;
          blockFetches++;
        } else {
          if (bloc != currBlock) {
            block = reader.readBlock(bloc, cacheBlocks);
            currBlock = bloc;
            blockFetches++;
          } else {
            // we are already in the same block, just rewind to seek again.
            block.rewind();
            block.position(DATABLOCKMAGIC.length);  // add this code since we remove buf.compact previously.
          }
        }
      }
    }


     


> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1818) HFile code review and refinement

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754308#action_12754308 ] 

stack commented on HBASE-1818:
------------------------------

I buy the argument preventing duplicate keys.

Patch looks good to me.  +1 on commiting (TestHFile passes).

@Ryan, what you think?

> HFile code review and refinement
> --------------------------------
>
>                 Key: HBASE-1818
>                 URL: https://issues.apache.org/jira/browse/HBASE-1818
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HFile-v3.patch
>
>
> HFile is a good mimic of Google's SSTable file format. And we want HFile to become a common file format of hadoop in the near future.
> We will review the code of HFile and record the comments here, and then provide fixed patch after the review.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.