Posted to issues@hbase.apache.org by "Anoop Sam John (Jira)" <ji...@apache.org> on 2019/08/22 06:29:00 UTC
[jira] [Commented] (HBASE-22887) HFileOutputFormat2 split a lot of HFile by roll once per rowkey

    [ https://issues.apache.org/jira/browse/HBASE-22887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913013#comment-16913013 ] 

Anoop Sam John commented on HBASE-22887:
----------------------------------------

Q1 -> Yes, that is very much OK. In fact, a flush itself can already create files this way: we can select which CFs to flush when a region is flushed, so one CF can get flushed while another still stays in memory.
Q2 -> "same rowkey with same family, comes to HFileOutputFormat2.write". We cannot say it won't, so the row check can still be there.
{code}
if (wl != null && wl.written + length >= maxsize) {
  this.rollRequested = true;
}

// This can only happen once a row is finished though
if (rollRequested && Bytes.compareTo(this.previousRow, rowKey) != 0) {
  rollWriters(wl);
}
{code}
Thinking... why do we need to keep the boolean? If we really do roll, we reset it, so it only matters for the next cell iteration. But the size check is there on that iteration too, and at that time wl.written + length >= maxsize will still be true. So the boolean-based set and check is not needed here at all. Without it, the check applies only to that CF's file, meaning that one CF's file reaching the size limit will not push any other CF to roll.
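To make the difference concrete, here is a minimal sketch that simulates both variants without any HBase dependency. It is not the HFileOutputFormat2 code: RollSimulation, Writer, sharedFlag and perFamily are hypothetical names, rows and families are plain strings, and the per-family variant assumes the size check is combined with a per-family previous-row map like the one proposed in the issue description. The input (10 rows, a 1-byte cell in F1 and a 30-byte cell in F2 per row, maxsize 100) is made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class RollSimulation {
    /** Hypothetical stand-in for the per-family writer state. */
    static final class Writer {
        boolean open;
        long written; // bytes appended to the current file
        int files;    // files opened so far
    }

    private static void close(Writer w) {
        w.open = false;
        w.written = 0;
    }

    /** Opens a new file for the family if none is open, then appends the cell. */
    private static void append(Map<String, Writer> writers, String family, long length) {
        Writer w = writers.computeIfAbsent(family, f -> new Writer());
        if (!w.open) {
            w.open = true;
            w.files++;
        }
        w.written += length;
    }

    private static Map<String, Integer> fileCounts(Map<String, Writer> writers) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        writers.forEach((family, w) -> counts.put(family, w.files));
        return counts;
    }

    /** Shared rollRequested flag and shared previousRow, mirroring the snippet above. */
    public static Map<String, Integer> sharedFlag(int rows, String[] families,
                                                  long[] lengths, long maxsize) {
        Map<String, Writer> writers = new LinkedHashMap<>();
        boolean rollRequested = false;
        String previousRow = "";
        for (int r = 0; r < rows; r++) {
            String rowKey = "row" + r;
            for (int f = 0; f < families.length; f++) {
                Writer wl = writers.get(families[f]);
                if (wl != null && wl.written + lengths[f] >= maxsize) {
                    rollRequested = true;
                }
                // The flag may have been set by a different family's writer.
                if (rollRequested && !previousRow.equals(rowKey)) {
                    if (wl != null) {
                        close(wl);
                    } else {
                        writers.values().forEach(RollSimulation::close);
                    }
                    rollRequested = false;
                }
                append(writers, families[f], lengths[f]);
                previousRow = rowKey;
            }
        }
        return fileCounts(writers);
    }

    /** No boolean: size check and row check are both evaluated per family. */
    public static Map<String, Integer> perFamily(int rows, String[] families,
                                                 long[] lengths, long maxsize) {
        Map<String, Writer> writers = new LinkedHashMap<>();
        Map<String, String> previousRows = new HashMap<>();
        for (int r = 0; r < rows; r++) {
            String rowKey = "row" + r;
            for (int f = 0; f < families.length; f++) {
                Writer wl = writers.get(families[f]);
                String previousRow = previousRows.getOrDefault(families[f], "");
                if (wl != null && wl.written + lengths[f] >= maxsize
                        && !previousRow.equals(rowKey)) {
                    close(wl);
                }
                append(writers, families[f], lengths[f]);
                previousRows.put(families[f], rowKey);
            }
        }
        return fileCounts(writers);
    }

    public static void main(String[] args) {
        String[] families = {"F1", "F2"};
        long[] lengths = {1, 30};
        System.out.println(sharedFlag(10, families, lengths, 100)); // prints {F1=7, F2=1}
        System.out.println(perFamily(10, families, lengths, 100));  // prints {F1=1, F2=4}
    }
}
```

With the shared flag, the small family F1 is rolled on every new row once F2 crosses the limit, while F2 itself never rolls. With the per-family check, only F2 rolls, and only when its own file is full.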

> HFileOutputFormat2 split a lot of HFile by roll once per rowkey
> ---------------------------------------------------------------
>
>                 Key: HBASE-22887
>                 URL: https://issues.apache.org/jira/browse/HBASE-22887
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 2.0.0
>         Environment: HBase 2.0.0
>            Reporter: langdamao
>            Priority: Major
>
> When I use HFileOutputFormat2 in an MR job to build HFiles, the reducer creates lots of files.
> Here is the log:
> {code:java}
> 2019-08-16 14:42:51,988 INFO [main] org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2: Writer=hdfs://hfile/_temporary/1/_temporary/attempt_1558444096078_519332_r_000016_0/F1/06f3b0e9f0644ee782b7cf4469f44a70, wrote=893827310 Writer=hdfs://hfile/_temporary/1/_temporary/attempt_1558444096078_519332_r_000016_0/F1/1454ea148f1547499209a266ad25387f, wrote=61 Writer=hdfs://hfile/_temporary/1/_temporary/attempt_1558444096078_519332_r_000016_0/F1/9d35446634154b4ca4be56f361b57c8b, wrote=55 
> ...  {code}
> It keeps writing a new file every time a new rowkey comes.
> I then added more logging and found the problem. The code is here: [GitHub|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L289]
> {code:java}
> if (wl != null && wl.written + length >= maxsize) {
>   this.rollRequested = true;
> }
> // This can only happen once a row is finished though
> if (rollRequested && Bytes.compareTo(this.previousRow, rowKey) != 0) {
>   rollWriters(wl);
> }{code}
> In my case, I have two families, F1 and F2, and the writer of F2 reaches maxsize, so rollRequested becomes true, but its rowkey is the same as previousRow, so the writer is not rolled. When the next rowkey comes with family F1, both rollRequested and Bytes.compareTo(this.previousRow, rowKey) != 0 are true, so the writer of F1 is rolled and a new HFile is created. Then the same rowkey comes with family F2 and sets rollRequested to true again, and when the next rowkey comes with family F1, the writer of F1 is rolled again.
> So it creates a new HFile for every rowkey with family F1, and F2 is never rolled until the job ends.
>  
> Here are my questions and some possible solutions:
> Q1. First, does HBase 2.0.0 support different families of the same HBase table having different rowkey cuts? That is, rowkeyA may be written to the first HFile of F1 but to the second HFile of F2. HBase 1.x.x doesn't support this, so there we roll all the writers and don't get this problem. I guess the answer is "Yes, supported", so we go to Q2.
> Q2. Do we allow the same rowkey with the same family to come to HFileOutputFormat2.write?
> If not, can we fix it this way, since this rowKey will never be the same as previousRow?
> {code:java}
>  if (wl != null && wl.written + length >= maxsize) { 
>       rollWriters(wl);
>  }{code}
> If yes, do we need a Map to record previousRow per family?
> {code:java}
> private final Map<byte[], byte[]> previousRows =
>         new TreeMap<>(Bytes.BYTES_COMPARATOR);
> if (wl != null && wl.written + length >= maxsize && Bytes.compareTo(this.previousRows.get(family), rowKey) != 0) { 
>      rollWriters(wl); 
> }{code}


