Posted to issues@hbase.apache.org by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org> on 2016/01/27 18:44:39 UTC

[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

    [ https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119865#comment-15119865 ] 

ramkrishna.s.vasudevan commented on HBASE-15171:
------------------------------------------------

Instead of iterating over the map again, can we just check the return value of map.add(kv) and, if it is false, skip adding to curSize?
The add() javadoc says:
{code}
add
public boolean add(E e)

Adds the specified element to this set if it is not already present. More formally, adds the specified element e to this set if the set contains no element e2 such that (e==null ? e2==null : e.equals(e2)). If this set already contains the element, the call leaves the set unchanged and returns false.
{code}
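
A minimal sketch of that idea, not the actual patch: it uses plain strings in place of KeyValue so it runs without HBase on the classpath. The class name DedupSizeSketch and the helpers sizeOf and addAndCount are hypothetical names made up for illustration, and sizeOf is only a stand-in for kv.heapSize(). Note that for a sorted set such as TreeSet, "already present" is decided by the set's ordering (compareTo or the supplied comparator) rather than by equals().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DedupSizeSketch {
    // Hypothetical stand-in for kv.heapSize(): here, just the string length.
    static long sizeOf(String kv) {
        return kv.length();
    }

    // Adds each element to the set and counts its size only when
    // Set.add reports that the element was not already present.
    static long addAndCount(Set<String> map, List<String> kvs) {
        long curSize = 0;
        for (String kv : kvs) {
            if (map.add(kv)) { // false means duplicate: leave curSize unchanged
                curSize += sizeOf(kv);
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        Set<String> map = new TreeSet<>();
        List<String> kvs = Arrays.asList("row1/a", "row1/a", "row1/b");
        long curSize = addAndCount(map, kvs);
        System.out.println(map.size() + " unique entries, curSize = " + curSize);
    }
}
```

This keeps the size accounting in the same pass that fills the set, so no second iteration over the map is needed.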

> Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-15171
>                 URL: https://issues.apache.org/jira/browse/HBASE-15171
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0, 1.1.2, 0.98.17
>            Reporter: Yu Li
>            Assignee: Yu Li
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during bulkload, and we found that this generated lots of small hfiles and slowed down the whole process.
> After debugging, we found that PutSortReducer#reduce already tries to handle the pathological case by setting a threshold for single-row size and using a TreeMap to avoid writing out duplicated kvs, but it forgets to exclude duplicated kvs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for loop


