Posted to dev@hbase.apache.org by Laxman <la...@huawei.com> on 2012/03/12 16:17:34 UTC

Bulkload discards duplicates

In our test, we noticed that bulkload discards duplicates.
On further analysis, I found that duplicates are discarded only when
they exist in the same input file and in the same split.
I think this is a bug rather than intentional behavior.

The use of TreeSet in the code snippet below causes the issue.

PutSortReducer.reduce()
======================
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }

Changing this back to a List and then sorting explicitly will solve the issue.
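The mechanism can be shown with a minimal, self-contained sketch (plain Java, no HBase classes; the KV record here is a hypothetical stand-in for KeyValue, compared by key only): TreeSet silently drops an element its comparator considers equal to one already present, while a List plus an explicit sort keeps both.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DuplicateDemo {
    // Hypothetical stand-in for KeyValue; the comparator below looks at
    // the key only and ignores the value, just as KeyValue.COMPARATOR
    // ignores the value portion of a KeyValue.
    record KV(String key, String value) {}

    public static void main(String[] args) {
        Comparator<KV> byKeyOnly = Comparator.comparing(KV::key);
        List<KV> input = List.of(new KV("row1", "v1"), new KV("row1", "v2"));

        // TreeSet.add() is a no-op for an element the comparator deems
        // equal to an existing one, so the second KV is discarded.
        TreeSet<KV> set = new TreeSet<>(byKeyOnly);
        set.addAll(input);
        System.out.println("TreeSet size: " + set.size()); // prints "TreeSet size: 1"

        // A List with an explicit sort keeps both entries in order.
        List<KV> list = new ArrayList<>(input);
        list.sort(byKeyOnly);
        System.out.println("List size: " + list.size());   // prints "List size: 2"
    }
}
```

(Note that a follow-up later in this thread reports the List change alone did not fully resolve the reported problem.)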

Filed a new JIRA for this
https://issues.apache.org/jira/browse/HBASE-5564
--
Regards,
Laxman




Re: Bulkload discards duplicates

Posted by lars hofhansl <lh...@yahoo.com>.
Hi Laxman,

can you clarify what you mean by "duplicates"?
The TreeSet uses KeyValue.COMPARATOR, which treats KVs as the same only if the entire key (including column and timestamp) is the same.
Do you have KVs with the same rowKey, columnKey, and timestamp, but different values?
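To make this concrete, a toy sketch (plain Java; KV is a hypothetical stand-in, not the real KeyValue class) of a comparator that, like KeyValue.COMPARATOR, covers row, column, and timestamp but never the value, so only KVs identical in the full key collide in a TreeSet:

```java
import java.util.Comparator;
import java.util.TreeSet;

public class ComparatorDemo {
    // Hypothetical stand-in for KeyValue: row, column, timestamp, value.
    record KV(String row, String col, long ts, String value) {}

    public static void main(String[] args) {
        // Compare on the full key (row, column, timestamp) and ignore the
        // value, mirroring what KeyValue.COMPARATOR does.
        Comparator<KV> keyOnly = Comparator.comparing(KV::row)
                .thenComparing(KV::col)
                .thenComparingLong(KV::ts);
        TreeSet<KV> set = new TreeSet<>(keyOnly);

        set.add(new KV("r1", "cf:q", 100L, "v1"));
        set.add(new KV("r1", "cf:q", 200L, "v2")); // different timestamp: kept
        set.add(new KV("r1", "cf:q", 100L, "v3")); // same full key, new value: dropped

        System.out.println("size: " + set.size()); // prints "size: 2"
    }
}
```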


Thanks.

-- Lars


________________________________
 From: Laxman <la...@huawei.com>
To: dev@hbase.apache.org; user@hbase.apache.org 
Sent: Monday, March 12, 2012 8:17 AM
Subject: Bulkload discards duplicates
 
In our test, we noticed that bulkload discards duplicates.
On further analysis, I found that duplicates are discarded only when
they exist in the same input file and in the same split.
I think this is a bug rather than intentional behavior.

The use of TreeSet in the code snippet below causes the issue.

PutSortReducer.reduce()
======================
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }

Changing this back to a List and then sorting explicitly will solve the issue.

Filed a new JIRA for this
https://issues.apache.org/jira/browse/HBASE-5564
--
Regards,
Laxman

RE: Bulkload discards duplicates

Posted by Laxman <la...@huawei.com>.
Thanks for the quick response, Stack.

I tested again with the proposed patch.
> > Changing this back to a List and then sorting explicitly will solve the issue.

Still the same problem persists, making this issue a bit more complicated.

Moving further discussion to JIRA.

--
Regards,
Laxman
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Monday, March 12, 2012 8:50 PM
> To: user@hbase.apache.org; lakshman.ch@huawei.com
> Cc: dev@hbase.apache.org
> Subject: Re: Bulkload discards duplicates
> 
> On Mon, Mar 12, 2012 at 8:17 AM, Laxman <la...@huawei.com> wrote:
> > In our test, we noticed that bulkload discards duplicates.
> > On further analysis, I found that duplicates are discarded only when
> > they exist in the same input file and in the same split.
> > I think this is a bug rather than intentional behavior.
> >
> > Usage of TreeSet in the below code snippet is causing the issue.
> >
> > PutSortReducer.reduce()
> > ======================
> >      TreeSet<KeyValue> map = new
> TreeSet<KeyValue>(KeyValue.COMPARATOR);
> >      long curSize = 0;
> >      // stop at the end or the RAM threshold
> >      while (iter.hasNext() && curSize < threshold) {
> >        Put p = iter.next();
> >        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
> >          for (KeyValue kv : kvs) {
> >            map.add(kv);
> >            curSize += kv.getLength();
> >          }
> >        }
> >
> > Changing this back to a List and then sorting explicitly will solve the issue.
> >
> > Filed a new JIRA for this
> > https://issues.apache.org/jira/browse/HBASE-5564
> 
> Thank you for finding the issue and making a JIRA.
> St.Ack


Re: Bulkload discards duplicates

Posted by Stack <st...@duboce.net>.
On Mon, Mar 12, 2012 at 8:17 AM, Laxman <la...@huawei.com> wrote:
> In our test, we noticed that bulkload discards duplicates.
> On further analysis, I found that duplicates are discarded only when
> they exist in the same input file and in the same split.
> I think this is a bug rather than intentional behavior.
>
> Usage of TreeSet in the below code snippet is causing the issue.
>
> PutSortReducer.reduce()
> ======================
>      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
>      long curSize = 0;
>      // stop at the end or the RAM threshold
>      while (iter.hasNext() && curSize < threshold) {
>        Put p = iter.next();
>        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
>          for (KeyValue kv : kvs) {
>            map.add(kv);
>            curSize += kv.getLength();
>          }
>        }
>
> Changing this back to a List and then sorting explicitly will solve the issue.
>
> Filed a new JIRA for this
> https://issues.apache.org/jira/browse/HBASE-5564

Thank you for finding the issue and making a JIRA.
St.Ack
