Posted to user@hbase.apache.org by stack <st...@duboce.net> on 2010/01/09 18:58:12 UTC

Re: Hbase bulk import for objects with the same rowid and different columnids

Something is up here.  KVSR uses KeyValue.COMPARATOR which does:

   * Compare KeyValues.  When we compare KeyValues, we only compare the Key
   * portion.  This means two KeyValues with same Key but different Values
   * are considered the same as far as this Comparator is concerned.
   * Hosts a {@link KeyComparator}.

... where Key in the above is the
row/columnfamily/columnqualifier/timestamp/type combination.

If we're only keeping the last value added, that's odd.  It should be keeping
them all, since differing in column makes for a different key.
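
To illustrate the point about the key, here is a minimal sketch. It is not the real KeyValue.COMPARATOR: Cell, KEY_ONLY and distinctKeys are hypothetical stand-ins, strings replace the byte arrays, and the type component is left out. It only shows that a key-only comparator still treats two entries in the same row as distinct when their qualifiers differ, and as the same when only the value differs.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class KeyOrderSketch {
    // Hypothetical stand-in for a KeyValue; not the HBase class.
    static final class Cell {
        final String row, family, qualifier, value;
        final long timestamp;

        Cell(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Compares only the key portion: row, family, qualifier, then timestamp
    // (newest first). The value is never looked at.
    static final Comparator<Cell> KEY_ONLY = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    // Number of entries a TreeSet ordered by KEY_ONLY keeps.
    static int distinctKeys(Cell... cells) {
        TreeSet<Cell> set = new TreeSet<>(KEY_ONLY);
        for (Cell cell : cells) {
            set.add(cell);
        }
        return set.size();
    }

    public static void main(String[] args) {
        Cell colA = new Cell("row1", "cf", "colA", 1L, "x");
        Cell colB = new Cell("row1", "cf", "colB", 1L, "y");
        Cell colAOtherValue = new Cell("row1", "cf", "colA", 1L, "z");

        System.out.println(distinctKeys(colA, colB));           // 2: qualifiers differ
        System.out.println(distinctKeys(colA, colAOtherValue)); // 1: only the value differs
    }
}
```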

Can you send us a sample of the KeyValues that are getting conflated?
Something is wrong.

Thanks for reporting this.
St.Ack

On Sat, Jan 9, 2010 at 9:09 AM, Ioannis Konstantinou <ik...@cslab.ntua.gr> wrote:

> Hello,
>
> I am trying to bulk upload content to HBase using the instructions provided
> at
> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>
> I have a mapper that reads input and emits KeyValue objects to be fed to
> the KeyValueSortReducer. The mapper emits a number of KeyValue objects for
> each row. For the same rowid, the KeyValue objects have different columnids.
> The problem is the following: when these KeyValue objects (which have the
> same rowid but different colids in the same column family) reach the
> reducer, the TreeSet used to sort KeyValues keeps only the KeyValue that
> arrives last (it replaces all entries with the last one that reaches the
> reducer), as the KeyValue.COMPARATOR compares only the rowid!
>
> Can I use a different Comparator? Must KeyValue objects of the same rowid
> be sorted before writing them to the HFile, or does this not matter?
>
> Thank you in advance for your time.
>
>
> --
> Ioannis Konstantinou
> Research Associate, Computing Systems Laboratory
> National Technical University of Athens
> Web: http://www.cslab.ntua.gr/~ikons
>
>

Re: Hbase bulk import for objects with the same rowid and different columnids

Posted by stack <st...@duboce.net>.
Duh.  Thanks, Ioannis, for finding my dumb bug.  I made HBASE-2101 with your
suggested fix.
St.Ack

On Sat, Jan 9, 2010 at 10:31 AM, Ioannis Konstantinou <ik...@cslab.ntua.gr> wrote:

> The problem is in the class KeyValueSortReducer.
>
> When you add KeyValues to the TreeSet for sorting, you need to add KeyValue
> clones instead of just references. What happens now is that in every
> iteration, the value that exists in the TreeSet gets replaced with the new
> value.
>
> So, you need to replace line 41: map.add(kv)
> with this line: map.add(kv.clone())
>
> With this change, the TreeSet populates correctly.
>
>
> --
> Ioannis Konstantinou
> Research Associate, Computing Systems Laboratory
> National Technical University of Athens
> phone: +30 2107721544(internal 421)
> Web:http://www.cslab.ntua.gr/~ikons
>
>

Re: Hbase bulk import for objects with the same rowid and different columnids

Posted by Ioannis Konstantinou <ik...@cslab.ntua.gr>.
The problem is in the class KeyValueSortReducer.
When you add KeyValues to the TreeSet for sorting, you need to add
KeyValue clones instead of just references. What happens now is that in
every iteration, the value that exists in the TreeSet gets replaced with
the new value.

So, you need to replace line 41: map.add(kv)
with this line: map.add(kv.clone())

With this change, the TreeSet populates correctly.
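
The effect can be reproduced without HBase. The following is only a sketch under stated assumptions: Cell, copy(), BY_KEY and collect are hypothetical names, copy() plays the role of kv.clone(), and the loop imitates an iterator that hands back one instance refilled with new contents on each step, which is how Hadoop's reduce-side value iterator behaves. When the same reference is added every time, the TreeSet compares the object with itself, sees a duplicate, and ends up with a single entry holding the last contents.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class ReuseSketch {
    // Hypothetical mutable stand-in for a KeyValue; not the HBase class.
    static final class Cell {
        String row;
        String qualifier;

        void set(String row, String qualifier) {
            this.row = row;
            this.qualifier = qualifier;
        }

        // Plays the role of kv.clone(): an independent snapshot of the current contents.
        Cell copy() {
            Cell c = new Cell();
            c.set(row, qualifier);
            return c;
        }
    }

    // Orders by row, then qualifier.
    static final Comparator<Cell> BY_KEY = (a, b) -> {
        int c = a.row.compareTo(b.row);
        return c != 0 ? c : a.qualifier.compareTo(b.qualifier);
    };

    // Feeds one reused instance through the loop and returns how many
    // entries the TreeSet holds at the end.
    static int collect(String[] qualifiers, boolean copyBeforeAdd) {
        TreeSet<Cell> sorted = new TreeSet<>(BY_KEY);
        Cell reused = new Cell();
        for (String q : qualifiers) {
            reused.set("row1", q); // same object, new contents
            sorted.add(copyBeforeAdd ? reused.copy() : reused);
        }
        return sorted.size();
    }

    public static void main(String[] args) {
        String[] qualifiers = {"colA", "colB", "colC"};
        System.out.println(collect(qualifiers, false)); // 1: every add saw the same object
        System.out.println(collect(qualifiers, true));  // 3: each copy is kept
    }
}
```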


-- 
Ioannis Konstantinou
Research Associate, Computing Systems Laboratory
National Technical University of Athens
phone: +30 2107721544(internal 421)
Web:http://www.cslab.ntua.gr/~ikons