Posted to user@hbase.apache.org by er...@yahoo.com on 2011/02/21 09:34:15 UTC

using composite index for a duplicate check validation

Hi,

I am currently building a multi-tenant ERP-like application capable of handling
billions of transaction lines.  I am using HBase 0.90 and wrote an end-to-end
initial POC to test the performance characteristics.  Here is my end-to-end use
case:
1. load a sizable transaction submission file (csv).  The file has about 150
attributes and is typically about 2,000,000 lines long
2. validate the file (resolve a bunch of attributes such as product, members, 
dates, amounts, against an effective-dated base data set)
3. flag each line as a duplicate of {list of line ids} if it has the same
values for a user-defined set of columns.
4. run some calculations against these lines filtered by some user-defined 
filters

I have a cluster of 10 basic machines (1 used as master and 9 as slaves).

So here is what I am doing (I have about 10,000,000 lines in the entity_table at
this point):
1. load the file into an HBase table called 'entity_table' using a mapper, passing it a
file format definition object that understands how to parse the file
2. index the file by creating a new table called 'entity_dup_index' where the
row key = entity.itemId+"_"+entity.memberId+"_"+entity.invoiceDate, with a single
column family into which I add the entity.key as both qualifier and value.
3. run the validation step: the dup check calls entity_dup_index.get(index_key),
loops over all key/value pairs, and removes all keys that are >= entity.key, to
ensure that a line is flagged as a dup only of previously loaded lines (see the
sketch below).
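
To make steps 2 and 3 concrete, here is a rough sketch of the index key and the
dup filter.  The Entity accessors, the "dup" column family name and the helper
names are simplified placeholders for this mail, not my actual classes (it uses
org.apache.hadoop.hbase.client.Result and org.apache.hadoop.hbase.util.Bytes):

// Rough sketch only -- Entity and its getters are placeholders.
String getCompositeIndexKey(Entity entity) {
  return entity.getItemId() + "_" + entity.getMemberId() + "_" + entity.getInvoiceDate();
}

// Given the Result of entity_dup_index.get(indexKey), keep only the keys of
// previously loaded lines, i.e. index entries strictly smaller than this
// entity's own key.
List<byte[]> previousDuplicates(Result indexRow, byte[] entityKey) {
  List<byte[]> dups = new ArrayList<byte[]>();
  NavigableMap<byte[], byte[]> entries = indexRow.getFamilyMap(Bytes.toBytes("dup"));
  if (entries != null) {
    for (byte[] qualifier : entries.keySet()) {
      if (Bytes.compareTo(qualifier, entityKey) < 0) {
        dups.add(qualifier);
      }
    }
  }
  return dups;
}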

The question that I have is about performance:
1. Loading a 2,000,000 line file into HBase takes about 15 mins
2. indexing 2,000,000 lines takes about 3 mins (indexing 6,000,000 takes about 6 
mins)
3. running the dup check on 2,000,000 takes over 1 hour.

      
public void map(ImmutableBytesWritable row, Result result, Context context)
    throws IOException, InterruptedException {
  // reset previous validation results for this row
  Delete delete = new Delete(row.get());
  delete.deleteFamily(valFam.getBytes());
  Put put = new Put(row.get());
  // ***********************************
  String key = getCompositeIndexKey(result);
  // indexTable is an HTable field, initialized in setConf()
  Get get = new Get(key.getBytes());
  Result rr = indexTable.get(get);  // random read against entity_dup_index
  // loop over all KeyValues of rr and build the duplicate flags
  put.add(valFam.getBytes(), ...);
  // ***********************************
  context.write(tableName, delete);
  context.write(tableName, put);
}

The indexTable.get(get) call is the culprit!  When I comment out this code, the
validation runs in under 15 mins.  Would you have any ideas on how I could improve
the composite index lookup or structure my algorithm differently to get better
performance?

Thanks a lot for your help,
-Eric

Re: using composite index for a duplicate check validation

Posted by er...@yahoo.com.
Thanks a lot Jean-Daniel,

I will try disabling the cache to see if I get a performance improvement.  I was 
not aware of the parallel scan.  I will look into that.
Thanks,
-Eric




________________________________
From: Jean-Daniel Cryans <jd...@apache.org>
To: user@hbase.apache.org
Sent: Wed, February 23, 2011 1:20:59 AM
Subject: Re: using composite index for a duplicate check validation

A Get is a random read, so expect it to be slower than let's say a
scanner or a random insert (the other calls that are made in your
code). Unless you are able to keep all that data in the block cache of
the region servers, those calls are going to be expensive.

A change that would be very easy to make would be to disable block caching
for the data read by the scans (done by the maps) by calling
setCacheBlocks(false) on the Scan object you're passing to
TableMapReduceUtil. This will make it so that the scanning won't trash
the block cache, but my understanding of your use case is that the
reads will all be done only once, so caching them wouldn't really help
either...

Maybe you could also consider running a parallel scan on the index
table instead of random getting from it, but it depends on how the
index was constructed. Maybe you can come up with a scan that makes
more sense.

Hope that helps,

J-D

On Mon, Feb 21, 2011 at 12:34 AM,  <er...@yahoo.com> wrote:
> Hi,
>
> I am currently building a multi-tenant ERP-like application capable of handling
> billions of transaction lines.  I am using HBase 0.90 and wrote an end-to-end
> initial POC to test the performance characteristics.  Here is my end-to-end use
> case:
> 1. load a sizable transaction submission file (csv).  The file has about 150
> attributes and is typically about 2,000,000 lines long
> 2. validate the file (resolve a bunch of attributes such as product, members,
> dates, amounts, against an effective-dated base data set)
> 3. flag each line as a duplicate of {list of line ids} if it has the same
> values for a user-defined set of columns.
> 4. run some calculations against these lines filtered by some user-defined
> filters
>
> I have a cluster of 10 basic machines (1 used as master and 9 as slaves).
>
> So here is what I am doing (I have about 10,000,000 lines in the entity_table at
> this point):
> 1. load the file into an HBase table called 'entity_table' using a mapper, passing it a
> file format definition object that understands how to parse the file
> 2. index the file by creating a new table called 'entity_dup_index' where the
> row key = entity.itemId+"_"+entity.memberId+"_"+entity.invoiceDate, with a single
> column family into which I add the entity.key as both qualifier and value.
> 3. run the validation step: the dup check calls entity_dup_index.get(index_key),
> loops over all key/value pairs, and removes all keys that are >= entity.key, to
> ensure that a line is flagged as a dup only of previously loaded lines.
>
> The question that I have is about performance:
> 1. Loading a 2,000,000 line file into HBase takes about 15 mins
> 2. indexing 2,000,000 lines takes about 3 mins (indexing 6,000,000 takes about 6
> mins)
> 3. running the dup check on 2,000,000 takes over 1 hour.
>
>
> public void map(ImmutableBytesWritable row, Result result, Context context)
> throws IOException, InterruptedException {
> // reset validations
> Delete delete = new Delete(row.get());
> delete.deleteFamily(valFam.getBytes());
> Put put = new Put(row.get());
> // ***********************************
> String key = getCompositeIndexKey(result);
> // indexTable is an HTable field, initialized in setConf()
> Get get = new Get(key.getBytes());
> Result rr = indexTable.get(get);
> // loop over all KeyValues of rr
> put.add(valFam.getBytes(), ...);
> // ***********************************
> context.write(tableName, delete);
> context.write(tableName, put);
> }
>
> The indexTable.get(get) call is the culprit!  When I comment out this code, the
> validation runs in under 15 mins.  Would you have any ideas on how I could improve
> the composite index lookup or structure my algorithm differently to get better
> performance?
>
> Thanks a lot for your help,
> -Eric

Re: using composite index for a duplicate check validation

Posted by Jean-Daniel Cryans <jd...@apache.org>.
A Get is a random read, so expect it to be slower than let's say a
scanner or a random insert (the other calls that are made in your
code). Unless you are able to keep all that data in the block cache of
the region servers, those calls are going to be expensive.

A change that would be very easy to make would be to disable block caching
for the data read by the scans (done by the maps) by calling
setCacheBlocks(false) on the Scan object you're passing to
TableMapReduceUtil. This will make it so that the scanning won't trash
the block cache, but my understanding of your use case is that the
reads will all be done only once, so caching them wouldn't really help
either...
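
Something along these lines when you set up the job -- just a sketch, the
mapper and job variable names are placeholders, adjust to whatever you
already have:

// Configure the scan over entity_table so the job's one-time reads
// don't evict hot data from the region servers' block cache.
Scan scan = new Scan();
scan.setCacheBlocks(false);  // don't fill the block cache with these reads
scan.setCaching(500);        // rows fetched per RPC, tune to your row size

// 'job' is your already-created Job; DupCheckMapper is whatever your
// validation mapper class is called.
TableMapReduceUtil.initTableMapperJob(
    "entity_table",
    scan,
    DupCheckMapper.class,
    ImmutableBytesWritable.class,
    Writable.class,   // output value class, adjust to what your mapper emits
    job);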

Maybe you could also consider running a parallel scan on the index
table instead of random getting from it, but it depends on how the
index was constructed. Maybe you can come up with a scan that makes
more sense.
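
For example (entirely hypothetical -- the "dup" and "val" family names and the
"dupOf" qualifier are just placeholders, and it assumes your entity keys sort
in load order): run a second map-only job with entity_dup_index as the input
table, and for every index row holding more than one entry, write the
duplicate flags back to entity_table instead of doing one Get per entity row:

// Mapper over entity_dup_index: each row groups all entities sharing one
// composite key, with entries sorted by entity key.
public void map(ImmutableBytesWritable indexRowKey, Result indexRow, Context context)
    throws IOException, InterruptedException {
  NavigableMap<byte[], byte[]> entries = indexRow.getFamilyMap(Bytes.toBytes("dup"));
  if (entries == null || entries.size() < 2) {
    return;  // zero or one entity for this composite key -> no duplicates
  }
  byte[] previous = null;
  for (byte[] entityKey : entries.keySet()) {
    if (previous != null) {
      // every entry after the first is a dup of an earlier line
      Put put = new Put(entityKey);
      put.add(Bytes.toBytes("val"), Bytes.toBytes("dupOf"), previous);
      // same table-name key / output setup you use in your current mapper
      context.write(new ImmutableBytesWritable(Bytes.toBytes("entity_table")), put);
    }
    previous = entityKey;
  }
}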

Hope that helps,

J-D

On Mon, Feb 21, 2011 at 12:34 AM,  <er...@yahoo.com> wrote:
> Hi,
>
> I am currently building a multi-tenant ERP-like application capable of handling
> billions of transaction lines.  I am using HBase 0.90 and wrote an end-to-end
> initial POC to test the performance characteristics.  Here is my end-to-end use
> case:
> 1. load a sizable transaction submission file (csv).  The file has about 150
> attributes and is typically about 2,000,000 lines long
> 2. validate the file (resolve a bunch of attributes such as product, members,
> dates, amounts, against an effective-dated base data set)
> 3. flag each line as a duplicate of {list of line ids} if it has the same
> values for a user-defined set of columns.
> 4. run some calculations against these lines filtered by some user-defined
> filters
>
> I have a cluster of 10 basic machines (1 used as master and 9 as slaves).
>
> So here is what I am doing (I have about 10,000,000 lines in the entity_table at
> this point):
> 1. load the file into an HBase table called 'entity_table' using a mapper, passing it a
> file format definition object that understands how to parse the file
> 2. index the file by creating a new table called 'entity_dup_index' where the
> row key = entity.itemId+"_"+entity.memberId+"_"+entity.invoiceDate, with a single
> column family into which I add the entity.key as both qualifier and value.
> 3. run the validation step: the dup check calls entity_dup_index.get(index_key),
> loops over all key/value pairs, and removes all keys that are >= entity.key, to
> ensure that a line is flagged as a dup only of previously loaded lines.
>
> The question that I have is about performance:
> 1. Loading a 2,000,000 line file into HBase takes about 15 mins
> 2. indexing 2,000,000 lines takes about 3 mins (indexing 6,000,000 takes about 6
> mins)
> 3. running the dup check on 2,000,000 takes over 1 hour.
>
>
> public void map(ImmutableBytesWritable row, Result result, Context context)
> throws IOException, InterruptedException {
> // reset validations
> Delete delete = new Delete(row.get());
> delete.deleteFamily(valFam.getBytes());
> Put put = new Put(row.get());
> // ***********************************
> String key = getCompositeIndexKey(result);
> // indexTable is an HTable field, initialized in setConf()
> Get get = new Get(key.getBytes());
> Result rr = indexTable.get(get);
> // loop over all KeyValues of rr
> put.add(valFam.getBytes(), ...);
> // ***********************************
> context.write(tableName, delete);
> context.write(tableName, put);
> }
>
> The indexTable.get(get) call is the culprit!  When I comment out this code, the
> validation runs in under 15 mins.  Would you have any ideas on how I could improve
> the composite index lookup or structure my algorithm differently to get better
> performance?
>
> Thanks a lot for your help,
> -Eric