Posted to dev@hbase.apache.org by Himanshu Verma <hi...@gmail.com> on 2015/08/28 01:56:09 UTC

Optimizing LoadIncrementalHFiles.java

Hi,

I was looking at the following method:

  public void doBulkLoad(Path hfofDir, final Admin admin, Table table,
      RegionLocator regionLocator) throws TableNotFoundException, IOException {



We can optimize the following part of this method:

353       ArrayList<String> familyNames = new ArrayList<String>(families.size());
354       for (HColumnDescriptor family : families) {
355         familyNames.add(family.getNameAsString());
356       }
357       ArrayList<String> unmatchedFamilies = new ArrayList<String>();
358       Iterator<LoadQueueItem> queueIter = queue.iterator();
359       while (queueIter.hasNext()) {
360         LoadQueueItem lqi = queueIter.next();
361         String familyNameInHFile = Bytes.toString(lqi.family);
362         if (!familyNames.contains(familyNameInHFile)) {
363           unmatchedFamilies.add(familyNameInHFile);
364         }
365       }

Line 353 uses an ArrayList for familyNames, and line 362 calls its
"contains" method, which is O(n) in the number of column families. We could
use a HashSet instead, whose "contains" method is O(1) on average.

This should improve performance for tables with a large number of column
families.
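
To make the proposal concrete, here is a minimal sketch of the same check
with a HashSet. It is a simplified stand-in, not the actual HBase code:
QueueItem is a hypothetical replacement for LoadQueueItem, plain strings
replace HColumnDescriptor, and the class and method names are made up for
illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FamilyCheckSketch {

    // Hypothetical stand-in for LoadQueueItem: only the family bytes matter here.
    public static final class QueueItem {
        final byte[] family;

        public QueueItem(String family) {
            this.family = family.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Returns the family name of every queued item whose family is not in the table.
    // Duplicates are kept, as in the original loop.
    public static List<String> findUnmatchedFamilies(List<String> tableFamilies,
                                                     List<QueueItem> queue) {
        // HashSet lookup is O(1) on average, versus O(n) for ArrayList.contains.
        Set<String> familyNames = new HashSet<>(tableFamilies);
        List<String> unmatchedFamilies = new ArrayList<>();
        for (QueueItem item : queue) {
            String familyNameInHFile = new String(item.family, StandardCharsets.UTF_8);
            if (!familyNames.contains(familyNameInHFile)) {
                unmatchedFamilies.add(familyNameInHFile);
            }
        }
        return unmatchedFamilies;
    }

    public static void main(String[] args) {
        List<String> tableFamilies = Arrays.asList("cf1", "cf2");
        List<QueueItem> queue = Arrays.asList(
                new QueueItem("cf1"), new QueueItem("cf3"), new QueueItem("cf3"));
        System.out.println(findUnmatchedFamilies(tableFamilies, queue)); // prints [cf3, cf3]
    }
}
```

Only the type of familyNames changes. The loop body and the contents of
unmatchedFamilies stay the same.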

This is my first time here. I can make this change if everything looks fine.

Regards,
Himanshu Verma

Re: Optimizing LoadIncrementalHFiles.java

Posted by Ted Yu <yu...@gmail.com>.
I looked at the code again.
When the number of HFiles to be loaded times the number of column families
is large, your suggestion may produce some speedup. If you have access to a
cluster, you can measure the potential savings of your approach.
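
As a rough first check before any cluster run, the two lookups can be timed
in a single JVM. The sketch below is not a cluster measurement and not a
rigorous benchmark (there is no JIT warm-up control). The class name, helper
names, and the sizes numFamilies and numHFiles are all assumptions chosen
for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

public class ContainsTimingSketch {

    // Counts lookups that miss; returning the count keeps the loop from being optimized away.
    public static int countMisses(Collection<String> familyNames, List<String> lookups) {
        int misses = 0;
        for (String name : lookups) {
            if (!familyNames.contains(name)) {
                misses++;
            }
        }
        return misses;
    }

    // Builds names prefix0, prefix1, ..., prefix(count-1).
    public static List<String> makeNames(String prefix, int count) {
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            names.add(prefix + i);
        }
        return names;
    }

    public static void main(String[] args) {
        int numFamilies = 200;   // assumed number of column families
        int numHFiles = 50_000;  // assumed number of queued HFiles

        List<String> families = makeNames("cf", numFamilies);
        List<String> lookups = new ArrayList<>(numHFiles);
        for (int i = 0; i < numHFiles; i++) {
            // Half of the generated names exist in the table, half do not.
            lookups.add("cf" + (i % (numFamilies * 2)));
        }

        long t0 = System.nanoTime();
        int listMisses = countMisses(new ArrayList<>(families), lookups);
        long t1 = System.nanoTime();
        int setMisses = countMisses(new HashSet<>(families), lookups);
        long t2 = System.nanoTime();

        System.out.printf("ArrayList: %d misses in %d us%n", listMisses, (t1 - t0) / 1000);
        System.out.printf("HashSet:   %d misses in %d us%n", setMisses, (t2 - t1) / 1000);
    }
}
```

Varying numFamilies and numHFiles gives a feel for when the product of the
two becomes large enough for the difference to show.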

Cheers

On Thu, Aug 27, 2015 at 5:08 PM, Ted Yu <yu...@gmail.com> wrote:

> At roughly how many column families would this change show a performance
> boost?
>
> Cheers

Re: Optimizing LoadIncrementalHFiles.java

Posted by Ted Yu <yu...@gmail.com>.
At roughly how many column families would this change show a performance boost?

Cheers


