Posted to dev@hbase.apache.org by Himanshu Verma <hi...@gmail.com> on 2015/08/28 01:56:09 UTC
Optimizing LoadIncrementalHFiles.java
Hi,
I was looking at the following method:
public void doBulkLoad(Path hfofDir, final Admin admin, Table table,
    RegionLocator regionLocator) throws TableNotFoundException, IOException {
We can optimize the following part of this method:
353 ArrayList<String> familyNames = new ArrayList<String>(families.size());
354 for (HColumnDescriptor family : families) {
355   familyNames.add(family.getNameAsString());
356 }
357 ArrayList<String> unmatchedFamilies = new ArrayList<String>();
358 Iterator<LoadQueueItem> queueIter = queue.iterator();
359 while (queueIter.hasNext()) {
360   LoadQueueItem lqi = queueIter.next();
361   String familyNameInHFile = Bytes.toString(lqi.family);
362   if (!familyNames.contains(familyNameInHFile)) {
363     unmatchedFamilies.add(familyNameInHFile);
364   }
365 }
Line 353 uses an ArrayList for familyNames, and line 362 calls its
"contains" method, which is O(n) in the number of column families. We could
use a HashSet instead, whose "contains" method is O(1) on average.
This should improve performance when a table has a large number of column
families.
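To make the proposal concrete, here is a minimal standalone sketch of the same check using a HashSet. It is not the actual HBase patch: the class name UnmatchedFamilyCheck and the method findUnmatchedFamilies are hypothetical, and plain strings and byte arrays stand in for HColumnDescriptor and LoadQueueItem so that it runs without HBase on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone sketch; names here are illustrative, not HBase API.
class UnmatchedFamilyCheck {

  // Returns the family names seen in the queued HFiles that the table does not define.
  static List<String> findUnmatchedFamilies(List<String> tableFamilies,
                                            List<byte[]> queuedFamilies) {
    // Build the lookup set once; contains() is then O(1) on average.
    Set<String> familyNames = new HashSet<>(tableFamilies);
    List<String> unmatchedFamilies = new ArrayList<>();
    for (byte[] family : queuedFamilies) {
      // Stand-in for Bytes.toString(lqi.family): decode the family name as UTF-8.
      String familyNameInHFile = new String(family, StandardCharsets.UTF_8);
      if (!familyNames.contains(familyNameInHFile)) {
        unmatchedFamilies.add(familyNameInHFile);
      }
    }
    return unmatchedFamilies;
  }

  public static void main(String[] args) {
    List<String> tableFamilies = Arrays.asList("cf1", "cf2");
    List<byte[]> queued = Arrays.asList(
        "cf1".getBytes(StandardCharsets.UTF_8),
        "cf3".getBytes(StandardCharsets.UTF_8));
    System.out.println(findUnmatchedFamilies(tableFamilies, queued)); // prints [cf3]
  }
}
```

The unmatchedFamilies list stays an ArrayList because it is only appended to and never searched.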
This is my first time here; I can make this change if everything looks fine.
Regards,
Himanshu Verma
Re: Optimizing LoadIncrementalHFiles.java
Posted by Ted Yu <yu...@gmail.com>.
I looked at the code again.
When the number of HFiles to be loaded times the number of column families
is large, your suggestion may produce some speedup. If you have access to a
cluster, you can measure the potential savings of your approach.
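Short of a cluster run, a rough local timing can at least show how the lookup cost scales with that product. The sketch below is an assumption-laden micro-benchmark, not a measurement of LoadIncrementalHFiles itself: the class FamilyLookupTiming, its helper methods, and the sizes used are all hypothetical, and a single untimed-warm-up-free pass like this is sensitive to JIT effects.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

// Hypothetical timing sketch; sizes and names are illustrative only.
class FamilyLookupTiming {

  // Counts how many queued family names are missing from familyNames.
  static int countUnmatched(Collection<String> familyNames, List<String> queued) {
    int unmatched = 0;
    for (String name : queued) {
      if (!familyNames.contains(name)) {
        unmatched++;
      }
    }
    return unmatched;
  }

  // Generates n names of the form prefix0, prefix1, ...
  static List<String> names(String prefix, int n) {
    List<String> result = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      result.add(prefix + i);
    }
    return result;
  }

  public static void main(String[] args) {
    int numFamilies = 500;   // assumed number of column families
    int numHFiles = 20000;   // assumed number of queued HFiles
    List<String> families = names("cf", numFamilies);
    List<String> queued = new ArrayList<>(numHFiles);
    for (int i = 0; i < numHFiles; i++) {
      // Roughly half of the queued names will not match any table family.
      queued.add("cf" + (i % (2 * numFamilies)));
    }

    long t0 = System.nanoTime();
    int viaList = countUnmatched(families, queued);                // linear scan per lookup
    long t1 = System.nanoTime();
    int viaSet = countUnmatched(new HashSet<>(families), queued);  // hash lookup, includes set build
    long t2 = System.nanoTime();

    System.out.printf("list: %d unmatched in %d us%n", viaList, (t1 - t0) / 1000);
    System.out.printf("set:  %d unmatched in %d us%n", viaSet, (t2 - t1) / 1000);
  }
}
```

Both passes should report the same unmatched count; only the elapsed times are expected to differ, and for a handful of column families the difference is likely to be negligible.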
Cheers
On Thu, Aug 27, 2015 at 5:08 PM, Ted Yu <yu...@gmail.com> wrote:
> At roughly how many column families would this change show a performance
> boost?
>
> Cheers
Re: Optimizing LoadIncrementalHFiles.java
Posted by Ted Yu <yu...@gmail.com>.
At roughly how many column families would this change show a performance boost?
Cheers