Posted to user@phoenix.apache.org by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn> on 2014/09/09 08:18:36 UTC

Local index related data bulkload

Hi all and rajeshbabu,
   Recently our job has run into severe problems when trying to load data with local indexes
into Phoenix. The data load performance looks very bad compared with our previous data
loading with global indexes. That seems surprising, because the Phoenix local index targets
write-heavy, space-constrained use cases, which is exactly our application.
   Observing the stack trace while our job is running, we can see the following info:


We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and commented out the batchStarted method. After recompiling Phoenix and restarting the cluster,
our job's loading performance improved significantly. The code of the batchStarted method is pasted at the end of this message.
Here are my questions:
1 Can the committers of this code explain the concrete functionality of this method, especially with respect to local index data loading?
2 If we modify this code (e.g. comment out the method as we did), is there any potential impact on how Phoenix works?
3 Can anyone share code showing how to bulk load data with local indexes when the data files are stored in HDFS?
I know that CsvBulkload can upsert index-related data, while the map-reduce bulkload does not support that. Maybe our job is better suited to the map-reduce bulkload? So, if someone
has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.

Best Regards,
Sun

/**
 * Index builder for covered-columns index that ties into Phoenix for faster use.
 */
public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder {

    @Override
    public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException {
        // The entire purpose of this method impl is to get the existing rows for the
        // table rows being indexed into the block cache, as the index maintenance code
        // does a point scan per row
        List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size());
        List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>();
        for (int i = 0; i < miniBatchOp.size(); i++) {
            Mutation m = miniBatchOp.getOperation(i);
            keys.add(PDataType.VARBINARY.getKeyRange(m.getRow()));
            maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap()));
        }
        // Build a skip scan over exactly the row keys touched by this batch
        Scan scan = IndexManagementUtil.newLocalStateScan(maintainers);
        ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys), SchemaUtil.VAR_BINARY_SCHEMA);
        scanRanges.setScanStartStopRow(scan);
        scan.setFilter(scanRanges.getSkipScanFilter());
        HRegion region = this.env.getRegion();
        RegionScanner scanner = region.getScanner(scan);
        // Run through the scanner using the internal nextRaw method so the rows end up in the block cache
        region.startRegionOperation();
        try {
            boolean hasMore;
            do {
                List<Cell> results = Lists.newArrayList();
                // Results are potentially returned even when the return value of s.next is false
                // since this is an indication of whether or not there are more values after the
                // ones returned
                hasMore = scanner.nextRaw(results);
            } while (hasMore);
        } finally {
            try {
                scanner.close();
            } finally {
                region.closeRegionOperation();
            }
        }
    }
}






RE: Re: Local index related data bulkload

Posted by rajeshbabu chintaguntla <ra...@huawei.com>.
@James
bq. I remember you made a change to do a local region batched mutation, but these could potentially be parallelized further perhaps?
You mean that if we have multiple local indexes, we should separate the mutations per index and write them in parallel? I think it should give a bit better performance. Let me try it.

@Sun,
I have uploaded a patch for PHOENIX-1249<https://issues.apache.org/jira/browse/PHOENIX-1249>. If you have time, can you try your tests with the patch?

Thanks,
Rajeshbabu.

________________________________
________________________________
From: sunfl@certusnet.com.cn [sunfl@certusnet.com.cn]
Sent: Friday, September 12, 2014 11:48 AM
To: user
Subject: Re: Re: Local index related data bulkload

Hello James,
I created a JIRA here: https://issues.apache.org/jira/browse/PHOENIX-1249
Concerning my use case, local index data loading is indeed a bottleneck
because of the additional index updates and server-side processing. Our project needs to load over 100M
rows of data stored in HDFS into Phoenix every 15 minutes. We want to optimize the data
loading performance, and we consider the current read queries more or less adequate
for our requirements.
Thanks,
Sun.

________________________________
________________________________



From: James Taylor<ma...@apache.org>
Date: 2014-09-12 13:36
To: user<ma...@phoenix.apache.org>
Subject: Re: Re: Local index related data bulkload
Hi Sun,
You make a good point. Immutable and local vs global are orthogonal. We could support local immutable indexes as well as global immutable indexes. Would you mind filing a JIRA on this?

In your experience, is the index maintenance a bottleneck for you if you create completely covered immutable indexes? What's your mix of reads vs writes for your use case?

Thanks,
James

On Thu, Sep 11, 2014 at 7:32 PM, sunfl@certusnet.com.cn wrote:
Hi, James
Thanks for your reply. We understand the difference and the application scenarios for IMMUTABLE and MUTABLE indexes.
The main reason we want to use local indexing is its faster writes, as we are trying to increase our data loading
speed and performance. Another consideration is that local indexing does not require including additional columns when specifying queries,
which also fits our requirements.
James, is there any possibility that a local index can be created as an immutable index? We do not fully understand the design of
local indexing and why a local index must be created as a mutable index by default. Since HBase and Cassandra are often used for
time-series data, an immutable index may be more efficient in some situations. Those are just some of our considerations. Are there any options
to choose from when using a local index as an immutable index? Correct me if your design has constraints that prevent this by default.

Thanks,
Sun


From: James Taylor<ma...@apache.org>
Date: 2014-09-12 09:57
To: user<ma...@phoenix.apache.org>
Subject: Re: RE: Local index related data bulkload
Hi Sun,
Yes, that explains it. With immutable indexes, there is no index maintenance required, so there's no processing at all on the server side. If your data is write-once/append-only, then immutable indexes are about as efficient as you'll get. Any reason why you'd want to change them to local indexes? Local indexes are an alternative to global indexes for *mutable* data.
Thanks,
James
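
To make the contrast concrete, here is a minimal sketch of both setups issued through the Phoenix JDBC driver: a write-once table declared with IMMUTABLE_ROWS=true carrying a global index, and a plain mutable table carrying a local index (which is maintained as a mutable index by default). The table names, column names and ZooKeeper quorum (METRICS_IMMUTABLE, METRICS, HOST, TS, VAL, zk-host) are placeholders, not taken from this thread.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IndexDdlSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum; adjust to your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
             Statement stmt = conn.createStatement()) {
            // Write-once table: IMMUTABLE_ROWS=true tells Phoenix the rows are never updated,
            // so a global index on it is written by the client with no server-side maintenance reads.
            stmt.execute("CREATE TABLE METRICS_IMMUTABLE (HOST VARCHAR NOT NULL, TS DATE NOT NULL,"
                    + " VAL DOUBLE CONSTRAINT PK PRIMARY KEY (HOST, TS)) IMMUTABLE_ROWS=true");
            stmt.execute("CREATE INDEX METRICS_VAL_GIDX ON METRICS_IMMUTABLE (VAL)");

            // Mutable table with a local index: the index data lives with the data region,
            // but the index is maintained as a mutable index by default.
            stmt.execute("CREATE TABLE METRICS (HOST VARCHAR NOT NULL, TS DATE NOT NULL,"
                    + " VAL DOUBLE CONSTRAINT PK PRIMARY KEY (HOST, TS))");
            stmt.execute("CREATE LOCAL INDEX METRICS_VAL_LIDX ON METRICS (VAL)");
        }
    }
}

With the first setup the client writes the index rows itself; with the second, the region server maintains the index, which is the code path where PhoenixIndexBuilder#batchStarted comes in.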

On Thu, Sep 11, 2014 at 6:51 PM, sunfl@certusnet.com.cn wrote:
Hi, Rajeshbabu
Many thanks for your kind reply and explanation. Indeed, we created only one local index for the table.

We have one question: as far as we understand, for local indexing the index data could already be prepared
at client upsert time? Then there would be no need to scan and look up rows during region server processing? We ask because we
did not have this much trouble with global index loading (whether one or several indexes were involved).

Another question: the global indexes we created are immutable indexes (the table was created with IMMUTABLE_ROWS=true), while
local indexes are mutable by default. Does this difference account for much of the performance gap?

Best thanks,
Sun

________________________________
________________________________


From: rajeshbabu chintaguntla<ma...@huawei.com>
Sent: 2014-09-11 23:45
To: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: RE: Re: Local index related data bulkload
Hi Sun,
The code snippet (PhoenixIndexBuilder#batchStarted) you have pointed out is not specific to local indexing; it is generic for any index. The main idea of the method is to pull the rows being indexed into the block cache, so that the next time we scan those rows while preparing index updates we can read them from the cache.
        // The entire purpose of this method impl is to get the existing rows for the
        // table rows being indexed into the block cache, as the index maintenance code
        // does a point scan per row

This gives good performance when a table has more than one index. One more thing: with the psql tool we do upserts in batches, and each batch has 1000 updates by default (if you don't set phoenix.mutate.batchSize). Suppose all the rows are different; then we scan the region until we have cached all 1000 records. That's why
  hasMore = scanner.nextRaw(results);     //Here....
might be taking more time.
Can you tell me how many indexes you have created? One improvement we could make here is that if there is only one index, we can skip the scan in PhoenixIndexBuilder#batchStarted.
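
To make the batching concrete, here is a minimal client-side sketch showing how that batch size can be tuned through the phoenix.mutate.batchSize property on the JDBC connection (it should also be settable in the client-side hbase-site.xml). The JDBC URL, row counts and table are placeholders, reusing the hypothetical METRICS table from the earlier sketch.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

public class BatchSizeSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Default is 1000 rows per batch; each commit reaches the region server in
        // batches of this size, which is what PhoenixIndexBuilder#batchStarted sees.
        props.setProperty("phoenix.mutate.batchSize", "500");
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host", props);
             PreparedStatement ps = conn.prepareStatement(
                     "UPSERT INTO METRICS (HOST, TS, VAL) VALUES (?, CURRENT_DATE(), ?)")) {
            conn.setAutoCommit(false);
            for (int i = 0; i < 10000; i++) {
                ps.setString(1, "host-" + (i % 100));
                ps.setDouble(2, Math.random());
                ps.executeUpdate();
                if ((i + 1) % 500 == 0) {
                    conn.commit();   // flush one batch of mutations to the region servers
                }
            }
            conn.commit();
        }
    }
}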

@James, currently we scan the data region while preparing index updates. Why don't we prepare them without scanning the data region, if we can get all the index column data from the hooks?


bq. If someone has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.
Please refer to "http://phoenix.apache.org/bulk_dataload.html#Loading via MapReduce" to run the bulk load from HDFS. There you can pass the index table to build via the --index-table parameter.
But currently there is a problem with local indexing. I will raise an issue and work on it.
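
For completeness, a rough sketch of driving that MapReduce bulk load programmatically rather than through hadoop jar on the command line. The table name, index name, input path and ZooKeeper quorum are placeholders, and the option names should be double-checked against the bulk_dataload page for the Phoenix version in use.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Roughly equivalent to:
        //   hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool ...
        int exitCode = ToolRunner.run(conf, new CsvBulkLoadTool(), new String[] {
                "--table", "METRICS",                 // target Phoenix table
                "--index-table", "METRICS_VAL_LIDX",  // index to build along with the data
                "--input", "/data/metrics",           // CSV files already sitting in HDFS
                "--zookeeper", "zk-host"
        });
        System.exit(exitCode);
    }
}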


Thanks,
Rajeshbabu.

________________________________
From: sunfl@certusnet.com.cn [sunfl@certusnet.com.cn]
Sent: Thursday, September 11, 2014 6:34 AM
To: user
Subject: Re: Re: Local index related data bulkload

Many thanks.

________________________________
________________________________


From: rajesh babu Chintaguntla<ma...@gmail.com>
Date: 2014-09-10 21:09
To: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: Re: Local index related data bulkload
Hi Sun, I don't have access to the code right now. Tomorrow morning I will check and let you know.

Thanks,
Rajeshbabu

On Wednesday, September 10, 2014, sunfl@certusnet.com.cn wrote:
Any available suggestion?

________________________________

From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:24
To: user
Subject: Re: Local index related data bulkload
BTW.
The stack trace shows that our job's performance bottleneck mainly lies in the following code:
     region.startRegionOperation();
     try {
         boolean hasMore;
         do {
             List<Cell> results = Lists.newArrayList();
             // Results are potentially returned even when the return value of s.next is false
             // since this is an indication of whether or not there are more values after the
             // ones returned
             hasMore = scanner.nextRaw(results);     //Here....
         } while (hasMore);
     } finally {
         try {
             scanner.close();
         } finally {
             region.closeRegionOperation();
         }
     }





Re: Re: Local index related data bulkload

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
Hello James,
I created a JIRA here: https://issues.apache.org/jira/browse/PHOENIX-1249
Concerning my use case, local index data loading is indeed a bottleneck
because of the additional index updates and server-side processing. Our project needs to load over 100M
rows of data stored in HDFS into Phoenix every 15 minutes. We want to optimize the data
loading performance, and we consider the current read queries more or less adequate
for our requirements.
Thanks,
Sun.





 

Re: Re: Local index related data bulkload

Posted by James Taylor <ja...@apache.org>.
Hi Sun,
You make a good point. Immutable and local vs global are orthogonal. We
could support local immutable indexes as well as global immutable indexes.
Would you mind filing a JIRA on this?

In your experience, is the index maintenance a bottleneck for you if you
create completely covered immutable indexes? What's your mix of reads vs
writes for your use case?

Thanks,
James


Re: Re: Local index related data bulkload

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
Hi, James
Thanks for your reply. We understand the difference and the application scenarios for IMMUTABLE and MUTABLE indexes.
The main reason we want to use local indexing is its faster writes, as we are trying to increase our data loading
speed and performance. Another consideration is that local indexing does not require including additional columns when specifying queries,
which also fits our requirements.
James, is there any possibility that a local index can be created as an immutable index? We do not fully understand the design of
local indexing and why a local index must be created as a mutable index by default. Since HBase and Cassandra are often used for
time-series data, an immutable index may be more efficient in some situations. Those are just some of our considerations. Are there any options
to choose from when using a local index as an immutable index? Correct me if your design has constraints that prevent this by default.

Thanks,
Sun



Re: RE: Local index related data bulkload

Posted by James Taylor <ja...@apache.org>.
Hi Sun,
Yes, that explains it. With immutable indexes, there is no index
maintenance required, so there's no processing at all on the server side.
If your data is write-once/append-only, then immutable indexes are about as
efficient as you'll get. Any reason why you'd want to change them to local
indexes? Local indexes are an alternative to global indexes for *mutable*
data.
Thanks,
James


Re: RE: Local index related data bulkload

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
Hi, Rajeshbabu
Thanks very much for your kind reply and explanation. Indeed, we created only one local index for the table.

We have one question: as we understand it, for local indexing the index data could already be prepared on the
client for the upsert, so perhaps there is no need to scan and search on the specific region server? We did not
have this much trouble with global index loading (no matter whether one index or several indexes were involved).

Another question: the global indexes we created are immutable (the tables are declared with IMMUTABLE_ROWS=true),
while local indexes are mutable by default. Could this difference account for much of the performance gap?
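
For illustration, a minimal sketch of the two setups being compared; the table and index names are made up, and it is only meant to show where IMMUTABLE_ROWS and the LOCAL keyword come in:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IndexSetupSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
             Statement stmt = conn.createStatement()) {
            // Global index on a table declared immutable: rows are written once and
            // never updated, so index maintenance needs no read of the old row.
            stmt.execute("CREATE TABLE T_IMMUTABLE (ID VARCHAR PRIMARY KEY, V VARCHAR) IMMUTABLE_ROWS=true");
            stmt.execute("CREATE INDEX T_IMMUTABLE_IDX ON T_IMMUTABLE (V)");
            // Local index on a regular table (mutable by default): index rows live in
            // the same region as the data, but each upsert must look up the old row
            // to keep the index consistent.
            stmt.execute("CREATE TABLE T_MUTABLE (ID VARCHAR PRIMARY KEY, V VARCHAR)");
            stmt.execute("CREATE LOCAL INDEX T_MUTABLE_IDX ON T_MUTABLE (V)");
        }
    }
}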

Best thanks,
Sun






From: rajeshbabu chintaguntla
Sent: 2014-09-11 23:45
To: user@phoenix.apache.org
Subject: RE: Re: Local index related data bulkload
Hi Sun, 
The code snippet (PhoenixIndexBuilder#batchStarted) you have pointed out is not specific to local indexing; it is generic for any index. The main idea of the method is to keep the rows to be indexed in the block cache, so that later, whenever we scan those rows while preparing index updates, we can get them from the cache. 
        // The entire purpose of this method impl is to get the existing rows for the
        // table rows being indexed into the block cache, as the index maintenance code
        // does a point scan per row

This gives good performance when a table has more than one index. One more thing: with the psql tool we do upserts in batches, and each batch has 1000 updates by default (if you don't specify a value for phoenix.mutate.batchSize). Suppose all the rows are different; then we scan the region until we have cached all 1000 records. That's why 
  hasMore = scanner.nextRaw(results);     //Here....  might be taking more time.
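
(As a side note, not from this thread: if the writes go through psql/JDBC, the batch size mentioned above can presumably be tuned down on the client; the host name and value below are made up, and whether a smaller batch actually helps here is only a guess.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class SmallerMutateBatch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Default is 1000 as described above; a smaller batch means fewer rows are
        // pre-scanned into the block cache per call to batchStarted.
        props.setProperty("phoenix.mutate.batchSize", "200");
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host", props)) {
            // UPSERTs issued on this connection are committed in the smaller batches.
        }
    }
}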
Can you tell me how many indexes you have created? One improvement we could make here: if we have only one index, we can skip the scan in PhoenixIndexBuilder#batchStarted. 
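
(A rough sketch of what that single-index shortcut could look like, reusing the names from the method quoted later in this thread; this is just an illustration of the idea, not a tested patch.)

@Override
public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException {
    // Collect the maintainers first, exactly as the existing code does.
    List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>();
    for (int i = 0; i < miniBatchOp.size(); i++) {
        maintainers.addAll(getCodec().getIndexMaintainers(
                miniBatchOp.getOperation(i).getAttributesMap()));
    }
    // With only one index, the per-row point scan done during index maintenance is
    // cheap enough that warming the block cache here may not pay off, so skip it.
    if (maintainers.size() <= 1) {
        return;
    }
    // ... otherwise fall through to the existing batched skip-scan over the keys ...
}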

@James, currently we scan the data region while preparing index updates. Why don't we prepare them without scanning the data region, if we can get all the index column data from the hooks? 


bq. If someone has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.
Please refer to "http://phoenix.apache.org/bulk_dataload.html#Loading via MapReduce" to run the bulk load from HDFS. Here we can pass the index table to build via the --index-table parameter.
But currently there is a problem with local indexing. I will raise an issue and work on it.
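
(For reference, a sketch of driving that MapReduce bulk load programmatically instead of via "hadoop jar"; the table name, index name, input path and ZooKeeper quorum are placeholders.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

public class LocalIndexBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        int exitCode = ToolRunner.run(conf, new CsvBulkLoadTool(), new String[] {
                "--table", "MY_TABLE",                 // data table to load
                "--index-table", "MY_TABLE_LOCAL_IDX", // index to build, via the --index-table parameter above
                "--input", "/data/my_table.csv",       // CSV file(s) already in HDFS
                "--zookeeper", "zk-host:2181"          // cluster quorum
        });
        System.exit(exitCode);
    }
}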


Thanks,
Rajeshbabu.

This e-mail and its attachments contain confidential information from HUAWEI, which 
is intended only for the person or entity whose address is listed above. Any use of the 
information contained herein in any way (including, but not limited to, total or partial 
disclosure, reproduction, or dissemination) by persons other than the intended 
recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by 
phone or email immediately and delete it!


From: sunfl@certusnet.com.cn [sunfl@certusnet.com.cn]
Sent: Thursday, September 11, 2014 6:34 AM
To: user
Subject: Re: Re: Local index related data bulkload

Many thanks.






From: rajesh babu Chintaguntla
Date: 2014-09-10 21:09
To: user@phoenix.apache.org
Subject: Re: Local index related data bulkload
Hi Sun, I don't have access to the code right now. Tomorrow morning I will check and let you know. 

Thanks,
Rajeshbabu 

On Wednesday, September 10, 2014, sunfl@certusnet.com.cn <su...@certusnet.com.cn> wrote:
Any available suggestion?




From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:24
To: user
Subject: Re: Local index related data bulkload
BTW, the stack trace shows that our job's performance bottleneck lies mainly in the following code:
     region.startRegionOperation(); 
          try { 
               boolean hasMore; 
               do { 
                  List<Cell> results = Lists.newArrayList(); 
             // Results are potentially returned even when the return value of s.next is false 
             // since this is an indication of whether or not there are more values after the 
            // ones returned 
                 hasMore = scanner.nextRaw(results);     //Here.... 
              } while (hasMore); 
            } finally { 
               try { 
                 scanner.close(); 
               } finally { 
                  region.closeRegionOperation(); 
                } 
            } 
         } 




From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:18
To: user
Cc: rajeshbabu chintaguntla
Subject: Local index related data bulkload
Hi all and rajeshbabu,
   Recently our job has encountered severe problems when trying to load data with local indexes
into Phoenix. The data load performance looks very bad compared with our previous data
loading with global indexes. That seems quite absurd, because the Phoenix local index targets
scenarios with heavy writes and space constraints, which is exactly our application.
   Observing the stack trace while our job is running, we find the following info:
   

We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and commented out the batchStarted method. After recompiling Phoenix and restarting the cluster, 
our load performance improved significantly. Following is the code for the batchStarted method.
Here are my questions:
1 Can the committers of this code explain exactly what this method does, especially with respect to local index data loading?
2 If we modify this code (e.g. comment out the method as we did), is there any potential impact on how Phoenix works?
3 Even more helpful: can anyone share code showing how to complete a bulk load with local indexes when the data files are stored in HDFS?
I know that CsvBulkload can upsert index-related data while the MapReduce bulk load does not support that. Maybe our job is closer to a MapReduce bulk load? So, if someone 
has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.

Best Regards,
Sun

/** 
* Index builder for covered-columns index that ties into phoenix for faster use. 
*/ 
public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder { 

@Override 
public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException { 
// The entire purpose of this method impl is to get the existing rows for the 
// table rows being indexed into the block cache, as the index maintenance code 
// does a point scan per row 
List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size()); 
List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>(); 
for (int i = 0; i < miniBatchOp.size(); i++) { 
Mutation m = miniBatchOp.getOperation(i); 
keys.add(PDataType.VARBINARY.getKeyRange(m.getRow())); 
maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap())); 
} 
Scan scan = IndexManagementUtil.newLocalStateScan(maintainers); 
ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys), SchemaUtil.VAR_BINARY_SCHEMA); 
scanRanges.setScanStartStopRow(scan); 
scan.setFilter(scanRanges.getSkipScanFilter()); 
HRegion region = this.env.getRegion(); 
RegionScanner scanner = region.getScanner(scan); 
// Run through the scanner using internal nextRaw method 
region.startRegionOperation(); 
try { 
boolean hasMore; 
do { 
List<Cell> results = Lists.newArrayList(); 
// Results are potentially returned even when the return value of s.next is false 
// since this is an indication of whether or not there are more values after the 
// ones returned 
hasMore = scanner.nextRaw(results);     
} while (hasMore); 
} finally { 
try { 
scanner.close(); 
} finally { 
region.closeRegionOperation(); 
} 
} 
}






Re: Re: Local index related data bulkload

Posted by James Taylor <ja...@apache.org>.
Hi Sun,
Thanks for reporting this issue. Your point is well taken: Local indexes
are supposed to be faster at writes than global secondary indexes because
all the writes are local. If not, something needs to be changed/fixed.

As far as PhoenixIndexBuilder#batchStarted goes, yes, Rajeshbabu is right: it's
there to batch-get, beforehand, all the data rows that we'll use to do the index
maintenance. We need to know the old row values to update the index table
correctly. The alternative would be to get them one by one which would
presumably be slower. If we didn't do them in a batched way, I suspect your
overall time would be higher (and would show up in the Scan we do for each
individual data row instead).
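
(To make that concrete, a toy sketch of why the old row has to be read: the stale index entry, keyed by the old indexed value, must be deleted alongside writing the new one. The key layout and names here are invented for illustration, not Phoenix's actual encoding.)

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WhyOldRowValuesAreNeeded {
    // Hypothetical index row key: indexed value, a zero-byte separator, then the data row key.
    static byte[] indexRowKey(String indexedValue, String dataRowKey) {
        return Bytes.add(Bytes.toBytes(indexedValue), new byte[] { 0 }, Bytes.toBytes(dataRowKey));
    }

    public static void main(String[] args) {
        String dataRowKey = "row1";
        String oldValue = "apple"; // only knowable by reading the existing data row
        String newValue = "pear";  // the value arriving in the upsert
        // Without the old value we could not compute the key of the stale index row to delete.
        Delete dropStaleEntry = new Delete(indexRowKey(oldValue, dataRowKey));
        Put addNewEntry = new Put(indexRowKey(newValue, dataRowKey)); // a real maintainer would also add the index cells
        System.out.println("delete " + dropStaleEntry + ", put " + addNewEntry);
    }
}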

@Rajeshbabu - not sure I understand your question. How would we know which
index rows need to be deleted as a result of the data rows changing without
looking up the data rows?

One area that may have room for improvement is the writing of the index
rows. In the case of a local index, are we writing the rows in parallel? I
remember you made a change to do a local region batched mutation, but these
could potentially be parallelized further perhaps?

Thanks,
James





On Thu, Sep 11, 2014 at 8:45 AM, rajeshbabu chintaguntla <
rajeshbabu.chintaguntla@huawei.com> wrote:

>  Hi Sun,
>  The code snippet (*PhoenixIndexBuilder#batchStarted*) you have pointed
> out is not specific to local indexing; it is generic for any index. The main
> idea of the method is to keep the rows to be indexed in the block cache, so
> that later, whenever we scan those rows while preparing index updates, we can
> get them from the cache.
>          // The entire purpose of this method impl is to get the existing
> rows for the
>          // table rows being indexed into the block cache, as the index
> maintenance code
>          // does a point scan per row
>
>   This gives good performance when a table has more than one index. One
> more thing: with the psql tool we do upserts in batches, and each batch has
> 1000 updates by default (if you don't specify a value for
> phoenix.mutate.batchSize). Suppose all the rows are different; then we
> scan the region until we have cached all 1000 records. That's why
>   hasMore = scanner.nextRaw(results);     //Here....  might be taking
> more time.
> Can you tell me how many indexes you have created? One improvement we could
> make here: if we have only one index, we can skip the scan in
> *PhoenixIndexBuilder#batchStarted. *
>
>  @James, currently we scan the data region while preparing index
> updates. Why don't we prepare them without scanning the data region, if we
> can get all the index column data from the hooks?
>
>
>  bq. If someone has successfully loaded data through CsvBulkload
> using Spark and HDFS, please share your suggestions.
>  Please refer to "http://phoenix.apache.org/bulk_dataload.html#Loading via
> MapReduce" to run the bulk load from HDFS. Here we can pass the index table
> to build via the --index-table parameter.
>  But currently there is a problem with local indexing. I will raise an
> issue and work on it.
>
>
>  Thanks,
>  Rajeshbabu.
>
>     This e-mail and its attachments contain confidential information from
> HUAWEI, which
> is intended only for the person or entity whose address is listed above.
> Any use of the
> information contained herein in any way (including, but not limited to,
> total or partial
> disclosure, reproduction, or dissemination) by persons other than the
> intended
> recipient(s) is prohibited. If you receive this e-mail in error, please
> notify the sender by
> phone or email immediately and delete it!
>   ------------------------------
> *From:* sunfl@certusnet.com.cn [sunfl@certusnet.com.cn]
> *Sent:* Thursday, September 11, 2014 6:34 AM
> *To:* user
> *Subject:* Re: Re: Local index related data bulkload
>
>   Many thanks.
>
>  ------------------------------
>  ------------------------------
>
>
>     *From:* rajesh babu Chintaguntla <ch...@gmail.com>
> *Date:* 2014-09-10 21:09
> *To:* user@phoenix.apache.org
> *Subject:* Re: Local index related data bulkload
>   Hi Sun, I don't have access to the code right now. Tomorrow morning I will
> check and let you know.
>
>  Thanks,
> Rajeshbabu
>
> On Wednesday, September 10, 2014, sunfl@certusnet.com.cn <
> sunfl@certusnet.com.cn> wrote:
>
>>  Any available suggestion?
>>
>>  ------------------------------
>>
>>    *From:* sunfl@certusnet.com.cn
>> *Sent:* 2014-09-09 14:24
>> *To:* user
>> *Subject:* Re: Local index related data bulkload
>>   BTW, the stack trace shows that our job's performance bottleneck lies
>> mainly in the following code:
>>      region.startRegionOperation();
>>           try {
>>                boolean hasMore;
>>                do {
>>                   List<Cell> results = Lists.newArrayList();
>>              // Results are potentially returned even when the return
>> value of s.next is false
>>              // since this is an indication of whether or not there are
>> more values after the
>>             // ones returned
>>                  hasMore = scanner.nextRaw(results);     //Here....
>>               } while (hasMore);
>>             } finally {
>>                try {
>>                  scanner.close();
>>                } finally {
>>                   region.closeRegionOperation();
>>                 }
>>             }
>>          }
>>
>>  ------------------------------
>>
>>    *From:* sunfl@certusnet.com.cn
>> *Sent:* 2014-09-09 14:18
>> *To:* user
>> *Cc:* rajeshbabu chintaguntla
>> *Subject:* Local index related data bulkload
>>   Hi all and rajeshbabu,
>>    Recently our job has encountered severe problems when trying to load
>> data with local indexes
>> into Phoenix. The data load performance looks very bad compared with our
>> previous data loading with global indexes. That seems quite absurd, because
>> the Phoenix local index targets scenarios with heavy writes and space
>> constraints, which is exactly our application.
>>    Observing the stack trace while our job is running, we find the
>> following info:
>>
>>
>>  We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and
>> commented out the batchStarted method. After recompiling Phoenix and
>> restarting the cluster,
>> our load performance improved significantly. Following is the
>> code for the batchStarted method.
>> Here are my questions:
>> 1 Can the committers of this code explain exactly what this method does,
>> especially with respect to local index data loading?
>> 2 If we modify this code (e.g. comment out the method as we did),
>> is there any potential impact on how Phoenix works?
>> 3 Even more helpful: can anyone share code showing how to
>> complete a bulk load with local indexes when the data files are stored
>> in HDFS?
>> I know that CsvBulkload can upsert index-related data while the
>> MapReduce bulk load does not support that. Maybe our job is closer to a
>> MapReduce bulk load? So, if someone
>> has successfully loaded data through CsvBulkload using Spark and
>> HDFS, please share your suggestions.
>>
>>  Best Regards,
>> Sun
>>
>>  /**
>> * Index builder for covered-columns index that ties into phoenix for
>> faster use.
>> */
>> public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder {
>>
>> @Override
>> public void batchStarted(MiniBatchOperationInProgress<Mutation>
>> miniBatchOp) throws IOException {
>> // The entire purpose of this method impl is to get the existing rows for
>> the
>> // table rows being indexed into the block cache, as the index
>> maintenance code
>> // does a point scan per row
>> List<KeyRange> keys =
>> Lists.newArrayListWithExpectedSize(miniBatchOp.size());
>> List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>();
>> for (int i = 0; i < miniBatchOp.size(); i++) {
>> Mutation m = miniBatchOp.getOperation(i);
>> keys.add(PDataType.VARBINARY.getKeyRange(m.getRow()));
>> maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap()));
>> }
>> Scan scan = IndexManagementUtil.newLocalStateScan(maintainers);
>> ScanRanges scanRanges =
>> ScanRanges.create(Collections.singletonList(keys),
>> SchemaUtil.VAR_BINARY_SCHEMA);
>> scanRanges.setScanStartStopRow(scan);
>> scan.setFilter(scanRanges.getSkipScanFilter());
>> HRegion region = this.env.getRegion();
>> RegionScanner scanner = region.getScanner(scan);
>> // Run through the scanner using internal nextRaw method
>> region.startRegionOperation();
>> try {
>> boolean hasMore;
>> do {
>> List<Cell> results = Lists.newArrayList();
>> // Results are potentially returned even when the return value of s.next
>> is false
>> // since this is an indication of whether or not there are more values
>> after the
>> // ones returned
>> hasMore = scanner.nextRaw(results);
>> } while (hasMore);
>> } finally {
>> try {
>> scanner.close();
>> } finally {
>> region.closeRegionOperation();
>> }
>> }
>> }
>> ------------------------------
>>  ------------------------------
>>
>>
>>

RE: Re: Local index related data bulkload

Posted by rajeshbabu chintaguntla <ra...@huawei.com>.
Hi Sun,
The code snippet (PhoenixIndexBuilder#batchStarted) you have pointed out is not specific to local indexing; it is generic for any index. The main idea of the method is to keep the rows to be indexed in the block cache, so that later, whenever we scan those rows while preparing index updates, we can get them from the cache.
        // The entire purpose of this method impl is to get the existing rows for the
        // table rows being indexed into the block cache, as the index maintenance code
        // does a point scan per row

This gives good performance when a table has more than one index. One more thing: with the psql tool we do upserts in batches, and each batch has 1000 updates by default (if you don't specify a value for phoenix.mutate.batchSize). Suppose all the rows are different; then we scan the region until we have cached all 1000 records. That's why
  hasMore = scanner.nextRaw(results);     //Here....  might be taking more time.
Can you tell me how many indexes you have created? One improvement we could make here: if we have only one index, we can skip the scan in PhoenixIndexBuilder#batchStarted.

@James, currently we scan the data region while preparing index updates. Why don't we prepare them without scanning the data region, if we can get all the index column data from the hooks?


bq. If someone has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.
Please refer to "http://phoenix.apache.org/bulk_dataload.html#Loading via MapReduce" to run the bulk load from HDFS. Here we can pass the index table to build via the --index-table parameter.
But currently there is a problem with local indexing. I will raise an issue and work on it.


Thanks,
Rajeshbabu.

This e-mail and its attachments contain confidential information from HUAWEI, which
is intended only for the person or entity whose address is listed above. Any use of the
information contained herein in any way (including, but not limited to, total or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by
phone or email immediately and delete it!
________________________________
From: sunfl@certusnet.com.cn [sunfl@certusnet.com.cn]
Sent: Thursday, September 11, 2014 6:34 AM
To: user
Subject: Re: Re: Local index related data bulkload

Many thanks.

________________________________
________________________________


From: rajesh babu Chintaguntla<ma...@gmail.com>
Date: 2014-09-10 21:09
To: user@phoenix.apache.org<ma...@phoenix.apache.org>
Subject: Re: Local index related data bulkload
Hi Sun, I don't have access to the code right now. Tomorrow morning I will check and let you know.

Thanks,
Rajeshbabu

On Wednesday, September 10, 2014, sunfl@certusnet.com.cn<ma...@certusnet.com.cn> <su...@certusnet.com.cn>> wrote:
Any available suggestion?

________________________________

From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:24
To: user
Subject: Re: Local index related data bulkload
BTW, the stack trace shows that our job's performance bottleneck lies mainly in the following code:
     region.startRegionOperation();
          try {
               boolean hasMore;
               do {
                  List<Cell> results = Lists.newArrayList();
             // Results are potentially returned even when the return value of s.next is false
             // since this is an indication of whether or not there are more values after the
            // ones returned
                 hasMore = scanner.nextRaw(results);     //Here....
              } while (hasMore);
            } finally {
               try {
                 scanner.close();
               } finally {
                  region.closeRegionOperation();
                }
            }
         }

________________________________

From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:18
To: user
Cc: rajeshbabu chintaguntla
Subject: Local index related data bulkload
Hi all and rajeshbabu,
   Recently our job has encountered severe problems when trying to load data with local indexes
into Phoenix. The data load performance looks very bad compared with our previous data
loading with global indexes. That seems quite absurd, because the Phoenix local index targets
scenarios with heavy writes and space constraints, which is exactly our application.
   Observing the stack trace while our job is running, we find the following info:

We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and commented out the batchStarted method. After recompiling Phoenix and restarting the cluster,
our load performance improved significantly. Following is the code for the batchStarted method.
Here are my questions:
1 Can the committers of this code explain exactly what this method does, especially with respect to local index data loading?
2 If we modify this code (e.g. comment out the method as we did), is there any potential impact on how Phoenix works?
3 Even more helpful: can anyone share code showing how to complete a bulk load with local indexes when the data files are stored in HDFS?
I know that CsvBulkload can upsert index-related data while the MapReduce bulk load does not support that. Maybe our job is closer to a MapReduce bulk load? So, if someone
has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.

Best Regards,
Sun

/**
* Index builder for covered-columns index that ties into phoenix for faster use.
*/
public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder {

@Override
public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException {
// The entire purpose of this method impl is to get the existing rows for the
// table rows being indexed into the block cache, as the index maintenance code
// does a point scan per row
List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size());
List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>();
for (int i = 0; i < miniBatchOp.size(); i++) {
Mutation m = miniBatchOp.getOperation(i);
keys.add(PDataType.VARBINARY.getKeyRange(m.getRow()));
maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap()));
}
Scan scan = IndexManagementUtil.newLocalStateScan(maintainers);
ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys), SchemaUtil.VAR_BINARY_SCHEMA);
scanRanges.setScanStartStopRow(scan);
scan.setFilter(scanRanges.getSkipScanFilter());
HRegion region = this.env.getRegion();
RegionScanner scanner = region.getScanner(scan);
// Run through the scanner using internal nextRaw method
region.startRegionOperation();
try {
boolean hasMore;
do {
List<Cell> results = Lists.newArrayList();
// Results are potentially returned even when the return value of s.next is false
// since this is an indication of whether or not there are more values after the
// ones returned
hasMore = scanner.nextRaw(results);
} while (hasMore);
} finally {
try {
scanner.close();
} finally {
region.closeRegionOperation();
}
}
}
________________________________
________________________________


Re: Re: Local index related data bulkload

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
Many thanks.






From: rajesh babu Chintaguntla
Date: 2014-09-10 21:09
To: user@phoenix.apache.org
Subject: Re: Local index related data bulkload
Hi Sun, I don't have access to the code right now. Tomorrow morning I will check and let you know.

Thanks,
Rajeshbabu 

On Wednesday, September 10, 2014, sunfl@certusnet.com.cn <su...@certusnet.com.cn> wrote:
Any available suggestion?




From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:24
To: user
Subject: Re: Local index related data bulkload
BTW, the stack trace shows that our job's performance bottleneck lies mainly in the following code:
     region.startRegionOperation(); 
          try { 
               boolean hasMore; 
               do { 
                  List<Cell> results = Lists.newArrayList(); 
             // Results are potentially returned even when the return value of s.next is false 
             // since this is an indication of whether or not there are more values after the 
            // ones returned 
                 hasMore = scanner.nextRaw(results);     //Here.... 
              } while (hasMore); 
            } finally { 
               try { 
                 scanner.close(); 
               } finally { 
                  region.closeRegionOperation(); 
                } 
            } 
         } 




From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:18
To: user
Cc: rajeshbabu chintaguntla
Subject: Local index related data bulkload
Hi all and rajeshbabu,
   Recently our job has encountered severe problems when trying to load data with local indexes
into Phoenix. The data load performance looks very bad compared with our previous data
loading with global indexes. That seems quite absurd, because the Phoenix local index targets
scenarios with heavy writes and space constraints, which is exactly our application.
   Observing the stack trace while our job is running, we find the following info:
   

We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and commented out the batchStarted method. After recompiling Phoenix and restarting the cluster,
our load performance improved significantly. Following is the code for the batchStarted method.
Here are my questions:
1 Can the committers of this code explain exactly what this method does, especially with respect to local index data loading?
2 If we modify this code (e.g. comment out the method as we did), is there any potential impact on how Phoenix works?
3 Even more helpful: can anyone share code showing how to complete a bulk load with local indexes when the data files are stored in HDFS?
I know that CsvBulkload can upsert index-related data while the MapReduce bulk load does not support that. Maybe our job is closer to a MapReduce bulk load? So, if someone
has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.

Best Regards,
Sun

/** 
* Index builder for covered-columns index that ties into phoenix for faster use. 
*/ 
public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder { 

@Override 
public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException { 
// The entire purpose of this method impl is to get the existing rows for the 
// table rows being indexed into the block cache, as the index maintenance code 
// does a point scan per row 
List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size()); 
List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>(); 
for (int i = 0; i < miniBatchOp.size(); i++) { 
Mutation m = miniBatchOp.getOperation(i); 
keys.add(PDataType.VARBINARY.getKeyRange(m.getRow())); 
maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap())); 
} 
Scan scan = IndexManagementUtil.newLocalStateScan(maintainers); 
ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys), SchemaUtil.VAR_BINARY_SCHEMA); 
scanRanges.setScanStartStopRow(scan); 
scan.setFilter(scanRanges.getSkipScanFilter()); 
HRegion region = this.env.getRegion(); 
RegionScanner scanner = region.getScanner(scan); 
// Run through the scanner using internal nextRaw method 
region.startRegionOperation(); 
try { 
boolean hasMore; 
do { 
List<Cell> results = Lists.newArrayList(); 
// Results are potentially returned even when the return value of s.next is false 
// since this is an indication of whether or not there are more values after the 
// ones returned 
hasMore = scanner.nextRaw(results);     
} while (hasMore); 
} finally { 
try { 
scanner.close(); 
} finally { 
region.closeRegionOperation(); 
} 
} 
}






Re: Local index related data bulkload

Posted by rajesh babu Chintaguntla <ch...@gmail.com>.
Hi Sun, I don't have access to the code right now. Tomorrow morning I will check and
let you know.

Thanks,
Rajeshbabu

On Wednesday, September 10, 2014, sunfl@certusnet.com.cn <
sunfl@certusnet.com.cn> wrote:

> Any available suggestion?
>
> ------------------------------
>
> *From:* sunfl@certusnet.com.cn
> *Sent:* 2014-09-09 14:24
> *To:* user
> *Subject:* Re: Local index related data bulkload
> BTW, the stack trace shows that our job's performance bottleneck lies
> mainly in the following code:
>      region.startRegionOperation();
>           try {
>                boolean hasMore;
>                do {
>                   List<Cell> results = Lists.newArrayList();
>              // Results are potentially returned even when the return
> value of s.next is false
>              // since this is an indication of whether or not there are
> more values after the
>             // ones returned
>                  hasMore = scanner.nextRaw(results);     //Here....
>               } while (hasMore);
>             } finally {
>                try {
>                  scanner.close();
>                } finally {
>                   region.closeRegionOperation();
>                 }
>             }
>          }
>
> ------------------------------
>
> *From:* sunfl@certusnet.com.cn
> *Sent:* 2014-09-09 14:18
> *To:* user
> *Cc:* rajeshbabu chintaguntla
> *Subject:* Local index related data bulkload
> Hi all and rajeshbabu,
>    Recently our job has encountered severe problems when trying to load
> data with local indexes
> into Phoenix. The data load performance looks very bad compared with our
> previous data loading with global indexes. That seems quite absurd, because
> the Phoenix local index targets scenarios with heavy writes and space
> constraints, which is exactly our application.
>    Observing the stack trace while our job is running, we find the following
> info:
>
>
> We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and
> commented out the batchStarted method. After recompiling Phoenix and
> restarting the cluster,
> our load performance improved significantly. Following is the code
> for the batchStarted method.
> Here are my questions:
> 1 Can the committers of this code explain exactly what this method does,
> especially with respect to local index data loading?
> 2 If we modify this code (e.g. comment out the method as we did), is
> there any potential impact on how Phoenix works?
> 3 Even more helpful: can anyone share code showing how to complete
> a bulk load with local indexes when the data files are stored in HDFS?
> I know that CsvBulkload can upsert index-related data while the
> MapReduce bulk load does not support that. Maybe our job is closer to a
> MapReduce bulk load? So, if someone
> has successfully loaded data through CsvBulkload using Spark and
> HDFS, please share your suggestions.
>
> Best Regards,
> Sun
>
> /**
> * Index builder for covered-columns index that ties into phoenix for
> faster use.
> */
> public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder {
>
> @Override
> public void batchStarted(MiniBatchOperationInProgress<Mutation>
> miniBatchOp) throws IOException {
> // The entire purpose of this method impl is to get the existing rows for
> the
> // table rows being indexed into the block cache, as the index maintenance
> code
> // does a point scan per row
> List<KeyRange> keys =
> Lists.newArrayListWithExpectedSize(miniBatchOp.size());
> List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>();
> for (int i = 0; i < miniBatchOp.size(); i++) {
> Mutation m = miniBatchOp.getOperation(i);
> keys.add(PDataType.VARBINARY.getKeyRange(m.getRow()));
> maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap()));
> }
> Scan scan = IndexManagementUtil.newLocalStateScan(maintainers);
> ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys),
> SchemaUtil.VAR_BINARY_SCHEMA);
> scanRanges.setScanStartStopRow(scan);
> scan.setFilter(scanRanges.getSkipScanFilter());
> HRegion region = this.env.getRegion();
> RegionScanner scanner = region.getScanner(scan);
> // Run through the scanner using internal nextRaw method
> region.startRegionOperation();
> try {
> boolean hasMore;
> do {
> List<Cell> results = Lists.newArrayList();
> // Results are potentially returned even when the return value of s.next
> is false
> // since this is an indication of whether or not there are more values
> after the
> // ones returned
> hasMore = scanner.nextRaw(results);
> } while (hasMore);
> } finally {
> try {
> scanner.close();
> } finally {
> region.closeRegionOperation();
> }
> }
> }
> ------------------------------
> ------------------------------
>
>
>

Re: Re: Local index related data bulkload

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
Any available suggestion?




From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:24
To: user
Subject: Re: Local index related data bulkload
BTW, the stack trace shows that our job's performance bottleneck lies mainly in the following code:
     region.startRegionOperation(); 
          try { 
               boolean hasMore; 
               do { 
                  List<Cell> results = Lists.newArrayList(); 
             // Results are potentially returned even when the return value of s.next is false 
             // since this is an indication of whether or not there are more values after the 
            // ones returned 
                 hasMore = scanner.nextRaw(results);     //Here.... 
              } while (hasMore); 
            } finally { 
               try { 
                 scanner.close(); 
               } finally { 
                  region.closeRegionOperation(); 
                } 
            } 
         } 




From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:18
To: user
Cc: rajeshbabu chintaguntla
Subject: Local index related data bulkload
Hi all and rajeshbabu,
   Recently our job has encountered severe problems when trying to load data with local indexes
into Phoenix. The data load performance looks very bad compared with our previous data
loading with global indexes. That seems quite absurd, because the Phoenix local index targets
scenarios with heavy writes and space constraints, which is exactly our application.
   Observing the stack trace while our job is running, we find the following info:
   

We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and commented out the batchStarted method. After recompiling Phoenix and restarting the cluster,
our load performance improved significantly. Following is the code for the batchStarted method.
Here are my questions:
1 Can the committers of this code explain exactly what this method does, especially with respect to local index data loading?
2 If we modify this code (e.g. comment out the method as we did), is there any potential impact on how Phoenix works?
3 Even more helpful: can anyone share code showing how to complete a bulk load with local indexes when the data files are stored in HDFS?
I know that CsvBulkload can upsert index-related data while the MapReduce bulk load does not support that. Maybe our job is closer to a MapReduce bulk load? So, if someone
has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.

Best Regards,
Sun

/** 
* Index builder for covered-columns index that ties into phoenix for faster use. 
*/ 
public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder { 

@Override 
public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException { 
// The entire purpose of this method impl is to get the existing rows for the 
// table rows being indexed into the block cache, as the index maintenance code 
// does a point scan per row 
List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size()); 
List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>(); 
for (int i = 0; i < miniBatchOp.size(); i++) { 
Mutation m = miniBatchOp.getOperation(i); 
keys.add(PDataType.VARBINARY.getKeyRange(m.getRow())); 
maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap())); 
} 
Scan scan = IndexManagementUtil.newLocalStateScan(maintainers); 
ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys), SchemaUtil.VAR_BINARY_SCHEMA); 
scanRanges.setScanStartStopRow(scan); 
scan.setFilter(scanRanges.getSkipScanFilter()); 
HRegion region = this.env.getRegion(); 
RegionScanner scanner = region.getScanner(scan); 
// Run through the scanner using internal nextRaw method 
region.startRegionOperation(); 
try { 
boolean hasMore; 
do { 
List<Cell> results = Lists.newArrayList(); 
// Results are potentially returned even when the return value of s.next is false 
// since this is an indication of whether or not there are more values after the 
// ones returned 
hasMore = scanner.nextRaw(results);     
} while (hasMore); 
} finally { 
try { 
scanner.close(); 
} finally { 
region.closeRegionOperation(); 
} 
} 
}






Re: Local index related data bulkload

Posted by "sunfl@certusnet.com.cn" <su...@certusnet.com.cn>.
BTW, the stack trace shows that our job's performance bottleneck lies mainly in the following code:
     region.startRegionOperation(); 
          try { 
               boolean hasMore; 
               do { 
                  List<Cell> results = Lists.newArrayList(); 
             // Results are potentially returned even when the return value of s.next is false 
             // since this is an indication of whether or not there are more values after the 
            // ones returned 
                 hasMore = scanner.nextRaw(results);     //Here.... 
              } while (hasMore); 
            } finally { 
               try { 
                 scanner.close(); 
               } finally { 
                  region.closeRegionOperation(); 
                } 
            } 
         } 




From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:18
To: user
Cc: rajeshbabu chintaguntla
Subject: Local index related data bulkload
Hi all and rajeshbabu,
   Recently our job has encountered severe problems when trying to load data with local indexes
into Phoenix. The data load performance looks very bad compared with our previous data
loading with global indexes. That seems quite absurd, because the Phoenix local index targets
scenarios with heavy writes and space constraints, which is exactly our application.
   Observing the stack trace while our job is running, we find the following info:
   

We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and commented out the batchStarted method. After recompiling Phoenix and restarting the cluster,
our load performance improved significantly. Following is the code for the batchStarted method.
Here are my questions:
1 Can the committers of this code explain exactly what this method does, especially with respect to local index data loading?
2 If we modify this code (e.g. comment out the method as we did), is there any potential impact on how Phoenix works?
3 Even more helpful: can anyone share code showing how to complete a bulk load with local indexes when the data files are stored in HDFS?
I know that CsvBulkload can upsert index-related data while the MapReduce bulk load does not support that. Maybe our job is closer to a MapReduce bulk load? So, if someone
has successfully loaded data through CsvBulkload using Spark and HDFS, please share your suggestions.

Best Regards,
Sun

/** 
* Index builder for covered-columns index that ties into phoenix for faster use. 
*/ 
public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder { 

@Override 
public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException { 
// The entire purpose of this method impl is to get the existing rows for the 
// table rows being indexed into the block cache, as the index maintenance code 
// does a point scan per row 
List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size()); 
List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>(); 
for (int i = 0; i < miniBatchOp.size(); i++) { 
Mutation m = miniBatchOp.getOperation(i); 
keys.add(PDataType.VARBINARY.getKeyRange(m.getRow())); 
maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap())); 
} 
Scan scan = IndexManagementUtil.newLocalStateScan(maintainers); 
ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys), SchemaUtil.VAR_BINARY_SCHEMA); 
scanRanges.setScanStartStopRow(scan); 
scan.setFilter(scanRanges.getSkipScanFilter()); 
HRegion region = this.env.getRegion(); 
RegionScanner scanner = region.getScanner(scan); 
// Run through the scanner using internal nextRaw method 
region.startRegionOperation(); 
try { 
boolean hasMore; 
do { 
List<Cell> results = Lists.newArrayList(); 
// Results are potentially returned even when the return value of s.next is false 
// since this is an indication of whether or not there are more values after the 
// ones returned 
hasMore = scanner.nextRaw(results);     
} while (hasMore); 
} finally { 
try { 
scanner.close(); 
} finally { 
region.closeRegionOperation(); 
} 
} 
}