Posted to user@phoenix.apache.org by Shawn/Xiang Li <sh...@gmail.com> on 2018/12/12 01:58:47 UTC

"upsert select" with "limit" clause

Hi,



1.       I want to check what actually runs under the hood for the limit
clause used in the following upsert statement (does it involve any
coprocessor behind the scenes?):



*                                  upsert into table2 select * from table1 limit 3000000;*  (table1 and table2 have the same schema)



              The above statement runs a lot slower than the statement
without the “limit” clause, shown below, even though it upserts less data:



*                                upsert into table2 select * from table1;*



2.       We also observe that memory usage is pretty high without the limit
clause (8 GB vs 2 GB); for a big table it can sometimes reach 20 GB without
the limit clause.  According to the Phoenix website description for upsert
select: “If auto commit is on, and both a) the target table matches the
source table, and b) the select performs no aggregation, then the
population of the target table will be done completely on the server-side
(with constraint violations logged, but otherwise ignored).”



               My question is: if everything is done on the server side, how
come we have such high memory usage on the client machine?



Thanks,

Shawn

Re: "upsert select" with "limit" clause

Posted by William Shen <wi...@marinsoftware.com>.
Shawn, in my own investigation of SELECT statements running slower with
LIMIT, I have found that with the limit under a certain threshold, Phoenix
will perform the scan in SERIAL instead of PARALLEL. Not sure why that is
the case, but maybe your explain plan would yield the same insight.

On Tue, Dec 18, 2018 at 10:30 AM Vincent Poon <vi...@apache.org>
wrote:

> Shawn,
>
> Can you do an "explain" to show what your two statements are doing?  That
> might give some clues.  Perhaps one is able to be run on the server for
> some reason and the other is not.
> Otherwise, I don't see why one would be substantially slower than the
> other.
>
> Vincent
>
> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <sh...@gmail.com> wrote:
>
>> Hi Jonathan,
>>
>> Being single threaded on one side sounds logical to me. Hopefully Vincent
>> can confirm it.
>>
>> Thanks,
>> Shawn
>>
>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jo...@gmail.com>
>> wrote:
>>
>>> My guess is that, in order to enforce the limit, it’s effectively
>>> single threaded in either the select or the upsert.
>>>
>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <sh...@gmail.com> wrote:
>>>
>>> Hi Vincent,
>>>
>>> Thanks for explaining. That makes much more sense now, and it explains
>>> the high memory usage without the "limit" clause: because "upsert select"
>>> without "limit" upserts much more quickly, the memory usage on the client
>>> machine is much higher than with "upsert select" with "limit".
>>>
>>> So back to the other question: can you explain the underlying Phoenix
>>> implementation of "upsert select" with "limit"? Why is it slower than
>>> without "limit" when inserting a huge number of rows (2M), as in "upsert into
>>> table2 select * from table1 limit 2000000;"? This is much slower than
>>> inserting the whole table (upsert into table2 select * from table1;).
>>>
>>> Thanks,
>>> Xiang
>>>
>>>
>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vi...@apache.org>
>>> wrote:
>>>
>>>> Shawn,
>>>> Your query *upsert into table2 select * from table1;  *would not be
>>>> run on the server - the source and target table are different.  It would
>>>> have to be something like:
>>>> *upsert into table1 select * from table1;*
>>>>
>>>> If you want to run server-side upsert select on a target table that is
>>>> different from the source table, you need to set
>>>> "phoenix.client.enable.server.upsert.select" to true on your client.
>>>> There are some other restrictions: the table can't have any global
>>>> indexes, and the statement can't have a join or where subquery.  We need to
>>>> update the documentation with this information.
>>>>
>>>> The reason there are all these hurdles is because it's generally not
>>>> recommended to do server-side upsert select across different tables,
>>>> because that means you're doing cross-regionserver RPCs (e.g. read data
>>>> from a region of sourcetable, and write to a region of targettable on a
>>>> different regionserver), potentially tying up handlers in an unpredictable
>>>> way.
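
A minimal sketch of enabling that from a JDBC client (the connection URL is
a placeholder; the server-side path also requires auto commit, per the docs
quoted below):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class ServerSideUpsertSelect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // client-side override allowing server-side UPSERT SELECT across tables
            props.setProperty("phoenix.client.enable.server.upsert.select", "true");
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true); // required for the server-side path
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
            }
        }
    }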
>>>>
>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <sh...@gmail.com> wrote:
>>>>
>>>>> Hi Jaanai,
>>>>>
>>>>> According to Phoenix website, " If auto commit is on, and both a) the
>>>>> target table matches the source table, and b) the select performs no
>>>>> aggregation, then the population of the target table will be done
>>>>> completely on the server-side (with constraint violations logged, but
>>>>> otherwise ignored). Otherwise, data is buffered on the client and, if auto
>>>>> commit is on, committed in row batches as specified by the UpsertBatchSize
>>>>> connection property (or the phoenix.mutate.upsertBatchSize HBase config
>>>>> property which defaults to 10000 rows)"
>>>>>
>>>>> And our sql statement is just *upsert into table2 select * from
>>>>> table1;* which should match the first case, so all operations should
>>>>> be on the server side. But the memory usage on the client machine is higher
>>>>> than with the "upsert select ... limit" statement. The memory usage is
>>>>> checked by running the 'top' command under Linux, so we are sure it is
>>>>> caused by "upsert select" in Phoenix and not by anything else, and we can't
>>>>> explain why there is such high memory usage on the client/gateway machine
>>>>> when all operations are supposed to happen on the server side.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Shawn,
>>>>>>
>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>>>>>> clause, queries only 1 table, performs no aggregation, uses no sequences,
>>>>>> and auto commit is on. Please check your SQL ... and you can also check
>>>>>> whether some resources have not been released.
>>>>>>
>>>>>> ----------------------------------------
>>>>>>    Jaanai Zhang
>>>>>>    Best regards!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Shawn Li <sh...@gmail.com> wrote on Thursday, December 13, 2018, at 12:10 PM:
>>>>>>
>>>>>>> Hi Jaanai,
>>>>>>>
>>>>>>> Thanks for sharing your thoughts. The behavior you describe is
>>>>>>> correct on the HBase region server side: memory usage for the block cache
>>>>>>> and memstore will be high under such high throughput. But our Phoenix
>>>>>>> client is on a gateway machine (no HBase region server or any other HBase
>>>>>>> service on it), so we are not sure how to explain such high memory usage
>>>>>>> for upsert select without the "limit" clause. The high memory usage looks
>>>>>>> as if all select results were sent to the client machine, cached in the
>>>>>>> client machine's memory, and then inserted back into the target table,
>>>>>>> which is not the behavior that should happen; all of this should be done on
>>>>>>> the server side since the table schemas are exactly the same. By the way,
>>>>>>> this happens on both Phoenix 4.7 and Phoenix 4.14.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>>
>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <
>>>>>>> cloud.poster@gmail.com> wrote:
>>>>>>>
>>>>>>>> Shawn,
>>>>>>>>
>>>>>>>>
>>>>>>>> The upsert without limit will read the source table and
>>>>>>>> write the target table on the server side.  I think the higher memory
>>>>>>>> usage is caused by the scan cache and memstore under the higher
>>>>>>>> throughput.
>>>>>>>>
>>>>>>>> ----------------------------------------
>>>>>>>>    Jaanai Zhang
>>>>>>>>    Best regards!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Shawn Li <sh...@gmail.com> wrote on Thursday, December 13, 2018, at 10:13 AM:
>>>>>>>>
>>>>>>>>> Hi Vincent,
>>>>>>>>>
>>>>>>>>> You describe that with limit, results are sent to the client side and
>>>>>>>>> then written back to the server; this might explain why upsert with limit
>>>>>>>>> is slower than without limit. But it doesn't seem to explain the memory
>>>>>>>>> usage: the memory usage on the client machine is 8 GB (without "limit")
>>>>>>>>> vs 2 GB (with limit), and upsert without "limit" can sometimes even reach
>>>>>>>>> 20 GB for a big table.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <
>>>>>>>>> vincentpoon@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> I think it's done client-side if you have LIMIT.  If you have
>>>>>>>>>> e.g. LIMIT 1000, it would be incorrect for each regionserver to apply
>>>>>>>>>> the limit independently if you have more than one regionserver (e.g.
>>>>>>>>>> three regionservers each taking 1000 rows would yield up to 3000 rows).
>>>>>>>>>> So instead results are sent back to the client, where the LIMIT is
>>>>>>>>>> applied, and then written back to the server in the UPSERT.
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <sh...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The table creation statement is similar to the one below. We have
>>>>>>>>>>> about 200 fields. The table is mutable and doesn't have any index on it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>       state CHAR(2) NOT NULL,
>>>>>>>>>>>       city VARCHAR,
>>>>>>>>>>>       population BIGINT,
>>>>>>>>>>>       …
>>>>>>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Shawn
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentpoon@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>>>> used?  e.g. Is the table immutable, or does it have indexes?

Re: "upsert select" with "limit" clause

Posted by Vincent Poon <vi...@apache.org>.
Shawn, that's correct.  UPSERT SELECT with the limit does it in a single
thread, whereas without the limit it is done in parallel across multiple
threads.

Your CSV upsert will be faster because it doesn't need to SELECT - it
already has the data.
You can try increasing phoenix.mutate.maxSize
and phoenix.mutate.maxSizeBytes to see if that helps.
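
A minimal sketch of trying that (the connection URL and the values are
placeholders, not recommendations):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class RaiseMutateLimits {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("phoenix.mutate.maxSize", "1000000");         // max buffered rows
            props.setProperty("phoenix.mutate.maxSizeBytes", "1073741824"); // max buffered bytes (1 GiB)
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true);
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000");
            }
        }
    }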

>> On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon <vi...@apache.org>
>> wrote:
>>
>>>
>>> Shawn, that sounds like a bug, I would file a JIRA.
>>>
>>> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <sh...@gmail.com> wrote:
>>>
>>>> Hi Vincent & William,
>>>>
>>>> Below is the explain plan; both are executed as PARALLEL in the plan:
>>>>
>>>> explain upsert into table1 select * from table2;
>>>>
>>>> UPSERT SELECT
>>>> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>>>
>>>> explain upsert into table1 select * from table2 limit 2000000;
>>>>
>>>> UPSERT SELECT
>>>> CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>>>     SERVER 2000000 ROW LIMIT
>>>> CLIENT 2000000 ROW LIMIT
>>>>
>>>> Thanks,
>>>>
>>>> Shawn

Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Vincent,

Thanks for checking the source code. You mentioned that with limit, UPSERT
uses a serial insert; can I interpret this as a single-threaded process? And
are you saying UPSERT without limit is faster because it inserts in parallel
with multiple threads? If so, then both the speed and the memory usage make
sense now.

But the speed of UPSERT with limit is still pretty slow for us: the insert
rate is about 300 rows/sec vs 3000 rows/sec without limit. Bulk CSV data
loading via psql also uses a single thread (per the Phoenix site:
"Single-threaded client loading tool for CSV formatted data via the psql
<https://phoenix.apache.org/download.html#Loading-Data> command"), yet its
insert rate is 3000-5000 rows/sec for us. Is there any other reason upsert
with limit is so slow?

Thanks,
Shawn


Re: "upsert select" with "limit" clause

Posted by Jaanai Zhang <cl...@gmail.com>.
Shawn,

I have done some tests on version 4.14.1-HBase-1.4. The detailed
information is as follows:

CREATE TABLE test (id VARCHAR PRIMARY KEY, c1 varchar, c2 varchar)
SALT_BUCKETS = 10;

explain select * from test where c1 = 'xx' limit 5 offset 100;

CREATE TABLE test1 (id VARCHAR PRIMARY KEY, c1 varchar, c2 varchar)
SALT_BUCKETS = 10;

explain upsert into test1 select * from test limit 10;



0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test limit 10;

+-------------------------------------------------------------------------------------+-----------------+----------------+--------------+
|                                        PLAN                                         | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+-------------------------------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                                        | 2040            | 10             | 0            |
| CLIENT 10-CHUNK 10 ROWS 2040 BYTES *SERIAL* 10-WAY ROUND ROBIN FULL SCAN OVER TEST   | 2040            | 10             | 0            |
|     SERVER 10 ROW LIMIT                                                              | 2040            | 10             | 0            |
| CLIENT 10 ROW LIMIT                                                                  | 2040            | 10             | 0            |
+-------------------------------------------------------------------------------------+-----------------+----------------+--------------+
4 rows selected (0.028 seconds)

0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test;

+------------------------------------------------------------------+-----------------+----------------+--------------+
|                              PLAN                                | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+------------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                    | null            | null           | null         |
| CLIENT 10-CHUNK PARALLEL 10-WAY ROUND ROBIN FULL SCAN OVER TEST  | null            | null           | null         |
+------------------------------------------------------------------+-----------------+----------------+--------------+
2 rows selected (0.033 seconds)


I notice that the UPSERT produces serial scans when the limit clause is
present. What is your Phoenix version?   @Vincent FYI

----------------------------------------
   Jaanai Zhang
   Best regards!




Re: "upsert select" with "limit" clause

Posted by Vincent Poon <vi...@apache.org>.
Shawn,

Took a quick look, I think what is happening is the UPSERT is done serially
when you have LIMIT.
Parallel scans are issued for the SELECT, which is why the explain plan
shows PARALLEL, but then the results are concatenated via a single
LimitingResultIterator, in order to apply the CLIENT LIMIT.
The upsert then reads from that iterator and does the mutations in batches.

To insert in parallel, we would need some sort of shared state between the
writing threads to ensure we respect the limit, and I don't think we
currently have something like that.

Vincent
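
For illustration only, a hypothetical sketch of the kind of shared state
described above (nothing like this exists in Phoenix today): an atomic
countdown that parallel writer threads would check before buffering each row.

    import java.util.concurrent.atomic.AtomicLong;

    final class LimitBudget {
        private final AtomicLong remaining;

        LimitBudget(long limit) {
            this.remaining = new AtomicLong(limit);
        }

        // Returns true if the caller may write one more row; false once the
        // global limit is exhausted and this writer thread should stop.
        boolean tryAcquire() {
            while (true) {
                long r = remaining.get();
                if (r <= 0) {
                    return false;
                }
                if (remaining.compareAndSet(r, r - 1)) {
                    return true;
                }
            }
        }
    }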


Re: "upsert select" with "limit" clause

Posted by Vincent Poon <vi...@apache.org>.
Shawn, that sounds like a bug; I would file a JIRA.


Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Vincent & William,



Below are the explain plans; both are executed in PARALLEL according to the plan:



explain upsert into table1 select * from table2;



UPSERT SELECT
CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2



explain upsert into table1 select * from table2 limit 2000000;



UPSERT SELECT
CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
    SERVER 2000000 ROW LIMIT
CLIENT 2000000 ROW LIMIT
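
To reproduce plans like these over JDBC, a minimal sketch (the ZooKeeper
quorum in the connection string is a placeholder, and it assumes the Phoenix
client jar is on the classpath; EXPLAIN returns one plan step per row):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Prints each step of the plan for the limited upsert select.
public class ExplainUpsertSelect {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:localhost:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "EXPLAIN UPSERT INTO table1 SELECT * FROM table2 LIMIT 2000000")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}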







Thanks,

Shawn


Re: "upsert select" with "limit" clause

Posted by Vincent Poon <vi...@apache.org>.
Shawn,

Can you do an "explain" to show what your two statements are doing?  That
might give some clues.  Perhaps one is able to be run on the server for
some reason and the other is not.
Otherwise, I don't see why one would be substantially slower than the other.

Vincent


Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Jonathan,

The single-threaded behavior on one side sounds logical to me. Hopefully
Vincent can confirm it.

Thanks,
Shawn


Re: "upsert select" with "limit" clause

Posted by Jonathan Leech <jo...@gmail.com>.
My guess is that, in order to enforce the limit, it's effectively single-threaded in either the select or the upsert.


Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Vincent,

Thanks for explaining. That makes much more sense now, and it explains the
high memory usage without the "limit" clause: because "upsert select"
without "limit" upserts much faster, the memory usage on the client machine
is much higher than with "limit".

So back to the other question: can you explain the underlying Phoenix
implementation of "upsert select ... limit"? Why is it slower than without
"limit" when inserting a huge number of rows (2M), as in "upsert into
table2 select * from table1 limit 2000000;"? It is much slower than
inserting the whole table (upsert into table2 select * from table1;).

Thanks,
Xiang



Re: "upsert select" with "limit" clause

Posted by Vincent Poon <vi...@apache.org>.
Shawn,
Your query *upsert into table2 select * from table1;* would not be run on
the server - the source and target tables are different.  It would have to
be something like:
*upsert into table1 select * from table1;*

If you want to run server-side upsert select on a target table that is
different from the source table, you need to set
"phoenix.client.enable.server.upsert.select" to true on your client.
There are some other restrictions: the table can't have any global indexes,
and the statement can't have a join or a where subquery.  We need to update
the documentation with this information.
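
A minimal sketch of opting in on the client, assuming the standard Phoenix
JDBC driver and a placeholder ZooKeeper quorum ("zk-host"); a sketch under
those assumptions, not a definitive recipe:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class ServerSideUpsertSelectSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Opt in to server-side execution of cross-table UPSERT SELECT.
            props.setProperty("phoenix.client.enable.server.upsert.select",
                    "true");
            // "zk-host" is a placeholder for your ZooKeeper quorum.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:phoenix:zk-host", props)) {
                conn.setAutoCommit(true); // auto commit must be on
                try (Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
                }
            }
        }
    }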

The reason for all these hurdles is that it's generally not recommended to
do server-side upsert select across different tables, because that means
you're doing cross-regionserver RPCs (e.g. reading data from a region of
the source table and writing to a region of the target table on a different
regionserver), potentially tying up handlers in an unpredictable way.


Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Jaanai,

According to Phoenix website, " If auto commit is on, and both a) the
target table matches the source table, and b) the select performs no
aggregation, then the population of the target table will be done
completely on the server-side (with constraint violations logged, but
otherwise ignored). Otherwise, data is buffered on the client and, if auto
commit is on, committed in row batches as specified by the UpsertBatchSize
connection property (or the phoenix.mutate.upsertBatchSize HBase config
property which defaults to 10000 rows)"

And our SQL statement is just: *upsert into table2 select * from table1;*
which should match the first case, so all operations should happen on the
server side. But the memory usage on the client machine is higher than with
the "upsert select ... limit" statement. The memory usage was checked by
running the 'top' command under Linux, so we are sure it is caused by the
Phoenix "upsert select" and nothing else, and we can't explain such high
memory usage on the client/gateway machine when all operations are supposed
to happen on the server side.
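
(If the statement instead takes the buffered client-side path described in
that quote, the commit batch size can in principle be tuned. A minimal
sketch, assuming the phoenix.mutate.upsertBatchSize property named in the
quote can be passed as a Phoenix connection property and a placeholder
"zk-host" quorum:)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class UpsertBatchSizeSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Assumption: this HBase config property is accepted as a
            // connection property; the docs say it defaults to 10000 rows.
            props.setProperty("phoenix.mutate.upsertBatchSize", "50000");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:phoenix:zk-host", props)) {
                conn.setAutoCommit(true); // rows then commit in batches
            }
        }
    }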

Thanks,
Shawn


Re: "upsert select" with "limit" clause

Posted by Jaanai Zhang <cl...@gmail.com>.
Shawn,

The UPSERT SELECT will run in a coprocessor only if it has no limit clause,
queries just one table, performs no aggregation, uses no sequences, and
auto commit is on. Please check your SQL ... and you can also check whether
some resources have not been released.
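
One way to check which path a given statement actually takes is to EXPLAIN
it and inspect the plan; a minimal sketch, again assuming a placeholder
"zk-host" quorum:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:phoenix:zk-host");
                 Statement stmt = conn.createStatement();
                 // EXPLAIN accepts UPSERT SELECT as well as plain SELECT.
                 ResultSet rs = stmt.executeQuery(
                         "EXPLAIN UPSERT INTO table1 SELECT * FROM table1")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // one plan step per row
                }
            }
        }
    }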

----------------------------------------
   Jaanai Zhang
   Best regards!




Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Jaanai,

Thanks for sharing your thoughts. The behavior you describe is correct on
the HBase region server side: the memory usage for the block cache and
memstore will be high under such high throughput. But our Phoenix client is
on a gateway machine (no HBase region server or any other HBase service
running on it), so we are not sure how to explain such high memory usage
for upsert select without the "limit" clause. The high memory usage looks
as if all select results were sent to the client machine, cached in the
client machine's memory, and then inserted back into the target table -
which is not the behavior that should happen; all of this should be done on
the server side since the table schemas are exactly the same. By the way,
this happens on both Phoenix 4.7 and Phoenix 4.14.


Thanks,
Shawn


Re: "upsert select" with "limit" clause

Posted by Jaanai Zhang <cl...@gmail.com>.
Shawn,


An upsert without a limit clause reads the source table and writes the
target table on the server side.  I think the higher memory usage is caused
by the scan cache and the memstore under the higher throughput.

----------------------------------------
   Jaanai Zhang
   Best regards!




Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Vincent,

So you are saying that with "limit" the results are sent to the client side
and then written back to the server; this might explain why upsert with
"limit" is slower than without it. But it doesn't seem to explain the
memory usage: the memory usage on the client machine is 8 GB (without
"limit") vs 2 GB (with "limit"), and sometimes upsert without "limit" can
even reach 20 GB for a big table.

Thanks,
Shawn


Re: "upsert select" with "limit" clause

Posted by William Shen <wi...@marinsoftware.com>.
Vincent,
Do we expect to see the same behavior with SELECT?

I observed the following. I'm not sure what about applying the limit adds
to the time... especially since only one row is returned, far fewer than
the actual LIMIT.

SELECT tb1."updBy" FROM "prod"."ADGROUPS" tb1 WHERE
("cstId","cltId","pubId","accId","cpgnId") IN
((612266,61623806,7,1736995,343308425));

+--------+
| updBy  |
+--------+
| null   |
+--------+
1 row selected (0.144 seconds)

 SELECT tb1."updBy" FROM "prod"."ADGROUPS" tb1 WHERE
("cstId","cltId","pubId","accId","cpgnId") IN
((612266,61623806,7,1736995,343308425)) LIMIT 10000;

+--------+
| updBy  |
+--------+
| null   |
+--------+
1 row selected (0.571 seconds)


EXPLAIN SELECT tb1."updBy" FROM "prod"."ADGROUPS" tb1 WHERE
("cstId","cltId","pubId","accId","cpgnId") IN
((612266,61623806,7,1736995,343308425));

PLAN: CLIENT 255-CHUNK 0 ROWS 0 BYTES PARALLEL 255-WAY ROUND ROBIN RANGE
      SCAN OVER prod.ADGROUPS [0,612266,61623806,7,1736995,343308425] -
      [254,612266,61623806,7,1736995,343308425]
EST_BYTES_READ: 0    EST_ROWS_READ: 0

1 row selected (0.042 seconds)


 EXPLAIN SELECT tb1."updBy" FROM "prod"."ADGROUPS" tb1 WHERE
("cstId","cltId","pubId","accId","cpgnId") IN
((612266,61623806,7,1736995,343308425)) LIMIT 10000;

PLAN: CLIENT 255-CHUNK SERIAL 255-WAY ROUND ROBIN RANGE SCAN OVER
      prod.ADGROUPS [0,612266,61623806,7,1736995,343308425] -
      [254,612266,61623806,7,1736995,343308425]
          SERVER 10000 ROW LIMIT
      CLIENT 10000 ROW LIMIT
EST_BYTES_READ: 80520000    EST_ROWS_READ: 10000    EST_INFO_TS: 15446585013
(the same estimates are reported for each plan step)

3 rows selected (0.117 seconds)


Re: "upsert select" with "limit" clause

Posted by Vincent Poon <vi...@apache.org>.
I think it's done client-side if you have a LIMIT.  If you have e.g. LIMIT
1000, each regionserver can't simply apply the limit on its own - with more
than one regionserver the combined result would be wrong.  So instead the
results are sent back to the client, where the LIMIT is applied, and then
written back to the server in the UPSERT.
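
A rough sketch of that flow from the client's point of view, assuming a
hypothetical two-column schema and a placeholder "zk-host" quorum (the real
client buffers and batches mutations internally rather than writing row by
row):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ClientSideLimitSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:phoenix:zk-host")) {
                conn.setAutoCommit(false);
                try (Statement read = conn.createStatement();
                     // rows stream back to the client, where the limit holds
                     ResultSet rs = read.executeQuery(
                             "SELECT state, city FROM table1 LIMIT 3000000");
                     PreparedStatement write = conn.prepareStatement(
                             "UPSERT INTO table2 (state, city) VALUES (?, ?)")) {
                    long n = 0;
                    while (rs.next()) {
                        write.setString(1, rs.getString(1));
                        write.setString(2, rs.getString(2));
                        write.executeUpdate();
                        if (++n % 10000 == 0) {
                            conn.commit(); // flush a batch of mutations
                        }
                    }
                    conn.commit();
                }
            }
        }
    }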


Re: "upsert select" with "limit" clause

Posted by Shawn Li <sh...@gmail.com>.
Hi Vincent,



The table creation statement is similar to the one below. We have about 200
fields. The table is mutable and doesn't have any indexes.

CREATE TABLE IF NOT EXISTS us_population (
      state CHAR(2) NOT NULL,
      city VARCHAR,
      population BIGINT,
      …
      CONSTRAINT my_pk PRIMARY KEY (state));



Thanks,

Shawn




Re: "upsert select" with "limit" clause

Posted by Vincent Poon <vi...@apache.org>.
For #2, can you provide the table definition and the statement used?  e.g.
Is the table immutable, or does it have indexes?
