Posted to user@phoenix.apache.org by Alexander Batyrshin <0x...@gmail.com> on 2019/09/03 12:03:29 UTC

Any reason for so small phoenix.mutate.batchSize by default?

 Hello all,

1) There is a bug in the documentation - http://phoenix.apache.org/tuning.html
phoenix.mutate.batchSize is not 1000, but only 100 by default:
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java#L164
Changed in https://issues.apache.org/jira/browse/PHOENIX-541


2) I want to discuss this default value. From PHOENIX-541 <https://issues.apache.org/jira/browse/PHOENIX-541> I read about an issue with MR and wide rows (2 MB per row), and it looks like a rare case. But in most common cases we can get much better write performance with batchSize = 1000, especially if it is used with a salted table.
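For anyone who wants to experiment with this: the property can be overridden on the client side. A sketch of a client-side hbase-site.xml override, assuming the standard Hadoop-style configuration format (the same key can also be passed as a JDBC connection property):

```xml
<!-- client-side hbase-site.xml: raise the mutation batch size from the 100 default -->
<property>
  <name>phoenix.mutate.batchSize</name>
  <value>1000</value>
</property>
```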

Re: Any reason for so small phoenix.mutate.batchSize by default?

Posted by Alexander Batyrshin <0x...@gmail.com>.

> On 3 Sep 2019, at 19:45, Alexander Batyrshin <0x...@gmail.com> wrote:
> 
> I observed that there are some extra mutations in the batch for every one of my UPSERTs.
> For example, if the app calls executeUpdate() only 5 times, then on commit there will be "DEBUG MutationState:1046 - Sent batch of 10".
> I can’t figure out where these extra mutations come from and why.
> 
> This means that the “useful” batch size is phoenix.mutate.batchSize / 2.


The extra mutation is a Mutation.Delete, because I have a NULL value in the UPSERT statement.
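The doubling described above can be sketched with plain arithmetic (this is an illustration, not the Phoenix API): an UPSERT that sets a column to NULL contributes a Put for the non-null columns plus a Delete for the NULL column, so each row produces two mutations.

```java
// Illustration only (not Phoenix code): why 5 executeUpdate() calls
// can show up as "Sent batch of 10" in the MutationState DEBUG log.
public class BatchCount {
    // Each upserted row contributes mutationsPerRow mutations
    // (here: one Put plus one Delete for a NULL column).
    static int mutationsInBatch(int upserts, int mutationsPerRow) {
        return upserts * mutationsPerRow;
    }

    public static void main(String[] args) {
        System.out.println("Sent batch of " + mutationsInBatch(5, 2));
    }
}
```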


Re: Any reason for so small phoenix.mutate.batchSize by default?

Posted by Alexander Batyrshin <0x...@gmail.com>.
I observed that there are some extra mutations in the batch for every one of my UPSERTs.
For example, if the app calls executeUpdate() only 5 times, then on commit there will be "DEBUG MutationState:1046 - Sent batch of 10".
I can’t figure out where these extra mutations come from and why.

This means that the “useful” batch size is phoenix.mutate.batchSize / 2.

> * What does your table DDL look like?

CREATE TABLE IF NOT EXISTS TABLE_CODES (
    "id" VARCHAR NOT NULL PRIMARY KEY,
    "d"."tg" VARCHAR,
    "d"."drip" VARCHAR,
    "d"."s" UNSIGNED_TINYINT,
    "d"."se" UNSIGNED_TINYINT,
    "d"."rle" UNSIGNED_TINYINT,
    "d"."dme" TIMESTAMP,
    "d"."dpa" TIMESTAMP,
    "d"."p" VARCHAR,
    "d"."pt" UNSIGNED_TINYINT,
    "d"."x" VARCHAR,
    "d"."pn" VARCHAR,
    "d"."b" VARCHAR,
    "d"."hc" VARCHAR ARRAY,
    "d"."ns" VARCHAR(16),
    "d"."tv" VARCHAR(10),
    "d"."vcp" VARCHAR,
    "d"."et" UNSIGNED_TINYINT,
    "d"."xoa" BINARY(16),
    "d"."j" VARCHAR
) SALT_BUCKETS=30, COLUMN_ENCODED_BYTES=NONE;

CREATE INDEX "IDX_CIS_O" ON "TABLE_CODES" ("d"."x", "d"."dme") INCLUDE("d"."tg", "d"."rle", "d"."pt" ... ) SALT_BUCKETS=30;
CREATE INDEX "IDX_CIS_PRID" ON "TABLE_CODES" ("d"."drip", "d"."dme") INCLUDE("d"."tg", "d"."rle", "d"."pt" ...) SALT_BUCKETS=30;

In my case, with SALT_BUCKETS=30 and default settings, every batch will carry only 50 “useful” rows, and they will be split across 30 servers, so every server will get only 1-2 rows.
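The per-server numbers above follow from simple division; a back-of-the-envelope sketch (plain Java, the method name is mine):

```java
// Illustration of the salting arithmetic above (not Phoenix code).
public class SaltMath {
    // batchSize mutations, mutationsPerRow mutations per row (Put + Delete),
    // spread across saltBuckets servers in the worst (uniform) case.
    static int rowsPerServer(int batchSize, int mutationsPerRow, int saltBuckets) {
        int usefulRows = batchSize / mutationsPerRow;
        return usefulRows / saltBuckets;
    }

    public static void main(String[] args) {
        System.out.println(rowsPerServer(100, 2, 30));  // default batchSize: ~1 row per server
        System.out.println(rowsPerServer(1000, 2, 30)); // proposed batchSize: ~16 rows per server
    }
}
```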

> * How large is one mutation you're writing (in bytes)?

Any idea how to calculate it?
https://phoenix.apache.org/metrics.html will give me the total mutation count and the total size in bytes of a batch. But as I mentioned before, there are “extra” mutations that will skew those statistics.
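One hedged approach using only the two totals the metrics page provides: divide the total bytes by the number of rows rather than the raw mutation count, to discount the extra Delete mutations. A sketch with made-up numbers (not the Phoenix metrics API itself):

```java
// Illustration only: estimating average bytes per upserted row from
// the batch totals reported by Phoenix client metrics.
public class MutationSize {
    // Assumes each row contributed mutationsPerRow mutations (e.g. Put + Delete).
    static double avgBytesPerRow(long totalBytes, long totalMutations, int mutationsPerRow) {
        long rows = totalMutations / mutationsPerRow;
        return (double) totalBytes / rows;
    }

    public static void main(String[] args) {
        // e.g. a batch reported as 10 mutations / 5120 bytes for 5 upserted rows
        System.out.println(avgBytesPerRow(5120, 10, 2)); // 1024.0 bytes per row
    }
}
```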

> * How much data ends up being sent to a RegionServer in one RPC?
Where can I get this metric?


> On 3 Sep 2019, at 17:19, Josh Elser <el...@apache.org> wrote:
> 
> Hey Alexander,
> 
> Was just poking at the code for this: it looks like this is really just determining the number of mutations that get "processed together" (as opposed to a hard limit).
> 
> Since you have done some work, I'm curious if you could generate some data to help back up your suggestion:
> 
> * What does your table DDL look like?
> * How large is one mutation you're writing (in bytes)?
> * How much data ends up being sent to a RegionServer in one RPC?
> 
> You're right in that we would want to make sure that we're sending an adequate amount of data to a RegionServer in an RPC, but this is tricky to balance for all cases (thus, setting a smaller value to avoid sending batches that are too large is safer).
> 
> On 9/3/19 8:03 AM, Alexander Batyrshin wrote:
>>  Hello all,
>> 1) There is a bug in the documentation - http://phoenix.apache.org/tuning.html
>> phoenix.mutate.batchSize is not 1000, but only 100 by default:
>> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java#L164
>> Changed in https://issues.apache.org/jira/browse/PHOENIX-541
>> 2) I want to discuss this default value. From PHOENIX-541 I read about an issue with MR and wide rows (2 MB per row), and it looks like a rare case. But in most common cases we can get much better write performance with batchSize = 1000, especially if it is used with a salted table.


Re: Any reason for so small phoenix.mutate.batchSize by default?

Posted by Josh Elser <el...@apache.org>.
Hey Alexander,

Was just poking at the code for this: it looks like this is really just 
determining the number of mutations that get "processed together" (as 
opposed to a hard limit).

Since you have done some work, I'm curious if you could generate some 
data to help back up your suggestion:

* What does your table DDL look like?
* How large is one mutation you're writing (in bytes)?
* How much data ends up being sent to a RegionServer in one RPC?

You're right in that we would want to make sure that we're sending an 
adequate amount of data to a RegionServer in an RPC, but this is tricky 
to balance for all cases (thus, setting a smaller value to avoid sending 
batches that are too large is safer).

On 9/3/19 8:03 AM, Alexander Batyrshin wrote:
>   Hello all,
> 
> 1) There is a bug in the documentation - http://phoenix.apache.org/tuning.html
> phoenix.mutate.batchSize is not 1000, but only 100 by default:
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java#L164
> Changed in https://issues.apache.org/jira/browse/PHOENIX-541
> 
> 
> 2) I want to discuss this default value. From PHOENIX-541 I read about 
> an issue with MR and wide rows (2 MB per row), and it looks like a rare 
> case. But in most common cases we can get much better write performance 
> with batchSize = 1000, especially if it is used with a salted table.