Posted to user@phoenix.apache.org by you Zhuang <zh...@gmail.com> on 2019/08/29 08:11:32 UTC

Is there a way to specify the number of splits or reducers when creating a Phoenix table?

I have a chronological series of data. A data row looks like dt, r1, r2, r3, r4, r5, r6, d1, d2, d3, d4, d5 …

dt is formatted as 20190829 and increases monotonically, e.g. 20190830, 20190831, ...

The query pattern is something like: select * from table where dt between 20180620 and 20190829 and r3 = ? and r6 = ?;

dt is mandatory, the remaining filters are some random combination of r1 to r6, and the selected columns are always all columns (*).


I have made dt, r1, r2, … r6 a compound primary key. The CREATE TABLE statement is below:

CREATE TABLE app.table(
 Dt integer not null ,
 R1 integer not null,
 R2 integer not null,
 R3 integer not null,
 R4 integer not null,
 R5 integer not null,
 R6 integer not null,

 D1 decimal(30,6),
 D2 decimal(30,6),
 D3 decimal(30,6),
 D4 decimal(30,6),
 D5 decimal(30,6),
 D6 decimal(30,6)


 CONSTRAINT pk PRIMARY KEY (dt,r1,r2,r3,r4,r5,r6)
) SALT_BUCKETS = 3, UPDATE_CACHE_FREQUENCY = 300000, COMPRESSION = 'SNAPPY', SPLIT_POLICY = 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy', MAX_FILESIZE = '5000000000';

I have 3 region servers, so I chose SALT_BUCKETS = 3.

But when I initially load the table data with the CsvBulkLoadTool, dt ranges from 20180620 to 20190829 and the data size is about 1 TB.

The CsvBulkLoadTool MapReduce job shows only 3 partitions for the reducers, and it always fails because there are so few partitions.

I tried increasing SALT_BUCKETS to 512, but the maximum SALT_BUCKETS is 256; I set it to 256 and it still does not work.



I know I can use SPLIT ON (…) when creating the table, but I don't know how to determine the points, and hundreds of points is daunting.
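
For example, something like this is what I mean (just a rough sketch with made-up monthly boundaries on dt; I left out SALT_BUCKETS because I'm not sure explicit split points can be combined with salting, and the real points would have to come from the actual data distribution):

CREATE TABLE app.table (
 dt integer not null,
 r1 integer not null, r2 integer not null, r3 integer not null,
 r4 integer not null, r5 integer not null, r6 integer not null,
 d1 decimal(30,6), d2 decimal(30,6), d3 decimal(30,6),
 d4 decimal(30,6), d5 decimal(30,6), d6 decimal(30,6)
 CONSTRAINT pk PRIMARY KEY (dt, r1, r2, r3, r4, r5, r6)
) UPDATE_CACHE_FREQUENCY = 300000, COMPRESSION = 'SNAPPY'
SPLIT ON (
 20180701, 20180801, 20180901, 20181001, 20181101, 20181201,
 20190101, 20190201, 20190301, 20190401, 20190501, 20190601,
 20190701, 20190801
);

Monthly boundaries like these would already give a dozen or so regions over my date range; daily boundaries would be the hundreds of points I mentioned.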


So is there a way to specify the number of splits or reducers when creating a Phoenix table?

I would appreciate any advice on tuning this scenario.


Re: Is there a way to specify the number of splits or reducers when creating a Phoenix table?

Posted by you Zhuang <zh...@gmail.com>.
Sorry, I don't quite follow your point. When I set SALT_BUCKETS = 256, the number of regions actually is 256.

Pre-splitting along date boundaries would mean hundreds of split points, which is heavy work.

The biggest problem I'm facing is how to bulk load 1 TB of data into the Phoenix table.


> On Aug 29, 2019, at 8:18 PM, Josh Elser <el...@apache.org> wrote:
> 
> inside


Re: Is there a way to specify the number of splits or reducers when creating a Phoenix table?

Posted by Josh Elser <el...@apache.org>.
Configuring salt buckets is not the same thing as pre-splitting a table. 
You should not be setting a crazy large number of buckets like you are.

If you want more parallelism in the MapReduce job, pre-split along 
date-boundaries, with the salt bucket taken into consideration (e.g. 
\x00_date, \x01_date, \x02_date).

HBase requires that a file to be bulk-loaded fit inside of a single 
region. A Reducer will only generate data for a single Region (as a 
Reducer can only generate one file). Create more regions, and you will 
get more parallelism.
