Posted to user@phoenix.apache.org by "Riesland, Zack" <Za...@sensus.com> on 2016/01/18 18:57:29 UTC

Guidance on how many regions to plan for

In the past, my struggles with HBase/Phoenix have been related to data ingest.

Each night, we ingest lots of data via CsvBulkUpload.

After lots of trial and error trying to get our largest table to cooperate, I found a primary key that distributes well if I specify the split criteria on table creation.

That table now has ~15 billion rows representing about 300GB of data across 513 regions (on 9 region servers).
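
For reference, a pre-split table like that is declared through the Phoenix JDBC driver along the lines of the sketch below. It is only an illustration: the connection string, table name, columns, and split points are placeholders, not the real schema from this thread.

// Sketch of creating a pre-split Phoenix table via JDBC.
// SENSOR_READINGS, METER_ID, READ_TIME, VAL and the split points are
// illustrative placeholders, not the schema discussed in this thread.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // SPLIT ON pre-creates regions at the given row-key boundaries,
            // so bulk-loaded writes are spread across region servers from the start.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS SENSOR_READINGS (" +
                "  METER_ID VARCHAR NOT NULL, " +
                "  READ_TIME DATE NOT NULL, " +
                "  VAL DOUBLE " +
                "  CONSTRAINT PK PRIMARY KEY (METER_ID, READ_TIME)) " +
                "SPLIT ON ('1000000', '2000000', '3000000')");
        }
    }
}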

Life was good for a while.

Now, I have a new use case that needs another, very similar table; but rather than serving UI-based reports, this table will be queried programmatically and VERY heavily (millions of queries per day).

I have asked about this in the past, but got derailed to other things, so I'm trying to zoom out a bit and make sure I approach this problem correctly.

My simplified use case is basically: de-dup input files against Phoenix before passing them on to the rest of our ingest process. This will result in tens of thousands of queries to Phoenix per input file.
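
The de-dup check itself is just a point lookup on the full primary key, along the lines of the sketch below (same placeholder names as above, so treat it as an assumption about the schema rather than the real thing).

// Sketch of the de-dup lookup: returns true if the key is already stored,
// i.e. the incoming record is a duplicate and can be skipped.
// SENSOR_READINGS / METER_ID / READ_TIME are placeholder names.
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DedupCheck {
    static boolean alreadyLoaded(Connection conn, String meterId, Date readTime) throws Exception {
        // A lookup on the complete primary key is a single-row point query,
        // so it only touches the one region that owns that key.
        String sql = "SELECT 1 FROM SENSOR_READINGS WHERE METER_ID = ? AND READ_TIME = ? LIMIT 1";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, meterId);
            ps.setDate(2, readTime);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            System.out.println(alreadyLoaded(conn, "1234567", Date.valueOf("2016-01-18")));
        }
    }
}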

I noted in the past that after 5-10K rapid-fire queries, response times degrade dramatically. And I think we established that this is because one thread is spawned per 20 MB chunk of data in each region (?)

More generally, it seems that the more regions there are in my table, the more resource-intensive Phoenix queries become?

Is that correct?

I estimate that my table will contain about 500GB of data by the end of 2016.

The rows are pretty small (like 6 or 8 small columns). I have 9 region servers - soon to be 12.

The distribution is usually 2,000-5,000 rows per primary key, which is about 0.5 - 3 MB of data.

Given that information, is there a good rule of thumb for how many regions I should try to target with my schema/primary key design?

I experimented with salt buckets (presumably letting Phoenix choose how to split everything), but I keep getting errors when I try to bulk load data into salted tables ("Import job on table blah failed due to exception: java.io.IOException: Trying to load more than 32 hfiles to one family of one region").
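
For reference, the salted variant differs only in the table option: SALT_BUCKETS tells Phoenix to prepend a hash byte to the row key and pre-split the table into that many buckets, so no manual SPLIT ON list is needed. Again, the names and bucket count below are placeholders, not a recommendation.

// Sketch of creating a salted Phoenix table via JDBC (placeholder names).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSaltedTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // Phoenix adds a leading salt byte (0..N-1) to each row key and
            // pre-creates one region per bucket.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS SENSOR_READINGS_SALTED (" +
                "  METER_ID VARCHAR NOT NULL, " +
                "  READ_TIME DATE NOT NULL, " +
                "  VAL DOUBLE " +
                "  CONSTRAINT PK PRIMARY KEY (METER_ID, READ_TIME)) " +
                "SALT_BUCKETS = 36");
        }
    }
}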

Are there HBase configuration tweaks I should focus on? My current memstore size is set to 256 MB.

Thanks for any guidance or tips here.




Re: Guidance on how many regions to plan for

Posted by Ravi Kiran <ma...@gmail.com>.
Hi Zack,
   The 32-HFile limit comes from the configuration property behind
MAX_FILES_PER_REGION_PER_FAMILY
(hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily), which defaults to 32
in LoadIncrementalHFiles.
   You can try setting it to a larger value in your configuration and see if
that works.


https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java#L116
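
For example, if the bulk load is driven programmatically, the override looks roughly like the sketch below; when the Phoenix CsvBulkLoadTool is run from the command line it should also be possible to pass the same property with -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=128 (treat 128 as an arbitrary example, not a tuned recommendation).

// Sketch: raising the per-region, per-family HFile limit for a bulk load.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RaiseHFileLimit {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Defaults to 32; raise it so a heavily split/salted table can accept
        // more HFiles per region and column family in one bulk load.
        conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 128);
        // ...pass 'conf' to the bulk load job / LoadIncrementalHFiles run...
    }
}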


Thanks
Ravi
