You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@phoenix.apache.org by Gerald Sangudi <gs...@23andme.com.INVALID> on 2018/11/05 20:17:36 UTC
Re: Salting based on partial rowkeys

Hi folks,

I would like to continue the discussion below from a few weeks ago. I would
like to address the feedback from Thomas, Jaanai, and Sergey.

In thinking about this some more, it amounts to introducing a form of hash
partitioning in Phoenix + HBase. For this to work in the general case,
region splits and merges would need to be disabled, and the regions
pre-defined. This is already supported by DisableRegionSplitPolicy.

I understand that HBase is range-partitioned. What we are proposing is to
allow specialized use cases for users to manage their own partitioning. All
other users would be unaffected.

Jaanai also mentioned that salting is meant to address write hotspots. That
is true. We are proposing an additional use of salting (or if you prefer,
another feature altogether) for specialized use cases. Again, all other
users would be unaffected.

We have a lot of data in HBase, and some of our queries would benefit from
some sort of hash partitioning. That is the crux of our proposal.

Some specific responses:

@Josh -- exactly.

@Thomas -- if we can partition the data exactly how we want, we can make
sure that certain queries do not go across regions. We can either scan only
the matching regions, and we can perform the full aggregation within each
matching region, without having to do a merge.

@Jaanai, @Sergey -- I hope I explained about hotspotting above. We would
also be fine calling it something other than salting. Maybe that's better
to avoid confusion.

@Lars -- you posted questions in PHOENIX-4757. I will respond there.

Thanks everyone for all the feedback on this. Our goal is to discuss all
the concerns, and then finally get a yay or nay consensus from the
committers.

Gerald

On Sun, Sep 16, 2018 at 9:52 PM larsh@apache.org <la...@apache.org> wrote:

>  I added some comments on the PHOENIX-4757
>
>     On Thursday, September 13, 2018, 6:42:12 PM PDT, Josh Elser <
> elserj@apache.org> wrote:
>
>  Ahh, I get you now.
>
> For a composite primary key made up of columns 1 through N, you want
> similar controls to compute the value of the salt based on a sequence of
> the columns 1 through M where M <= N (instead of always on all columns).
>
> For large numbers of salt buckets and a scan over a facet, you prune
> your search space considerably. Makes sense to me!
>
> On 9/13/18 6:37 PM, Gerald Sangudi wrote:
> > In case the text formatting is lost below, I also added it as a comment
> in
> > the JIRA ticket:
> >
> > https://issues.apache.org/jira/browse/PHOENIX-4757
> >
> >
> > On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi <gs...@23andme.com>
> > wrote:
> >
> >> Sorry I missed Josh's reply; I've subscribed to the dev list now.
> >>
> >> Below is a copy-and-paste from our internal document. Thanks in advance
> >> for your review and additional feedback on this.
> >>
> >> Gerald
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> *BackgroundWe make extensive use of multi-column rowkeys and salting
> >> <https://phoenix.apache.org/salted.html> in our different apache
> phoenix
> >> deployments. We frequently perform group-by aggregations on these data
> >> along a specific dimension that would benefit from predictably
> partitioning
> >> the data along that dimension. Proposal:We propose to add table
> metadata to
> >> allow schema designers to constrain salting to a subset of the rowkey,
> >> rather than the full rowkey as it is today. This will introduce a
> mechanism
> >> to partition data on a per-table basis along a single dimension without
> >> application changes or much change to the phoenix runtime logic. We
> expect
> >> this will result in substantially faster group-by’s along the salted
> >> dimension and negligible penalties elsewhere. This feature has also been
> >> proposed in PHOENIX-4757
> >> <https://issues.apache.org/jira/browse/PHOENIX-4757> where it was
> pointed
> >> out that partitioning and sorting data along different dimensions is a
> >> common pattern in other datastores as well.Theoretically, it could cause
> >> hotspotting when querying along the salted dimension without the leading
> >> rowkey - that would be an anti-pattern.Usage
> ExampleCurrent:Schema:CREATE
> >> TABLE relationship (id_1 BIGINT NOT NULL,id_2 BIGINT NOT NULL,other_key
> >> BIGINT NOT NULL,val SMALLINT,CONSTRAINT pk PRIMARY KEY (id_1, id_2,
> >> other_key))SALT_BUCKETS=60;Query:Select id_2, sum(val)From
> >> relationshipWhere id_1 in (2,3)Group by id_2Explain:0: jdbc:phoenix:>
> >> EXPLAIN Select id_2, sum(val) From relationship Where id_1 in (2,3)
> Group
> >> by id_2
> >>
> ;+-----------------------------------------------------------------------------------------+--------+|
> >>                                          PLAN    | EST_BY
> >>
> |+-----------------------------------------------------------------------------------------+--------+|
> >> CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP
> >> [0,2] - [59,3]  | null ||    SERVER AGGREGATE INTO DISTINCT ROWS BY
> [ID_2]
> >>                                        | null || CLIENT MERGE SORT
> >>                                                                        |
> >> null
> >>
> |+-----------------------------------------------------------------------------------------+--------+3
> >> rows selected (0.048 seconds)In this case, although the group by is
> >> performed on both the client and regionserver, almost all of the actual
> >> grouping happens on the client because the id_2’s are randomly
> distributed
> >> across the regionservers. As a result, a lot of unnecessary data is
> >> serialized to the client and grouped serially there. This can become
> quite
> >> material with large resultsets.Proposed:Schema:CREATE TABLE relationship
> >> (id_1 BIGINT NOT NULL,id_2 BIGINT NOT NULL,other_key BIGINT NOT NULL,val
> >> SMALLINT,CONSTRAINT pk PRIMARY KEY (id_1, id_2,
> >> other_key),SALT_BUCKETS=60,SALT_COLUMN = id_2);Query (unchanged):Select
> >> id_2, sum(val)From relationshipWhere id_1 in (2,3)Group by id_2Explain
> >> (unchanged)Under the proposal, the data are merely partitioned so that
> all
> >> rows containing the same id_2 are on the same regionserver, the above
> query
> >> will perform almost all of the grouping in parallel on the
> regionservers.
> >> No special hint or changes to the query plan would be required to
> benefit.
> >> Tables would need to be re-salted to take advantage of the new
> >> functionality.Technical changes proposed to phoenix: - Create a new
> piece
> >> of table-level metadata: SALT_COLUMN. SALT_COLUMN will instruct the
> salting
> >> logic to generate a salt-byte based only on the specified column. If
> >> unspecified, it will behave as it does today and default to salting the
> >> entire rowkey. This metadata may be specified only when the table is
> >> created and may not be modified. The specified column must be part of
> the
> >> rowkey.  - Modify all callers of getSaltingByte
> >> <
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/SaltingUtil.java#L77
> >(byte[]
> >> value, int offset, int length, int bucketNum) to consistently leverage
> the
> >> new metadata.- Tests- DocsDesign points:One salt column vs multiple salt
> >> columns: Based on the existing signature for getSaltingByte, it seems
> >> simpler to only support a single SALT_COLUMN rather than multiple
> arbitrary
> >> SALT_COLUMNS. Known use-cases are completely supported by a single
> >> column.Syntax:  PHOENIX-4757
> >> <https://issues.apache.org/jira/browse/PHOENIX-4757> suggests an
> alternate,
> >> less verbose syntax for defining the salt bucket. The SALT_COLUMN
> syntax is
> >> suggested for clarity and consistency with other Phoenix table
> >> options.Future Enhancements (not in scope)Different aspects of the query
> >> execution runtime could take advantage of new metadata and implied
> >> knowledge that the data are partitioned in a predictable manner. For
> >> example: - It could be that client side grouping is completely
> unnecessary
> >> in cases where the SALT_COLUMN is part of the group-by expression.- A
> query
> >> that contains a literal equality predicate for the SALT_COLUMN can be
> >> isolated to a single regionserver, rather than broadcast to all
> >> regionservers.- A client-side merge-sort-join based on the SALT_COLUMN
> >> could optimize organization of merges. - Similarly, a server-side hash
> join
> >> could distribute only ‘necessary’ portions of the hash table to each
> >> regionserver.If additional advantages of these types come for free, then
> >> that’s great but can be follow on enhancements from the initial commit.*
> >>
> >> On Thu, Sep 13, 2018 at 9:33 AM, Thomas D'Silva <tdsilva@salesforce.com
> >
> >> wrote:
> >>
> >>> Gerald,
> >>>
> >>> I think you missed Josh's reply here : https://lists.apache.org/thr
> >>> ead.html/c5145461805429622a410c23c1199d578e146a5c94511b2d583
> >>> 3438b@%3Cdev.phoenix.apache.org%3E
> >>>
> >>> Could you explain how using a subset of the pk columns to generate the
> >>> salt byte helps with partitioning, aggregations etc?
> >>>
> >>> Thanks,
> >>> Thomas
> >>>
> >>> On Thu, Sep 13, 2018 at 8:32 AM, Gerald Sangudi <gs...@23andme.com>
> >>> wrote:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> Any thoughts or feedback on this?
> >>>>
> >>>> Thanks,
> >>>> Gerald
> >>>>
> >>>> On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi <gsangudi@23andme.com
> >
> >>>> wrote:
> >>>>
> >>>>> Hello folks,
> >>>>>
> >>>>> We have a requirement for salting based on partial, rather than full,
> >>>>> rowkeys. My colleague Mike Polcari has identified the requirement and
> >>>>> proposed an approach.
> >>>>>
> >>>>> I found an already-open JIRA ticket for the same issue:
> >>>>> https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide
> more
> >>>>> details from the proposal.
> >>>>>
> >>>>> The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas
> Mike
> >>>>> proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... .
> >>>>>
> >>>>> The benefit at issue is that users gain more control over
> partitioning,
> >>>>> and this can be used to push some additional aggregations and hash
> joins
> >>>>> down to region servers.
> >>>>>
> >>>>> I would appreciate any go-ahead / thoughts / guidance / objections /
> >>>>> feedback. I'd like to be sure that the concept at least is not
> >>>>> objectionable. We would like to work on this and submit a patch down
> the
> >>>>> road. I'll also add a note to the JIRA ticket.
> >>>>>
> >>>>> Thanks,
> >>>>> Gerald
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>