Posted to user@kudu.apache.org by Dan Burkert <da...@cloudera.com> on 2017/09/08 17:15:15 UTC

Re: DMP/CDP Profile Store

Hi Ben,

This is certainly an interesting idea.  I think the architecture you laid
out could be successful, especially if the set of attributes is relatively
static. I just have a couple of thoughts:

* Co-locating partitions

Assuming you will be hash partitioning over the base ID on each attribute
table, you could manually move around tablets so that the same hash
partition buckets of each table are colocated on the same tablet servers.
This would give you some amount of locality for looking up the attributes
of a single user.  Kudu doesn't (yet) have this capability built in, but it
could be done manually using the 'kudu' command line tool, and perhaps
scripted to account for moving tablets during failover.  The advantage here
would be biggest when joining across the attribute tables in bulk.  For
single-user lookups it probably wouldn't make much difference.  KUDU-874
<https://issues.apache.org/jira/browse/KUDU-874> includes an interesting
discussion on how Spanner does this.
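
To make the manual approach a bit more concrete, here is a rough planning
sketch in Python. It assumes every attribute table is hash partitioned on the
base ID into the same number of buckets, and that you already know which
tablet servers currently hold each bucket. The function names, table names,
and server names are all hypothetical, and nothing here talks to Kudu; it only
computes which replicas would need to move so that bucket i of every table
lives on the same servers. The actual moves would still be done with the
'kudu' command line tool.

```python
def plan_colocation(tables, num_buckets, tablet_servers, replication=3):
    """Return {(table, bucket): [servers]} so that bucket i of every
    table is assigned to the same list of tablet servers."""
    if replication > len(tablet_servers):
        raise ValueError("not enough tablet servers for the replication factor")
    plan = {}
    for bucket in range(num_buckets):
        # Round-robin: bucket i gets `replication` consecutive servers,
        # starting at position i and wrapping around the server list.
        servers = [tablet_servers[(bucket + r) % len(tablet_servers)]
                   for r in range(replication)]
        for table in tables:
            plan[(table, bucket)] = servers
    return plan


def moves_needed(current, plan):
    """Compare the current placement with the plan and return a list of
    (table, bucket, servers_to_add, servers_to_remove) tuples."""
    moves = []
    for (table, bucket), target in sorted(plan.items()):
        have = set(current.get((table, bucket), []))
        want = set(target)
        if have != want:
            moves.append((table, bucket,
                          sorted(want - have), sorted(have - want)))
    return moves
```

Re-running something like moves_needed after a failover would show which
buckets have drifted apart and need to be moved back together.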

* High-dimensionality attributes

It sounds like in your case the attribute sets will be relatively static,
but if you want to see a design which allows for many sparse attributes,
check out the readme of kudu-ts
<https://github.com/danburkert/kudu-ts/tree/master/core>.  It describes how
it uses a few index tables to attach attributes (tags in the kudu-ts
parlance) to data points.  It may not translate particularly well here,
since the kudu-ts data points are immutable, but I think it's interesting
nonetheless.


Let us know how it goes.

- Dan

On Wed, Aug 30, 2017 at 7:57 AM, Benjamin Kim <bb...@gmail.com> wrote:

> I was wondering whether anyone has worked on a DMP/CDP for storing user and
> customer profiles in Kudu. Each user will have their base IDs (aka identity
> graph), along with statistics based on their attributes and tables for
> these attributes grouped by category.
>
> Please let me know what you think of my thoughts.
>
> I was thinking of creating a base profile table to store the IDs and
> statistics along with unchanging or rarely changing attributes, such as
> name, that do not need to be tracked. Next, I would create tables to
> categorize groups of attributes, such as user information, behaviors,
> geolocation, devices, etc. These attribute tables would have columns for
> each attribute and would track changes by only inserting data, with a
> timestamp column to record when each row was entered. Essentially, I would
> follow the type 2 slowly changing dimension approach used in data
> warehouses. For attributes that expire, we will partition by a time range
> so that we can drop expired data. For attributes where we only need the
> latest one, we would add an active column to easily flag and query them
> after inactivating older versions.
>
> Any comments or advice would be truly appreciated.
>
> Cheers,
> Ben
>
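
For reference, the insert-only, timestamped versioning described in the quoted
message can be sketched in plain Python as below. This is only an
illustration of the type 2 idea, not Kudu client code: the AttributeRow class,
its city column, and both helper functions are hypothetical names. Each change
is a new row keyed by (base_id, entered_at), and the latest or "as of" version
is found by comparing timestamps.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class AttributeRow:
    base_id: int      # user's base ID
    entered_at: int   # insertion timestamp, e.g. epoch seconds
    city: str         # hypothetical attribute column


def latest_per_user(rows: Iterable[AttributeRow]) -> Dict[int, AttributeRow]:
    """Return the most recently entered row for each base ID."""
    latest: Dict[int, AttributeRow] = {}
    for row in rows:
        cur = latest.get(row.base_id)
        if cur is None or row.entered_at > cur.entered_at:
            latest[row.base_id] = row
    return latest


def as_of(rows: Iterable[AttributeRow], base_id: int,
          ts: int) -> Optional[AttributeRow]:
    """Return the version of a user's attributes in effect at time ts,
    or None if no row had been entered by then."""
    candidates = [r for r in rows
                  if r.base_id == base_id and r.entered_at <= ts]
    return max(candidates, key=lambda r: r.entered_at, default=None)
```

In this sketch the latest version falls out of the timestamp alone; an
explicit active column, as proposed in the quoted message, would trade an
extra update on each change for a simpler filter at query time.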