Posted to dev@druid.apache.org by Evan Galpin <ev...@gmail.com> on 2021/05/06 02:37:20 UTC

FlattenSpec for Nested Data With Unknown Array Length

Hi Druid devs!

I’m investigating Druid for an analytical workload and I think it would be
a great fit for the data and use case I have. One thing I’m stuck on right
now is data modelling.

I’ll use a somewhat classic “Blog Post” example to illustrate. Let’s assume
a Blog Entry may have many associated “comments” (in unknown quantity), and
many “reactions” (of unknown types, and also in unknown quantity).

What is the best way to model this? The example flattenSpecs I’ve seen
that show array handling seem to indicate that the size of the array
must be known and constant. Does that rule out the possibility of
modelling the above Blog Entry as a single row for peak performance?
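
For concreteness, the fixed-index style I mean looks roughly like this
(the field names are invented for the example):

```json
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "comment0_author", "expr": "$.comments[0].author" },
      { "type": "path", "name": "comment1_author", "expr": "$.comments[1].author" }
    ]
  }
}
```

Each "path" entry targets a hard-coded array index, which is what makes it
look like the array length has to be known up front.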

One natural way to model the above with an RDBMS would be a table for each
of Blog Entries, Comments, and Reactions, then performing joins as needed.
But am I correct in assuming that joins ought to be avoided?

Thanks in advance,
Evan

Re: FlattenSpec for Nested Data With Unknown Array Length

Posted by Gian Merlino <gi...@apache.org>.
Hey Evan,

Druid's data model doesn't currently have a good way of storing arrays of
objects like this. And you're right that even though joins exist, to get
peak performance you want to avoid them at query time.

In similar situations I have stored data models like this as 3 tables
(entries, comments, reactions) and used 3 techniques to avoid the need for
joins at query time:

1) Store aggregate information about comments and reactions in the entries
table: number of comments, number of each type of reaction, etc. That way,
no join is necessary if you just want to — for example — see the average
number of comments for certain entries. You can do something like "select
avg(num_comments) from entries".
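
As a sketch (column names here are just illustrative), the denormalized
entries table might look like:

```sql
-- Hypothetical "entries" table with aggregates precomputed at
-- ingestion time, so no join is needed at query time:
--   entries(__time, entry_id, entry_username, entry_country,
--           num_comments, num_reactions_like, num_reactions_dislike)

SELECT entry_country, AVG(num_reactions_like)
FROM entries
GROUP BY entry_country
```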

2) Store attributes about the entries in the comments and reactions tables.
That way, no join is necessary if you want to find all comments that match
entries with specific attributes. For example, if you want to get the
number of users that commented on a particular user's entry, you'd do
"select count(distinct comment_username) from comments where entry_username
= 'alice'".

3) Mash up visualizations sourced from different tables in your
presentation layer. The idea is that if all tables have entry attributes
materialized in them, then you can build a dashboard that has one viz based
on comments, one based on entries, etc, each sourced with a different query
that queries just one table. Then, when the user filters on, e.g.,
"entry_country", you can apply that filter to all of the individual queries.
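
For example (column names again illustrative), the dashboard would issue one
single-table query per viz and apply the user's filter to each of them:

```sql
-- Panel 1: comment volume over time, from the comments table
SELECT FLOOR(__time TO DAY) AS "day", COUNT(*) AS num_comments
FROM comments
WHERE entry_country = 'CA'
GROUP BY 1

-- Panel 2: entries published over time, from the entries table
SELECT FLOOR(__time TO DAY) AS "day", COUNT(*) AS num_entries
FROM entries
WHERE entry_country = 'CA'
GROUP BY 1
```

This only works because entry_country is materialized into every table, per
techniques 1 and 2.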

Hope these techniques help in your case too.
