You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Joseph Wonesh <jo...@sticknfind.com> on 2018/10/23 14:46:15 UTC

Aggregation of Set Data Type

Hello all,

 I am trying to aggregate rows which each contain a column of Set<BIGINT>.
I would like the result to contain the sum of all sets, where null would be
equivalent to the empty set. I expected a query like: "select
sum(my_set_column) from my_table group by my_key_column" to do this, but
the set type is not supported by this aggregate. Does anyone know of a way
to aggregate this using existing cassandra built-ins? Thanks!

-- 
This message is private and confidential. If you have received message in 
error, please notify us and remove from your system.

Re: Aggregation of Set Data Type

Posted by Joseph Wonesh <jo...@sticknfind.com>.

I was able to make this work using the following UDF/UDA:

CREATE OR REPLACE FUNCTION agg_set_func(state tuple<int, set<bigint>>, val
set<bigint>) CALLED ON NULL INPUT RETURNS tuple<int, set<bigint>> LANGUAGE
java AS
$$
    if (val == null) {
    return state;
    }
    Set<Long> s = state.getSet(1, Long.class);
    s.addAll(val);
    state.setSet(1, s);
return state;
$$;

CREATE OR REPLACE FUNCTION agg_set_func_final(state tuple<int,
set<bigint>>) CALLED ON NULL INPUT RETURNS set<bigint> LANGUAGE java AS
$$
    return state.getSet(1, Long.class);
$$;

CREATE AGGREGATE agg_set(set<bigint>)
SFUNC agg_set_func
STYPE tuple<int, set<bigint>>
FINALFUNC agg_set_func_final
INITCOND (0,{});

On Thu, Oct 25, 2018 at 3:55 PM Joseph Wonesh <jo...@sticknfind.com>
wrote:

> Thank you for your reply. I actually found your blog post regarding this
> topic and browsed through it, but it did not yield the answer I was looking
> for. In fact, it seems impossible to do what I wish to do without defining
> a UDA for this specific use case -- something that is not practical to do
> when all of my queries use 'group by'.
>
> For example, I ave a query like this:
>
> select sum(a), avg(a), min(a), max(a), MY_UDF(my_set_column) from my_table
> group by a;
>
> I would hope that using a UDF for my_set_column would allow me to combine
> all of the my_set_columns passed in via group by, but I cannot pass state
> to the UDF. A UDA can accept state, but that would require me rewriting the
> whole query to be:
>
> select MY_UDA(a, my_set_column) from my_table;
>
> Additionally, I would need a separate UDA for each of the different group
> by clauses. Is there no way around this? I would really like to be able to
> simply add a data column of type set<bigint> and then get all of the unique
> members in this set across an aggregation.
>
> On Tue, Oct 23, 2018 at 1:44 PM DuyHai Doan <do...@gmail.com> wrote:
>
>> You will need to use user defined aggregates for this
>>
>> Le 23 oct. 2018 16:46, "Joseph Wonesh" <jo...@sticknfind.com> a
>> écrit :
>>
>>> Hello all,
>>>
>>>  I am trying to aggregate rows which each contain a column of
>>> Set<BIGINT>. I would like the result to contain the sum of all sets, where
>>> null would be equivalent to the empty set. I expected a query like: "select
>>> sum(my_set_column) from my_table group by my_key_column" to do this, but
>>> the set type is not supported by this aggregate. Does anyone know of a way
>>> to aggregate this using existing cassandra built-ins? Thanks!
>>>
>>> This message is private and confidential. If you have received message
>>> in error, please notify us and remove from your system.
>>
>>

-- 
This message is private and confidential. If you have received message in 
error, please notify us and remove from your system.

Re: Aggregation of Set Data Type

Posted by Joseph Wonesh <jo...@sticknfind.com>.

Thank you for your reply. I actually found your blog post regarding this
topic and browsed through it, but it did not yield the answer I was looking
for. In fact, it seems impossible to do what I wish to do without defining
a UDA for this specific use case -- something that is not practical to do
when all of my queries use 'group by'.

For example, I ave a query like this:

select sum(a), avg(a), min(a), max(a), MY_UDF(my_set_column) from my_table
group by a;

I would hope that using a UDF for my_set_column would allow me to combine
all of the my_set_columns passed in via group by, but I cannot pass state
to the UDF. A UDA can accept state, but that would require me rewriting the
whole query to be:

select MY_UDA(a, my_set_column) from my_table;

Additionally, I would need a separate UDA for each of the different group
by clauses. Is there no way around this? I would really like to be able to
simply add a data column of type set<bigint> and then get all of the unique
members in this set across an aggregation.

On Tue, Oct 23, 2018 at 1:44 PM DuyHai Doan <do...@gmail.com> wrote:

> You will need to use user defined aggregates for this
>
> Le 23 oct. 2018 16:46, "Joseph Wonesh" <jo...@sticknfind.com> a
> écrit :
>
>> Hello all,
>>
>>  I am trying to aggregate rows which each contain a column of
>> Set<BIGINT>. I would like the result to contain the sum of all sets, where
>> null would be equivalent to the empty set. I expected a query like: "select
>> sum(my_set_column) from my_table group by my_key_column" to do this, but
>> the set type is not supported by this aggregate. Does anyone know of a way
>> to aggregate this using existing cassandra built-ins? Thanks!
>>
>> This message is private and confidential. If you have received message in
>> error, please notify us and remove from your system.
>
>

-- 
This message is private and confidential. If you have received message in 
error, please notify us and remove from your system.

Re: Aggregation of Set Data Type

Posted by DuyHai Doan <do...@gmail.com>.

You will need to use user defined aggregates for this

Le 23 oct. 2018 16:46, "Joseph Wonesh" <jo...@sticknfind.com> a
écrit :

> Hello all,
>
>  I am trying to aggregate rows which each contain a column of Set<BIGINT>.
> I would like the result to contain the sum of all sets, where null would be
> equivalent to the empty set. I expected a query like: "select
> sum(my_set_column) from my_table group by my_key_column" to do this, but
> the set type is not supported by this aggregate. Does anyone know of a way
> to aggregate this using existing cassandra built-ins? Thanks!
>
> This message is private and confidential. If you have received message in
> error, please notify us and remove from your system.