You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Suresh Mahawar <su...@technocube.in> on 2015/08/05 12:09:16 UTC

Cassandra: UDF

Hi,

I need your help. I have a query which get top 5 records group by date (not
date + time) and sum of amount.

I wrote the following but it returns all the records not just top 5 records

CREATE OR REPLACE FUNCTION state_groupbyandsum( state map<text, double>,
datetime text, amount text )
CALLED ON NULL INPUT
RETURNS map<text, double>
LANGUAGE java
AS 'String date = datetime.substring(0,10); Double count = (Double)
state.get(date);  if (count == null) count = Double.parseDouble(amount);
else count = count +  Double.parseDouble(amount); state.put(date, count);
return state;' ;


CREATE OR REPLACE AGGREGATE groupbyandsum(text, text)
SFUNC state_groupbyandsum
STYPE map<text, double>
INITCOND {};

select groupbyandsum(datetime, amout) from warehouse;

Could you please help out to get just 5 records.


Thanks & Regards,
Suresh Mahawar
TechnoCube
Find Me on Linkedin <https://www.linkedin.com/pub/suresh-mahawar/2a/b9/a80>

Re: Cassandra: UDF

Posted by Robert Stupp <sn...@snazy.de>.
Suresh,

tip: you can use alternative (”pg-style”) string delimiters, which can span over multiple lines and makes the CQL statement much nicer:
CREATE OR REPLACE FUNCTION state_groupbyandsum (
	state map<text, double>, datetime text, amount text )
CALLED ON NULL INPUT
RETURNS map<text, double>
LANGUAGE java 
AS $$
	String date = datetime.substring(0,10);
	Double count = (Double) state.get(date);
...
	return state;
$$ ;

UDAs are best suited for queries against a single partition - not against a possibly really huge table.
This is nothing special for UDAs as you should always code your queries to hit a single partition.

User defined aggregates are not meant to do the job of (or even replace) an analytics framework like Apache Spark.
Frankly, Top-K-queries over a big data set are best suited for Spark using the Cassandra-Spark-Connector.

In your case: imagine your query returns 1B rows - all that information must be held in the map in the Java heap of the coordinator (the node that runs the UDA).

You can do Top-K query with UDAs over the whole table - and rely on the fact that rows passed to the state function are grouped by their partition key (assuming that ‘datetime’ is in your partition key) AND kicking datetime values out of your state-map that do not match the Top-K criteria.
BUT: I do NOT recommend to do that upon user request - instead in a batch job and pipe the result in another table for fast read access.

Robert


> On 05 Aug 2015, at 12:09, Suresh Mahawar <su...@technocube.in> wrote:
> 
> Hi,
> 
> I need your help. I have a query which get top 5 records group by date (not date + time) and sum of amount.
> 
> I wrote the following but it returns all the records not just top 5 records
> 
> CREATE OR REPLACE FUNCTION state_groupbyandsum( state map<text, double>, datetime text, amount text )
> CALLED ON NULL INPUT
> RETURNS map<text, double>
> LANGUAGE java 
> AS 'String date = datetime.substring(0,10); Double count = (Double) state.get(date);  if (count == null) count = Double.parseDouble(amount); else count = count +  Double.parseDouble(amount); state.put(date, count); return state;' ;
> 
> 
> CREATE OR REPLACE AGGREGATE groupbyandsum(text, text) 
> SFUNC state_groupbyandsum
> STYPE map<text, double>
> INITCOND {};
> 
> select groupbyandsum(datetime, amout) from warehouse;
> 
> Could you please help out to get just 5 records.
> 
> 
> Thanks & Regards,
> Suresh Mahawar
> TechnoCube
> Find Me on Linkedin <https://www.linkedin.com/pub/suresh-mahawar/2a/b9/a80>
—
Robert Stupp
@snazy