You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Kim Liu <kl...@edgewaternetworks.com> on 2016/03/10 22:07:47 UTC

Using User Defined Functions in UPDATE queries

Hello -
I am experimenting with User Defined Functions in Cassandra (3.3) and I am a bit puzzled by a problem I am having when testing them with cqlsh.  I have tried to find the answers online, but have not had any luck so far.

According to http://cassandra.apache.org/doc/cql3/CQL.html it looks like a UDF (User Defined Function) should be useable in an update:
  UPDATE atable SET col = some_function(?) ...;

Current test environment:
  cassandra@cqlsh:test_space> show version;
  [cqlsh 5.0.1 | Cassandra 3.3 | CQL spec 3.4.0 | Native protocol v4]


However, using a value from the row in question as an argument to the UDF does not seem to work.  Here is how I am testing:
  CREATE KEYSPACE test_space WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
  USE test_space;
  CREATE TABLE test_table (idx text, data int, PRIMARY KEY( idx ));
  INSERT INTO test_table(idx,data) VALUES( 'abc', 1 );
  CREATE FUNCTION max_int( a int, b int ) CALLED ON NULL INPUT RETURNS int LANGUAGE java AS 'return (a == null || b == null) ? 0 : ( a > b ? a : b);';
  SELECT * FROM test_table;
  UPDATE test_table SET data=max_int(3,4) WHERE idx='abc’;  ---- This works
  SELECT * FROM test_table;
  UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’; ---- This does not work

Results:
  SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:39 no viable alternative at input ',' (UPDATE test_table SET data=max_int([data],...)">



So I am wondering if I am missing something here - or if this is just a problem with testing with cqlsh.  If a UDF is not allowed to be passed a value from the row that is being updated, then what exactly is the use case for having a UDF in an UPDATE?  If all arguments to the UDF have to be supplied by the client, then the client might as well perform the function.  Using a UDF in an UPDATE would seem to make the most sense only if the row data could be accessed, since it could be used to bypass the need for a SELECT by a client to read the values and perform the operation on them before doing the UPDATE.


Thank you,
Kim Liu

--------
Kim Liu
Sr. Software Engineer
kliu@edgewaternetworks.com
_______
“Nothing in the world is more dangerous than sincere ignorance and conscientious stupidity.” -- Martin Luther King Jr.





Re: Using User Defined Functions in UPDATE queries

Posted by Kim Liu <kl...@edgewaternetworks.com>.
Thank you for the clarification.
—Kim

From: Sylvain Lebresne <sy...@datastax.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Friday, March 11, 2016 at 10:05
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Using User Defined Functions in UPDATE queries

On Fri, Mar 11, 2016 at 5:09 PM, Kim Liu <kl...@edgewaternetworks.com>> wrote:
Just for sake of clarification, then, what is the use-case for having UDFs in an UPDATE?

Honestly, it's merely there for convenience when you use things like cqlsh for instance.


Re: Using User Defined Functions in UPDATE queries

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Fri, Mar 11, 2016 at 5:09 PM, Kim Liu <kl...@edgewaternetworks.com> wrote:

> Just for sake of clarification, then, what is the use-case for having UDFs
> in an UPDATE?
>

Honestly, it's merely there for convenience when you use things like cqlsh
for instance.


>
> If they cannot read data from the data store, then all of the parameters
> to the UDF must be supplied by the client, correct?
>

Correct (for UPDATE at least).


>
> If the client has all the parameters, the client could perform the
> equivalent of the UDF on the client side, first, then send the results to
> the server, instead of pushing the computation work onto the server.
>

Absolutely.


>  So I am curious as to what one is supposed to use a UDF in an UPDATE for.
>

Again, mainly a convenience.

The main end goal for UDF is for use in SELECT. It's already potentially
slightly useful to save server->client bandwidth as you could imagine to do:
  SELECT compute_md5(image) FROM images WHERE ...;
(assuming of course that trading cpu server side for bandwidth is a good
idea)

Though their most useful use will probably be in WHERE clause, for things
like:
  SELECT * FROM foo WHERE sqrt(x) = 3;
and we also plan to have functional indexes to go with that.
However, I'll not right away that those last use case are not yet
supported, but they will be eventually. Adding UDF was more of the first
incremental step, but their most interesting use case is arguably not yet
supported. But as far as UPDATE is concerned, we'll probably never support
them as I said since that would require a read-before-write (except
possibly for LWT which do that read-before-write (at great cost) anyway,
we'll see).


>
>
>
> Long-winded explanation of the use-case I was poking at using UPDATE UDFs
> for below for the morbidly curious.
>
>
>
>
> That morbidly curious, huh?
>
> The scenario is, roughly, that the application receives a set of data
> which is broken up over, say, four messages (A,B,C,D).  However, the
> messages can arrive in any order, possibly with duplicates, and the data
> set is not complete until the all four messages are received.  There are
> multiple message receivers in order to scale to the volume of messages
> coming in, so each of the four messages per data set could arrive at any
> receiver (in any chronological pattern), and each receiving station would
> then insert the partial data into Cassandra.
>
> I looked at the Cassandra SET implementation, thinking that I could just
> add ‘A’, ‘B’, ‘C’, ‘D’ (or 1,2,3,4) to a set with a secondary index.  Then
> periodically search for where the set had all elements to spot which rows
> had a complete data set ready for processing.  However, there does not
> appear to be an equality check for SETs.  (Adding elements to a set is
> another place where UPDATE appears to allow for the “x = x <operator>
> <data>” pattern which added to my confusion about using a UDF in the
> UPDATE.)
>
> So instead of using sets, the idea was to have a UDF perform a bit-wise OR
> operation.  Roughly:
>   CREATE FUNCTION bitwise_or( a int, b int ) CALLED ON NULL INPUT RETURNS
> int LANGUAGE java AS 'return Integer.valueOf((a == null ? 0 : a)|(b == null
> ? 0 : b));';
>
> Then as each message segment came in, I had intended, roughly:
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,2),
> data2=… ;
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,1),
> data1=… ;
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,8),
> data4=… ;
>   UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,4),
> data3=… ;
>
> Then, with a secondary index on ‘messageComplete’, periodically scrape out
> all rows where messageComplete was equal to 15.  (At most, sixteen unique
> values in the secondary index.)  (And use a TTL to expire messages that did
> not eventually complete, etc.  Boilerplate infrastructure, etc.)
>
> This was based upon my incorrect assumption about UPDATE UDFs, since this
> looked like an optimal way to avoid having all the clients perform
> read-updates patterns and worrying about the clients stepping on each
> others data, as well as handling cases where duplicate messages were
> received by different receivers.  So it’s starting to look like I might
> need to use something else to perform the correlation between messages.
>
> —Kim
>
> From: Sylvain Lebresne <sy...@datastax.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Friday, March 11, 2016 at 00:35
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Using User Defined Functions in UPDATE queries
>
> UDF are usable in UPDATE statement as actually trying them shows, it's
> just the documented grammar that needs fixing.
>
> But as far as doing something like:
>   UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’;
> this is indeed *not* supported and likely never will. One big pillar of C*
> design is that normal writes like this don't do a read-before-write, both
> for performance and because of consistency constraints, so we can't have
> update depend on the previous value in any way.
> I'll note that maybe that make UDF useless for you and if so, I'm sorry,
> but you just can't use UDF in C* for that and you'd have to do a manual
> read-before-write client side to achieve this.
>
> For the sake of avoiding confusion, I will not that we do allow:
>   UPDATE test_table SET c = c + 1 WHERE idx='abc';
> if c is a counter, but that's a very special case. Counters have a
> completely separate path and implementation and do have a read-before-write
> (and are slower than normal update as a result).
>
>
>

Re: Using User Defined Functions in UPDATE queries

Posted by Kim Liu <kl...@edgewaternetworks.com>.
Just for sake of clarification, then, what is the use-case for having UDFs in an UPDATE?

If they cannot read data from the data store, then all of the parameters to the UDF must be supplied by the client, correct?

If the client has all the parameters, the client could perform the equivalent of the UDF on the client side, first, then send the results to the server, instead of pushing the computation work onto the server.  So I am curious as to what one is supposed to use a UDF in an UPDATE for.



Long-winded explanation of the use-case I was poking at using UPDATE UDFs for below for the morbidly curious.




That morbidly curious, huh?

The scenario is, roughly, that the application receives a set of data which is broken up over, say, four messages (A,B,C,D).  However, the messages can arrive in any order, possibly with duplicates, and the data set is not complete until the all four messages are received.  There are multiple message receivers in order to scale to the volume of messages coming in, so each of the four messages per data set could arrive at any receiver (in any chronological pattern), and each receiving station would then insert the partial data into Cassandra.

I looked at the Cassandra SET implementation, thinking that I could just add ‘A’, ‘B’, ‘C’, ‘D’ (or 1,2,3,4) to a set with a secondary index.  Then periodically search for where the set had all elements to spot which rows had a complete data set ready for processing.  However, there does not appear to be an equality check for SETs.  (Adding elements to a set is another place where UPDATE appears to allow for the “x = x <operator> <data>” pattern which added to my confusion about using a UDF in the UPDATE.)

So instead of using sets, the idea was to have a UDF perform a bit-wise OR operation.  Roughly:
  CREATE FUNCTION bitwise_or( a int, b int ) CALLED ON NULL INPUT RETURNS int LANGUAGE java AS 'return Integer.valueOf((a == null ? 0 : a)|(b == null ? 0 : b));';

Then as each message segment came in, I had intended, roughly:
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,2), data2=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,1), data1=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,8), data4=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,4), data3=… ;

Then, with a secondary index on ‘messageComplete’, periodically scrape out all rows where messageComplete was equal to 15.  (At most, sixteen unique values in the secondary index.)  (And use a TTL to expire messages that did not eventually complete, etc.  Boilerplate infrastructure, etc.)

This was based upon my incorrect assumption about UPDATE UDFs, since this looked like an optimal way to avoid having all the clients perform read-updates patterns and worrying about the clients stepping on each others data, as well as handling cases where duplicate messages were received by different receivers.  So it’s starting to look like I might need to use something else to perform the correlation between messages.

—Kim

From: Sylvain Lebresne <sy...@datastax.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Friday, March 11, 2016 at 00:35
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Using User Defined Functions in UPDATE queries

UDF are usable in UPDATE statement as actually trying them shows, it's just the documented grammar that needs fixing.

But as far as doing something like:
  UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’;
this is indeed *not* supported and likely never will. One big pillar of C* design is that normal writes like this don't do a read-before-write, both for performance and because of consistency constraints, so we can't have update depend on the previous value in any way.
I'll note that maybe that make UDF useless for you and if so, I'm sorry, but you just can't use UDF in C* for that and you'd have to do a manual read-before-write client side to achieve this.

For the sake of avoiding confusion, I will not that we do allow:
  UPDATE test_table SET c = c + 1 WHERE idx='abc';
if c is a counter, but that's a very special case. Counters have a completely separate path and implementation and do have a read-before-write (and are slower than normal update as a result).



Re: Using User Defined Functions in UPDATE queries

Posted by Sylvain Lebresne <sy...@datastax.com>.
UDF are usable in UPDATE statement as actually trying them shows, it's just
the documented grammar that needs fixing.

But as far as doing something like:
  UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’;
this is indeed *not* supported and likely never will. One big pillar of C*
design is that normal writes like this don't do a read-before-write, both
for performance and because of consistency constraints, so we can't have
update depend on the previous value in any way.
I'll note that maybe that make UDF useless for you and if so, I'm sorry,
but you just can't use UDF in C* for that and you'd have to do a manual
read-before-write client side to achieve this.

For the sake of avoiding confusion, I will not that we do allow:
  UPDATE test_table SET c = c + 1 WHERE idx='abc';
if c is a counter, but that's a very special case. Counters have a
completely separate path and implementation and do have a read-before-write
(and are slower than normal update as a result).


On Thu, Mar 10, 2016 at 11:11 PM, Kim Liu <kl...@edgewaternetworks.com>
wrote:

> It does sounds like the use of UDF in UPDATE is in an ambiguous state at
> the moment, then.  The document grammar says they can’t be used, but the
> document examples say they can, and the server will execute them, but it
> can’t execute them in a useful way (i.e. no row supplied data.)
>
> So essentially not useable at the moment, regardless of intent.
>
> Thanks,
> —Kim
>
> From: DuyHai Doan <do...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, March 10, 2016 at 14:03
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Using User Defined Functions in UPDATE queries
>
> Surely an error because the grammar definition for UPDATE does not mention
> any function call:
> [SNIPPED]
>
> Unless the grammar in the doc is itself not up-to-date....
>
>>
>>>

Re: Using User Defined Functions in UPDATE queries

Posted by Kim Liu <kl...@edgewaternetworks.com>.
It does sounds like the use of UDF in UPDATE is in an ambiguous state at the moment, then.  The document grammar says they can’t be used, but the document examples say they can, and the server will execute them, but it can’t execute them in a useful way (i.e. no row supplied data.)

So essentially not useable at the moment, regardless of intent.

Thanks,
—Kim

From: DuyHai Doan <do...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, March 10, 2016 at 14:03
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Using User Defined Functions in UPDATE queries


Surely an error because the grammar definition for UPDATE does not mention any function call:

[SNIPPED]

Unless the grammar in the doc is itself not up-to-date....


Re: Using User Defined Functions in UPDATE queries

Posted by DuyHai Doan <do...@gmail.com>.
Surely an error because the grammar definition for UPDATE does not mention
any function call:

<update-stmt> ::= UPDATE <tablename> ( USING <option> ( AND <option> )* )?
SET <assignment> ( ',' <assignment> )*
WHERE <where-clause> ( IF <condition> ( AND condition )* )?

<assignment> ::= <identifier> '=' <term> |
<identifier> '=' <identifier> ('+' | '-') (<int-term> | <set-literal> |
<list-literal>) |
<identifier> '=' <identifier> '+' <map-literal> |
<identifier> '[' <term> ']' '=' <term>

<condition> ::= <identifier> '=' <term> | <identifier> '[' <term> ']' '='
<term>

<where-clause> ::= <relation> ( AND <relation> )*

<relation> ::= <identifier> '=' <term> | <identifier> IN '(' ( <term> ( ','
<term> )* )? ')' | <identifier> IN '?'

<option> ::= TIMESTAMP <integer> | TTL <integer>

Unless the grammar in the doc is itself not up-to-date....
Le 10 mars 2016 22:31, "Kim Liu" <kl...@edgewaternetworks.com> a écrit :

> Um, I’m not entirely sure how I misread it, since this was copy-pasted
> from the document:
>   UPDATE atable SET col = some_function(?) …;
>
> So the document examples certainly seem to support the use of UDF in
> UPDATE.  I suppose the document may be more erroneous in its writing than
> I in its misreading.
>
> Additionally, this statement works in cqlsh (presuming max_int() is a UDF):
>   UPDATE test_table SET data=max_int(3,4) WHERE idx='abc’;
>
> So, if the grammar is not supposed to allow this, then there is a bug
> somewhere because in 3.3 it certainly seems to be parsed and executed
> without complaint.
>
> —Kim
>
>
> From: DuyHai Doan <do...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, March 10, 2016 at 13:21
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Using User Defined Functions in UPDATE queries
>
> You have misread the CQL doc given in the link. According to CQL update
> grammar it's not possible to use UDF. I see UDF only allowed in select
> clause...
>
>>
>>

Re: Using User Defined Functions in UPDATE queries

Posted by Kim Liu <kl...@edgewaternetworks.com>.
Um, I’m not entirely sure how I misread it, since this was copy-pasted from the document:
  UPDATE atable SET col = some_function(?) …;

So the document examples certainly seem to support the use of UDF in UPDATE.  I suppose the document may be more erroneous in its writing than I in its misreading.

Additionally, this statement works in cqlsh (presuming max_int() is a UDF):
  UPDATE test_table SET data=max_int(3,4) WHERE idx='abc’;

So, if the grammar is not supposed to allow this, then there is a bug somewhere because in 3.3 it certainly seems to be parsed and executed without complaint.

—Kim


From: DuyHai Doan <do...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, March 10, 2016 at 13:21
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Using User Defined Functions in UPDATE queries


You have misread the CQL doc given in the link. According to CQL update grammar it's not possible to use UDF. I see UDF only allowed in select clause...


Re: Using User Defined Functions in UPDATE queries

Posted by DuyHai Doan <do...@gmail.com>.
You have misread the CQL doc given in the link. According to CQL update
grammar it's not possible to use UDF. I see UDF only allowed in select
clause...
Le 10 mars 2016 22:07, "Kim Liu" <kl...@edgewaternetworks.com> a écrit :

> Hello -
> I am experimenting with User Defined Functions in Cassandra (3.3) and I am
> a bit puzzled by a problem I am having when testing them with cqlsh.  I
> have tried to find the answers online, but have not had any luck so far.
>
> According to http://cassandra.apache.org/doc/cql3/CQL.html it looks like
> a UDF (User Defined Function) should be useable in an update:
>   UPDATE atable SET col = some_function(?) ...;
>
> Current test environment:
>   cassandra@cqlsh:test_space> show version;
>   [cqlsh 5.0.1 | Cassandra 3.3 | CQL spec 3.4.0 | Native protocol v4]
>
>
> However, using a value from the row in question as an argument to the UDF
> does not seem to work.  Here is how I am testing:
>   CREATE KEYSPACE test_space WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '1'};
>   USE test_space;
>   CREATE TABLE test_table (idx text, data int, PRIMARY KEY( idx ));
>   INSERT INTO test_table(idx,data) VALUES( 'abc', 1 );
>   CREATE FUNCTION max_int( a int, b int ) CALLED ON NULL INPUT RETURNS int
> LANGUAGE java AS 'return (a == null || b == null) ? 0 : ( a > b ? a : b);';
>   SELECT * FROM test_table;
>   UPDATE test_table SET data=max_int(3,4) WHERE idx='abc’;  ---- This works
>   SELECT * FROM test_table;
>   UPDATE test_table SET data=max_int(data,5) WHERE idx='abc’; ---- This
> does not work
>
> Results:
>   SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query]
> message="line 1:39 no viable alternative at input ',' (UPDATE test_table
> SET data=max_int([data],...)">
>
>
>
> So I am wondering if I am missing something here - or if this is just a
> problem with testing with cqlsh.  If a UDF is not allowed to be passed a
> value from the row that is being updated, then what exactly is the use case
> for having a UDF in an UPDATE?  If all arguments to the UDF have to be
> supplied by the client, then the client might as well perform the
> function.  Using a UDF in an UPDATE would seem to make the most sense only
> if the row data could be accessed, since it could be used to bypass the
> need for a SELECT by a client to read the values and perform the operation
> on them before doing the UPDATE.
>
>
> Thank you,
> Kim Liu
>
> --------
> Kim Liu
> Sr. Software Engineer
> kliu@edgewaternetworks.com
> _______
> “Nothing in the world is more dangerous than sincere ignorance and
> conscientious stupidity.” -- Martin Luther King Jr.
>
>
>
>
>