You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2015/01/07 12:03:34 UTC

[jira] [Comment Edited] (CASSANDRA-8374) Better support of null for UDF

    [ https://issues.apache.org/jira/browse/CASSANDRA-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267515#comment-14267515 ] 

Sylvain Lebresne edited comment on CASSANDRA-8374 at 1/7/15 11:03 AM:
----------------------------------------------------------------------

bq. 99% of the time nobody will notice, 1% of the time this will cause hours of head scratching.

Mostly, I don't buy that the alternatives suggested will avoid this. The potential problem you're talking about (and correct me if I'm wrong) is the case where someone does {{UPDATE ... SET v = fct\(?\) ...}} and has a bug in it's code that makes it pass null for the bind marker while it shouldn't. And I *don't* disagree that finding such bug is made harder by the function silently returning {{null}} in that case. But I disagree that any choice we make for the default we're discussion will change that fact.  Because:
# whatever that default is, most function *will* end up returning null on null anyway. Because as I've argued at length already, while doing so has the inconvenience described above, the only other concrete alternative for 99% of functions would be to throw an exception, and that option would make the function unusable in select clause, which is, imo, just not ok (and I've seen no argument offered to the contrary so far). Again, that's the reasoning that has made us return null on null for all our existing functions, and I don't see why any future hard-coded functions would do differently.
# the potential head scratching is due to the ultimate behavior of the function of returning null on null (which again is the less evil solution in practice). It is not due to what the default at creating time is. You might argue that forcing the user to choose the behavior at creation time will help it be aware of said behavior and that awareness will reduce the time of head scratching. But I don't think that argument stands terribly well in practice because it's assuming that the user scratching it's head is the one that has defined the function in the first place. But this won't be the case for "standard" (hard-coded) functions, which, provided we add a reasonably good standard library of functions (which we should do soonish as there is no point in having every use reinvent the wheel), might just be the most often used functions. And even for UDF, there is no reason for this to be the case in general for any organisation with more than one developer.

So basically, I agree that we should try to make people generally aware that most function returns null on null so they can more easily find the problem described above if they run into it, but I'm just not convinced that forcing the choice of behavior at function creation time (for the sake of education since again 99% of the time people would choose {{RETURNS NULL ON NULL INPUT}} for the reasons discussed above) is a very good way to create that awareness (because that doesn't help for standard functions). And on the flip side, forcing the choice will be annoying every time you create a UDF (and aren't defaults exactly made to reduce annoyance when you know that one of the option will be the right choice 99% of the time?).

Anyway, I continue to think that {{RETURNS NULL ON NULL}} is likely the right default. I've tried to explain my reasoning as clearly as I can and I don't think I can do any better. If the majority still disagrees, so be it (though I'll admit being fuzzy on the actual counter-arguments to my reasoning and would certainly love to understand them better). For what it's worth, if we don't go with {{RETURNS NULL ON NULL}}, I think I prefer forcing the choice of behavior explicitly because at least that might somehow help create that awareness of the actual behavior (even though I've explained why I don't find it a very good argument).  The only argument for {{CALLED ON NULL INPUT}} as default I've seen is that it's this way in other DBs, but it's not an argument in itself in my book if we can't come with a good reasoning why it's a good default, and I haven't really seen one.




was (Author: slebresne):
bq. 99% of the time nobody will notice, 1% of the time this will cause hours of head scratching.

Mostly, I don't buy that the alternatives suggested will avoid this. The potential problem you're talking about (and correct me if I'm wrong) is the case where someone does {{UPDATE ... SET v = fct(?) ...}} and has a bug in it's code that makes it pass null for the bind marker while it shouldn't. And I *don't* disagree that finding such bug is made harder by the function silently returning {{null}} in that case. But I disagree that any choice we make for the default we're discussion will change that fact.  Because:
1. whatever that default is, most function *will* end up returning null on null anyway. Because as I've argued at length already, while doing so has the inconvenience described above, the only other concrete alternative for 99% of functions would be to throw an exception, and that option would make the function unusable in select clause, which is, imo, just not ok (and I've seen no argument offered to the contrary so far). Again, that's the reasoning that has made us return null on null for all our existing functions, and I don't see why any future hard-coded functions would do differently.
2. the potential head scratching is due to the ultimate behavior of the function of returning null on null (which again is the less evil solution in practice). It is not due to what the default at creating time is. You might argue that forcing the user to choose the behavior at creation time will help it be aware of said behavior and that awareness will reduce the time of head scratching. But I don't think that argument stands terribly well in practice because it's assuming that the user scratching it's head is the one that has defined the function in the first place. But this won't be the case for "standard" (hard-coded) functions, which, provided we add a reasonably good standard library of functions (which we should do soonish as there is no point in having every use reinvent the wheel), might just be the most often used functions. And even for UDF, there is no reason for this to be the case in general for any organisation with more than one developer.

So basically, I agree that we should try to make people generally aware that most function returns null on null so they can more easily find the problem described above if they run into it, but I'm just not convinced that forcing the choice of behavior at function creation time (for the sake of education since again 99% of the time people would choose {{RETURNS NULL ON NULL INPUT}} for the reasons discussed above) is a very good way to create that awareness (because that doesn't help for standard functions). And on the flip side, forcing the choice will be annoying every time you create a UDF (and aren't defaults exactly made to reduce annoyance when you know that one of the option will be the right choice 99% of the time?).

Anyway, I continue to think that {{RETURNS NULL ON NULL}} is likely the right default. I've tried to explain my reasoning as clearly as I can and I don't think I can do any better. If the majority still disagrees, so be it (though I'll admit being fuzzy on the actual counter-arguments to my reasoning and would certainly love to understand them better). For what it's worth, if we don't go with {{RETURNS NULL ON NULL}}, I think I prefer forcing the choice of behavior explicitly because at least that might somehow help create that awareness of the actual behavior (even though I've explained why I don't find it a very good argument).  The only argument for {{CALLED ON NULL INPUT}} as default I've seen is that it's this way in other DBs, but it's not an argument in itself in my book if we can't come with a good reasoning why it's a good default, and I haven't really seen one.



> Better support of null for UDF
> ------------------------------
>
>                 Key: CASSANDRA-8374
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8374
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
>            Assignee: Robert Stupp
>             Fix For: 3.0
>
>         Attachments: 8473-1.txt, 8473-2.txt
>
>
> Currently, every function needs to deal with it's argument potentially being {{null}}. There is very many case where that's just annoying, users should be able to define a function like:
> {noformat}
> CREATE FUNCTION addTwo(val int) RETURNS int LANGUAGE JAVA AS 'return val + 2;'
> {noformat}
> without having this crashing as soon as a column it's applied to doesn't a value for some rows (I'll note that this definition apparently cannot be compiled currently, which should be looked into).  
> In fact, I think that by default methods shouldn't have to care about {{null}} values: if the value is {{null}}, we should not call the method at all and return {{null}}. There is still methods that may explicitely want to handle {{null}} (to return a default value for instance), so maybe we can add an {{ALLOW NULLS}} to the creation syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)