You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Peter Ebert (Code Review)" <ge...@cloudera.org> on 2016/06/02 18:51:26 UTC

[kudu-CR] KUDU-1386 NaN float and double values are not handled correctly

Peter Ebert has posted comments on this change.

Change subject: KUDU-1386 NaN float and double values are not handled correctly
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3142/1/src/kudu/common/types.h
File src/kudu/common/types.h:

Line 53:   NONCOMPARABLE = 12
> > I'm not following, why would you want NaNs to show up in a per-block min/
Hey guys, I was talking to Will about this and wanted to put in unsolicited my two cents if I could.  Feel free to tell me I'm out of my element :)

I like how Impala has treated NaN where any comparison returns false.  From a mathematical standpoint I think we could all agree, this makes the most sense, if you were writing a query to select all values >1 you would really scratch your head when NaNs showed up in your result set (0.0/0.0 > 1).  You would then spend some time looking at the documentation to figure out why they were there, if not just submit a support ticket with Cloudera (like some of my clients did with doubles showing rounding errors XD).  I could really see people/clients wanting NaN to be treated more like NULL.

At the same time Impala has another precedent I like: NULL is considered greater than all other values for sorting purposes. Would it be possible to have our cake and eat it too?  Where we sorted NaN for index purposes as greater than all values, just like Impala NULLs, but never evaluated NaN = NaN as true or returned NaNs where x > 1 (or x < 1)?

I'm not as familiar with the implementation of the indexing or operators and what kind of complexity this would add, I'm sure there would be some, but I think the complexity is worth the cost for being consistent with Impala, IEEE, what I would consider mathematically consistent, and with the behavior average users will expect.

If I were guessing how to implement, for doubles it might be as simple as when someone requests x > 1, that we always add NaN > x > 1.  If impala asks for is_nan either an equivalent operator or x > inf might work.  But I would definitely want to catch the case where 'NaN' = 'NaN' because for most users I presume if they have a query that evaluates to NaN on each side they do not want those values to ever match each other.  Technically letting infinities match is also not quite correct, but it's part of the IEEE standard (as I understand it) so the precedent for this is already out there.

I would also be ok with treating NaN as NULL and disallowing its presence in index columns if that is the 'keep it simple stupid' solution.  Should be pretty rare that NaNs are intentionally used, but I could definitely see them accidentally turning up and confusing clients when it doesn\u2019t sort or compare as expected.

Thanks for reading.  Let me know if you would prefer to chat instead if anything isn't clear.


-- 
To view, visit http://gerrit.cloudera.org:8080/3142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I194dcddeb8eabcc67699661b9cc9362a99f2f4ae
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Peter Ebert
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes