You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Peter Attardo (JIRA)" <ji...@apache.org> on 2017/02/10 18:51:41 UTC
[jira] [Updated] (HIVE-15876) Add the ability to define composed functions

     [ https://issues.apache.org/jira/browse/HIVE-15876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Attardo updated HIVE-15876:
---------------------------------
    Description: 
I'm not entirely sure the solution implied by the subject is the best one, but it's a useful shorthand for addressing the problem I'm seeing.

I have a use case for wanting to see if a given unix timestamp (with microseconds) falls within a date range expressed by a start_date and an end_date in 'yyyy-MM-dd' format. Without any UDFs, it would look something like this:

...
WHERE
time >= unix_timestamp('${start_date}','yyyy-MM-dd')*1000
AND time < unix_timestamp(date_add('${end_date}', 1),'yyyy-MM-dd')*1000

This condition is obviously quite a pain to read and write, and one that's easy to get wrong when trying to reproduce. I would like to simplify to something like:

...
WHERE time_in_range(time, '${start_date}','${end_date}')

I was able to write a UDF for the above relatively easily, but when testing it, it performed notably worse than the first example. The reason for that quickly became clear. Even though both functions are deterministic, that is largely irrelevant in the latter example, because the 'time' variable is not static. The query optimizer can do nothing with it, and must do the full function evaluation on each row. Whereas in the first example the optimizer can see that the "sub-functions" of unix_timestamp and date_add are both deterministic and have static inputs, and will only evaluate them once for the whole query.

I would like a way to define a single function that maintains the ability for the optimizer to see which of its constituent parts only need be evaluated once. Whether that is some syntax in the CREATE FUNCTION DDL or some annotations within scala; either would be incredibly useful.

  was:
I'm not entirely sure the solution implied by the subject is the best one, but it's a useful shorthand for addressing the problem I'm seeing.

I have a use case for wanting to see if a given unix timestamp (with milliseconds) falls within a date range expressed by a start_date and an end_date in 'yyyy-MM-dd' format. Without any UDFs, it would look something like this:

...
WHERE
time >= unix_timestamp('${start_date}','yyyy-MM-dd')*1000
AND time < unix_timestamp(date_add('${end_date}', 1),'yyyy-MM-dd')*1000

This condition is obviously quite a pain to read and write, and one that's easy to get wrong when trying to reproduce. I would like to simplify to something like:

...
WHERE time_in_range(time, '${start_date}','${end_date}')

I was able to write a UDF for the above relatively easily, but when testing it, it performed notably worse than the first example. The reason for that quickly became clear. Even though both functions are deterministic, that is largely irrelevant in the latter example, because the 'time' variable is not static. The query optimizer can do nothing with it, and must do the full function evaluation on each row. Whereas in the first example the optimizer can see that the "sub-functions" of unix_timestamp and date_add are both deterministic and have static inputs, and will only evaluate them once for the whole query.

I would like a way to define a single function that maintains the ability for the optimizer to see which of its constituent parts only need be evaluated once. Whether that is some syntax in the CREATE FUNCTION DDL or some annotations within scala; either would be incredibly useful.


> Add the ability to define composed functions
> --------------------------------------------
>
>                 Key: HIVE-15876
>                 URL: https://issues.apache.org/jira/browse/HIVE-15876
>             Project: Hive
>          Issue Type: Wish
>            Reporter: Peter Attardo
>            Priority: Minor
>
> I'm not entirely sure the solution implied by the subject is the best one, but it's a useful shorthand for addressing the problem I'm seeing.
> I have a use case for wanting to see if a given unix timestamp (with microseconds) falls within a date range expressed by a start_date and an end_date in 'yyyy-MM-dd' format. Without any UDFs, it would look something like this:
> ...
> WHERE
> time >= unix_timestamp('${start_date}','yyyy-MM-dd')*1000
> AND time < unix_timestamp(date_add('${end_date}', 1),'yyyy-MM-dd')*1000
> This condition is obviously quite a pain to read and write, and one that's easy to get wrong when trying to reproduce. I would like to simplify to something like:
> ...
> WHERE time_in_range(time, '${start_date}','${end_date}')
> I was able to write a UDF for the above relatively easily, but when testing it, it performed notably worse than the first example. The reason for that quickly became clear. Even though both functions are deterministic, that is largely irrelevant in the latter example, because the 'time' variable is not static. The query optimizer can do nothing with it, and must do the full function evaluation on each row. Whereas in the first example the optimizer can see that the "sub-functions" of unix_timestamp and date_add are both deterministic and have static inputs, and will only evaluate them once for the whole query.
> I would like a way to define a single function that maintains the ability for the optimizer to see which of its constituent parts only need be evaluated once. Whether that is some syntax in the CREATE FUNCTION DDL or some annotations within scala; either would be incredibly useful.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)