Posted to dev@hive.apache.org by "John Sichi (JIRA)" <ji...@apache.org> on 2011/02/14 23:55:57 UTC

[jira] Created: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Support new annotation @UDFType(stateful = true)
------------------------------------------------

                 Key: HIVE-1994
                 URL: https://issues.apache.org/jira/browse/HIVE-1994
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor, UDF
            Reporter: John Sichi
            Assignee: John Sichi


Because Hive does not yet support window functions from SQL/OLAP, people have started hacking around it by writing stateful UDFs for things like cumulative sum.  An example is row_sequence in contrib.

To clearly mark these, I think we should add a new annotation (with separate semantics from the existing deterministic annotation).  I'm proposing the name stateful for lack of a better idea, but I'm open to suggestions.
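
As a rough illustration of what such a marker could look like, here is a minimal Java sketch.  UDFTypeSketch is a local stand-in declared only so the example compiles without Hive on the classpath; it is not Hive's actual @UDFType.  RowSequenceSketch and StatefulAnnotationDemo are likewise hypothetical names, loosely modeled on the row_sequence idea.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stand-in for the proposed annotation, declared locally so the
// sketch is self-contained. The attribute names follow the issue text.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UDFTypeSketch {
    boolean deterministic() default true;
    boolean stateful() default false;
}

// Hypothetical UDF in the spirit of row_sequence: it keeps a counter across
// calls, so each result depends on how many rows were seen before.
@UDFTypeSketch(deterministic = false, stateful = true)
class RowSequenceSketch {
    private long count = 0;

    long evaluate() {
        count++;
        return count;
    }
}

class StatefulAnnotationDemo {
    // Reads the marker reflectively, which is how a planner could detect it.
    static boolean isStateful(Class<?> udfClass) {
        UDFTypeSketch type = udfClass.getAnnotation(UDFTypeSketch.class);
        return type != null && type.stateful();
    }

    public static void main(String[] args) {
        RowSequenceSketch seq = new RowSequenceSketch();
        System.out.println(seq.evaluate());                      // 1
        System.out.println(seq.evaluate());                      // 2
        System.out.println(isStateful(RowSequenceSketch.class)); // true
    }
}
```

The point of the sketch is only that the flag is metadata on the class, separate from the deterministic flag, and cheap for the query processor to inspect.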

The semantics are as follows:

* A stateful UDF can only be used in the SELECT list, not in other clauses such as WHERE, ON, ORDER BY, or GROUP BY.
* When a stateful UDF is present in a query, there's an implication that its SELECT needs to be treated as similar to TRANSFORM, i.e. when there's a DISTRIBUTE/CLUSTER/SORT clause, the SELECT should run inside the corresponding reducer to make sure that the results are as expected.
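
The first rule could be enforced with a check along these lines.  This is only a sketch under assumptions: ClauseSketch and StatefulPlacementCheck are hypothetical names, and a real semantic analyzer would work on the parsed query rather than on a clause enum passed in by hand.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enumeration of the clauses named in the list above.
enum ClauseSketch { SELECT, WHERE, ON, ORDER_BY, GROUP_BY }

class StatefulPlacementCheck {
    // Per the proposal, the SELECT list is the only place a stateful UDF may appear.
    private static final Set<ClauseSketch> ALLOWED = EnumSet.of(ClauseSketch.SELECT);

    // Throws if a stateful function is referenced from a disallowed clause;
    // non-stateful functions are accepted everywhere.
    static void check(String functionName, boolean stateful, ClauseSketch clause) {
        if (stateful && !ALLOWED.contains(clause)) {
            throw new IllegalArgumentException(
                "Stateful UDF " + functionName + " is not allowed in " + clause);
        }
    }

    public static void main(String[] args) {
        check("row_sequence", true, ClauseSketch.SELECT); // accepted
        try {
            check("row_sequence", true, ClauseSketch.WHERE);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the allowed clauses in one set means a later decision to permit another clause would be a one-line change.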

For the first one, an example of why we need this is AND/OR short-circuiting; we don't want these optimizations to cause the invocation to be skipped in a confusing way, so we should just ban it outright (which is what SQL/OLAP does for window functions).
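
To make the short-circuiting hazard concrete, here is a small Java sketch that uses Java's own && operator as a stand-in for a short-circuiting expression evaluator.  CounterSketch and ShortCircuitDemo are hypothetical names; nothing here is Hive code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stateful function: returns 1, 2, 3, ... on successive calls.
class CounterSketch {
    private long count = 0;

    long next() {
        return ++count;
    }
}

class ShortCircuitDemo {
    // Evaluates "flag AND counter.next() > 0" once per row and records the
    // counter value each row observed (0 means the call was skipped).
    static List<Long> run(boolean[] flags) {
        CounterSketch counter = new CounterSketch();
        List<Long> seen = new ArrayList<>();
        for (boolean flag : flags) {
            long observed = 0;
            // && does not evaluate its right-hand side when flag is false,
            // so the stateful call is silently skipped for that row.
            boolean passes = flag && (observed = counter.next()) > 0;
            seen.add(observed);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(new boolean[] {true, false, true})); // [1, 0, 2]
    }
}
```

The third row sees 2 rather than 3 because the second row never advanced the counter, which is the kind of confusing result the ban is meant to rule out.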

For the second one, I'm not entirely certain about the details since some of it is lost in the mists of Hive prehistory, but at least if we have the annotation, we'll be able to preserve backwards compatibility as we start adding new cost-based optimizations which might otherwise break it.  A specific example would be inserting a materialization step (e.g. for global query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer SELECT containing the stateful UDF invocation; this could be a problem if the mappers in the second job subdivide the buckets generated by the first job.  So we wouldn't do anything immediately, but the presence of the annotation will help us going forward.


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-1994:
-----------------------------

    Attachment: HIVE-1994.2.patch

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch
>
>

[jira] Updated: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-1994:
-----------------------------

    Attachment: HIVE-1994.0.patch

Preliminary patch with everything except the fix to prevent short-circuiting.


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch
>
>

[jira] Updated: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-1994:
-----------------------------

    Fix Version/s: 0.8.0
           Status: Patch Available  (was: Open)

Review board request:

https://reviews.apache.org/r/442/


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>             Fix For: 0.8.0
>
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch, HIVE-1994.3.patch
>
>

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "Jonathan Chang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994546#comment-12994546 ] 

Jonathan Chang commented on HIVE-1994:
--------------------------------------

The AND/OR short-circuiting is an issue for both SELECT and WHERE.  I think stateful UDFs need to poison containing expressions and force them not to short-circuit.



> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>

[jira] Updated: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-1994:
-----------------------------

    Attachment: HIVE-1994.3.patch

HIVE-1994.3.patch removes some crud that slipped into udf_row_sequence.q accidentally, and should pass all tests.


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch, HIVE-1994.3.patch
>
>

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998991#comment-12998991 ] 

Carl Steinbach commented on HIVE-1994:
--------------------------------------

+1. Will commit if tests pass.

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>             Fix For: 0.8.0
>
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch, HIVE-1994.3.patch
>
>

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "Adam Kramer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994581#comment-12994581 ] 

Adam Kramer commented on HIVE-1994:
-----------------------------------

Agree; also consider deprecating DISTRIBUTE/SORT/CLUSTER BY in favor of DISTRIBUTED/SORTED/CLUSTERED BY, a syntax that would explicitly prevent short-circuiting and subdivision for only the query it's a part of.

I can't imagine that "sort by in the subquery leads to assumptions in the parent query" scales well or will last long in any case.  But this functionality is not only necessary for backwards compatibility; it is also kind of the entire reason Hive was developed and/or conceived: to utilize mapreduce functionality in order to transform and process data.  Preventing the querier from making mapreduce assumptions just seems sad.

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995066#comment-12995066 ] 

John Sichi commented on HIVE-1994:
----------------------------------

If stateful is set to true, the UDF should also be treated as non-deterministic (even if the annotation explicitly sets deterministic to true).
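
A sketch of that rule in Java, under assumptions: UDFTypeFlags is a local stand-in annotation rather than Hive's own, DeterminismCheck is a hypothetical helper, and treating unannotated classes as deterministic is an assumption made for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stand-in annotation carrying the two flags discussed here.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UDFTypeFlags {
    boolean deterministic() default true;
    boolean stateful() default false;
}

// Claims to be deterministic but is also stateful: stateful should win.
@UDFTypeFlags(deterministic = true, stateful = true)
class MislabeledUdfSketch {}

// Uses the defaults: deterministic and not stateful.
@UDFTypeFlags
class PlainUdfSketch {}

class DeterminismCheck {
    // A stateful UDF is never treated as deterministic, whatever the
    // deterministic flag says.
    static boolean isEffectivelyDeterministic(Class<?> udfClass) {
        UDFTypeFlags flags = udfClass.getAnnotation(UDFTypeFlags.class);
        if (flags == null) {
            return true; // assumption for this sketch: unannotated means deterministic
        }
        return flags.deterministic() && !flags.stateful();
    }

    public static void main(String[] args) {
        System.out.println(isEffectivelyDeterministic(MislabeledUdfSketch.class)); // false
        System.out.println(isEffectivelyDeterministic(PlainUdfSketch.class));      // true
    }
}
```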

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994557#comment-12994557 ] 

John Sichi commented on HIVE-1994:
----------------------------------

@Jonathan:  good point that we need to prevent short-circuiting from causing problems in the SELECT list too (e.g. inside of CASE/AND/OR).  Optimally, we should figure out how to make sure they get eagerly evaluated exactly once before evaluating the entire expression; that way short-circuiting can still be used.
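
One way to picture the "eagerly evaluated exactly once" idea is the Java sketch below.  It is not what any patch does; EagerCounterSketch and EagerEvalDemo are hypothetical names, and Java's && again stands in for a short-circuiting evaluator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stateful function: returns 1, 2, 3, ... on successive calls.
class EagerCounterSketch {
    private long count = 0;

    long next() {
        return ++count;
    }
}

class EagerEvalDemo {
    // Evaluates "flag AND counter.next() > 0" per row, but computes the
    // stateful sub-expression first, exactly once, before the AND runs.
    static List<Long> run(boolean[] flags) {
        EagerCounterSketch counter = new EagerCounterSketch();
        List<Long> seen = new ArrayList<>();
        for (boolean flag : flags) {
            long value = counter.next();        // eager: runs for every row
            boolean passes = flag && value > 0; // short-circuit only skips the comparison
            seen.add(value);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(new boolean[] {true, false, true})); // [1, 2, 3]
    }
}
```

Because the stateful call is hoisted out of the boolean expression, every row advances the counter regardless of the flag, and short-circuiting remains usable for the rest of the expression.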

We should still prevent them entirely outside of the SELECT list to avoid semantic ambiguity from other optimizations (e.g. decomposing predicates during predicate pushdown).  But I just checked SQL/OLAP, and it does allow them in ORDER BY, which makes sense for reporting.


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>

[jira] Updated: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1994:
---------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed. Thanks John!

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>             Fix For: 0.8.0
>
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch, HIVE-1994.3.patch
>
>

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998197#comment-12998197 ] 

John Sichi commented on HIVE-1994:
----------------------------------

HIVE-1994.1.patch addresses short-circuiting.  I'm running it through tests now.


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch

[jira] Updated: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-1994:
-----------------------------

    Attachment: HIVE-1994.1.patch

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998171#comment-12998171 ] 

John Sichi commented on HIVE-1994:
----------------------------------

Note that for CASE expressions, we *always* want short-circuiting; otherwise it's impossible to write something like
case when x < 0 then sqrt(-x) else sqrt(x) end
(which avoids taking the square root of a negative number).  So if we detect a stateful UDF inside a CASE expression, we'll throw an exception.
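
To illustrate the conflict, here is a minimal self-contained Java sketch. It is not Hive code and does not reflect the patch: RowSequence, caseWhen, safeSqrt and numberNonNegativeRows are hypothetical names made up for this example, and RowSequence is only a stand-in for a stateful UDF such as row_sequence. The sketch models CASE as lazy branch evaluation and shows that the same laziness which keeps sqrt safe also silently skips invocations of a stateful function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class StatefulCaseSketch {

    /** Hypothetical stand-in for a stateful UDF: returns 1, 2, 3, ... on successive calls. */
    static final class RowSequence {
        private long next = 0;

        long evaluate() {
            return ++next;
        }
    }

    /** Models a short-circuiting CASE: only the selected branch is evaluated. */
    public static <T> T caseWhen(boolean condition, Supplier<T> thenBranch, Supplier<T> elseBranch) {
        return condition ? thenBranch.get() : elseBranch.get();
    }

    /** case when x < 0 then sqrt(-x) else sqrt(x) end: sqrt never sees a negative input. */
    public static double safeSqrt(double x) {
        return caseWhen(x < 0, () -> Math.sqrt(-x), () -> Math.sqrt(x));
    }

    /** case when x < 0 then -1 else <stateful call> end, applied to each row in order. */
    public static List<Long> numberNonNegativeRows(double[] rows) {
        RowSequence seq = new RowSequence();
        List<Long> out = new ArrayList<>();
        for (double x : rows) {
            // The stateful call is skipped whenever x < 0, so its counter
            // falls out of step with the actual row position.
            out.add(caseWhen(x < 0, () -> -1L, seq::evaluate));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(safeSqrt(-9.0));                                       // prints 3.0
        System.out.println(numberNonNegativeRows(new double[] {4.0, -1.0, 9.0})); // prints [1, -1, 2]
    }
}
```

In this sketch the third row gets 2 rather than 3 because the call for the second row never happened. Rejecting the query up front avoids that kind of surprise.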


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch

[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998208#comment-12998208 ] 

John Sichi commented on HIVE-1994:
----------------------------------

HIVE-1994.2.patch fixes some conflicts.


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch