Posted to dev@hive.apache.org by "Ashish Thusoo (JIRA)" <ji...@apache.org> on 2010/02/04 01:46:28 UTC

[jira] Created: (HIVE-1131) Add column lineage information to the pre execution hooks

Add column lineage information to the pre execution hooks
---------------------------------------------------------

                 Key: HIVE-1131
                 URL: https://issues.apache.org/jira/browse/HIVE-1131
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Ashish Thusoo
            Assignee: Ashish Thusoo


We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:

- auditing
- dependency checking

and many other applications.

The proposal is to expose this to clients through a set of classes passed to the pre-execution hook interface, and to add the necessary transformation logic to the optimizer to generate this information.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131_8.patch

Another one with test fixes.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch, HIVE-1131_8.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851674#action_12851674 ] 

Ashish Thusoo commented on HIVE-1131:
-------------------------------------

Look at the DataContainer class. It has a partition in it, and the Dependency has a mapping from Partition to the dependencies. Can you explain your concerns about inefficiency in more detail?

For S6, the QueryPlan is actually the wrong place to store the LineageInfo. Because of the dynamic partitioning work that Ning is doing, I have to generate the partition-to-dependency mapping at run time, so I would rather store it in a runtime structure than in a compile-time structure. SessionState fits that bill, though I think we should have another structure called ExecutionCtx for this. Either way, I think we want to store this in a runtime structure.

For S2, I will add some more comments.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854158#action_12854158 ] 

Namit Jain commented on HIVE-1131:
----------------------------------

+1

Looks good. Running the tests again; will merge if they pass.

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch, HIVE-1131_8.patch, hive.1131.9.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835487#action_12835487 ] 

Raghotham Murthy commented on HIVE-1131:
----------------------------------------

Went over the code with Ashish. A few things:

1. The hash<key1, hash<key2, value>> paradigm can be changed to hash<pair<key1, key2>, value>. That will reduce the amount of code needed; for example, there is no need for special iterator and item classes.
2. The code that records visits to nodes can be removed.
3. PreOrderWalker.java does not contain any changes.
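
To illustrate point 1, here is a minimal sketch of the two map layouts. The class name, the table and column strings, and the use of SimpleImmutableEntry as the pair type are all assumptions made for illustration; none of them come from the patch.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; the class, keys, and values are not from the patch.
final class PairKeyedMapSketch {

  // Nested form: hash<key1, hash<key2, value>>.
  static Map<String, Map<String, String>> nested() {
    Map<String, Map<String, String>> m = new LinkedHashMap<>();
    m.computeIfAbsent("dest_table", k -> new LinkedHashMap<>()).put("key", "src.key");
    m.computeIfAbsent("dest_table", k -> new LinkedHashMap<>()).put("value", "src.value");
    return m;
  }

  // Flat form: hash<pair<key1, key2>, value>. SimpleImmutableEntry already
  // implements equals() and hashCode(), so it can serve as the pair key.
  static Map<Map.Entry<String, String>, String> flat() {
    Map<Map.Entry<String, String>, String> m = new LinkedHashMap<>();
    m.put(new SimpleImmutableEntry<>("dest_table", "key"), "src.key");
    m.put(new SimpleImmutableEntry<>("dest_table", "value"), "src.value");
    return m;
  }

  public static void main(String[] args) {
    // The nested form needs two loops (or a custom iterator) to visit every value.
    for (Map.Entry<String, Map<String, String>> outer : nested().entrySet()) {
      for (Map.Entry<String, String> inner : outer.getValue().entrySet()) {
        System.out.println(outer.getKey() + "." + inner.getKey() + " <- " + inner.getValue());
      }
    }
    // The flat form is visited with the map's own entry iterator.
    for (Map.Entry<Map.Entry<String, String>, String> e : flat().entrySet()) {
      System.out.println(e.getKey().getKey() + "." + e.getKey().getValue() + " <- " + e.getValue());
    }
  }
}
```

With the flat layout, the standard entrySet() iterator is enough, which is why the special iterator and item classes become unnecessary.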

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853323#action_12853323 ] 

Zheng Shao commented on HIVE-1131:
----------------------------------

Still seeing test failures with HIVE-1131_7.patch:

{code}
.ptest_0/test.17.2.1.log:    [junit] Begin query: groupby8.q
.ptest_0/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
.ptest_1/test.17.2.1.log:    [junit] Begin query: groupby8_map_skew.q
.ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
.ptest_1/test.17.2.1.log:    [junit] Begin query: multi_insert.q
.ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
.ptest_1/test.17.2.1.log:    [junit] Begin query: reduce_deduplicate.q
.ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
.ptest_1/test.17.2.1.log:    [junit] Begin query: union18.q
.ptest_1/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
.ptest_2/test.17.2.1.log:    [junit] Begin query: groupby7.q
.ptest_2/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
.ptest_2/test.17.2.1.log:    [junit] Begin query: groupby8_noskew.q
.ptest_2/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
.ptest_2/test.17.2.1.log:    [junit] Begin query: input12.q
.ptest_2/test.17.2.1.log:    [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1
--
{code}


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131.patch

This is just the source patch. Will publish the test patch soon.

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836103#action_12836103 ] 

Zheng Shao commented on HIVE-1131:
----------------------------------

S1. Can we make lineage partition-level instead of table-level?
S2. We might want to formally define the concepts of these levels, especially how they compose (what is the type of a UDF of a UDAF, or a UDAF of a UDF, such as round(sum(col)) or sum(round(col))?).
{code}
+  /**
+   * Enum to track dependency. This enum has two values:
+   * 1. SCALAR - Indicates that the column is derived from a scalar expression.
+   * 2. AGGREGATION - Indicates that the column is derived from an aggregation.
+   */
+  public static enum DependencyType {
+    SIMPLE, UDF, UDAF, UDTF, SCRIPT, SET
+  }
+  
{code}

S3. Use "{}" even for a single statement in "if", "for", etc.
S4. Use "ArrayList" instead of "Vector" when it's accessed by a single thread.
S5. Remove "private HashMap<FileSinkOperator, Table> fopToTable;" since it's not used.
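
As one way to make the S2 question concrete, the sketch below shows a possible composition rule. The enum values are taken from the quoted patch; the ranking by declaration order and the "strongest type wins" rule are assumptions made purely for illustration, not what the patch does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the enum values come from the quoted patch, but the
// ranking and the combine rule below are assumptions made for illustration.
final class DependencyTypeSketch {

  // Assumed to be declared from weakest to strongest so ordinal() can rank them.
  enum DependencyType { SIMPLE, UDF, UDAF, UDTF, SCRIPT, SET }

  // Assumed rule: a composed expression takes the strongest type involved.
  static DependencyType combine(DependencyType outer, DependencyType inner) {
    if (outer.ordinal() >= inner.ordinal()) {
      return outer;
    }
    return inner;
  }

  // Folds the types along an expression chain, listed outermost first.
  static DependencyType typeOf(List<DependencyType> chain) {
    DependencyType result = DependencyType.SIMPLE;
    for (DependencyType t : chain) {
      result = combine(result, t);
    }
    return result;
  }

  public static void main(String[] args) {
    // round(sum(col)): a UDF applied to a UDAF.
    System.out.println(typeOf(Arrays.asList(DependencyType.UDF, DependencyType.UDAF)));
    // sum(round(col)): a UDAF applied to a UDF.
    System.out.println(typeOf(Arrays.asList(DependencyType.UDAF, DependencyType.UDF)));
  }
}
```

Under this assumed rule both round(sum(col)) and sum(round(col)) come out as UDAF, which is one possible answer to the question; a formal definition might well choose differently.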


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Release Note: This changes the signature of PostExecute.java
    Hadoop Flags: [Incompatible change]
          Status: Patch Available  (was: Open)

submitting.

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852013#action_12852013 ] 

Ashish Thusoo commented on HIVE-1131:
-------------------------------------

I looked at the ExecutionCtx stuff. There are at least 3 different unrelated fields in SessionState that we should also move to the ExecutionCtx. I will file a follow-up JIRA for it, but I think we should get this one in. After I refreshed, I did see some test failures caused by using HashMaps and the resulting change in ordering. Will fix that and upload a new patch.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849624#action_12849624 ] 

Ashish Thusoo commented on HIVE-1131:
-------------------------------------

Comment 3 from Raghu and comments S2-S4 from Zheng are not yet incorporated.

The new patch overhauls things a bit to support partition-level lineage and does this in a post-execute hook. It gets rid of the visits and the iterator classes. Will fix the other comments in the patch with the test cases.
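
A rough sketch of what a post-execute hook consuming partition-level lineage could look like follows. Every type here (ColumnKey, PostExecuteLineageHook, AuditHook) is a hypothetical stand-in invented for this example; the real classes and the actual hook signature are defined by the patch and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch; none of these types are Hive's real classes.
final class LineageHookSketch {

  // Identifies one output column of one partition of a table.
  static final class ColumnKey {
    final String table;
    final String partition;
    final String column;

    ColumnKey(String table, String partition, String column) {
      this.table = table;
      this.partition = partition;
      this.column = column;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof ColumnKey)) {
        return false;
      }
      ColumnKey k = (ColumnKey) o;
      return table.equals(k.table) && partition.equals(k.partition) && column.equals(k.column);
    }

    @Override public int hashCode() {
      return Objects.hash(table, partition, column);
    }

    @Override public String toString() {
      return table + "[" + partition + "]." + column;
    }
  }

  // Stand-in for a post-execution hook that is handed the lineage of a query.
  interface PostExecuteLineageHook {
    void run(Map<ColumnKey, List<String>> lineage);
  }

  // Example hook for auditing: records one line per output column.
  static final class AuditHook implements PostExecuteLineageHook {
    final List<String> auditLog = new ArrayList<>();

    @Override public void run(Map<ColumnKey, List<String>> lineage) {
      for (Map.Entry<ColumnKey, List<String>> e : lineage.entrySet()) {
        auditLog.add(e.getKey() + " <- " + e.getValue());
      }
    }
  }

  public static void main(String[] args) {
    Map<ColumnKey, List<String>> lineage = new LinkedHashMap<>();
    lineage.put(new ColumnKey("dest", "ds=2010-04-01", "key"), Arrays.asList("src.key"));
    AuditHook hook = new AuditHook();
    hook.run(lineage);
    System.out.println(hook.auditLog);
  }
}
```

Keying the map by table, partition, and column is what makes the lineage partition-level rather than table-level in this sketch.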


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1131:
---------------------------------

    Fix Version/s: 0.6.0

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch, HIVE-1131_8.patch, hive.1131.9.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131_7.patch

Another patch, which changes QueryPlan to use LinkedHashMaps, since the map ordering there was also causing instability in the tests.
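
For readers unfamiliar with the ordering issue: java.util.HashMap makes no guarantee about iteration order, whereas LinkedHashMap iterates in insertion order, which keeps printed test output stable. The sketch below demonstrates only this general JDK behaviour; the class name and the column strings are made up for the example and are not from QueryPlan.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; the keys are invented and unrelated to QueryPlan's fields.
final class StableOrderSketch {

  // Returns the keys in whatever order the map iterates them.
  static List<String> keysInIterationOrder(Map<String, Integer> map) {
    return new ArrayList<>(map.keySet());
  }

  // A LinkedHashMap remembers the order in which keys were first inserted.
  static Map<String, Integer> insertionOrdered() {
    Map<String, Integer> m = new LinkedHashMap<>();
    m.put("src.value", 1);
    m.put("src.key", 2);
    m.put("dest.part", 3);
    return m;
  }

  public static void main(String[] args) {
    // Always printed in insertion order, so golden test output does not
    // depend on hash codes, table capacity, or the JVM in use.
    System.out.println(keysInIterationOrder(insertionOrdered()));
  }
}
```

The same three entries placed in a plain HashMap could come back in a different order, and that order is free to change when the map's contents or capacity change.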

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849633#action_12849633 ] 

Ashish Thusoo commented on HIVE-1131:
-------------------------------------

Also, I did not find any instance of S3 in the code. Perhaps you just mentioned it for completeness, but if you do find an instance, please let me know the offending file.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131_2.patch

Patch with all the review comments incorporated. This is just the source patch. Will be uploading the fixed tests shortly.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1131:
-----------------------------

    Attachment: hive.1131.9.patch

Uploaded a new patch with updated test results.

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch, HIVE-1131_8.patch, hive.1131.9.patch


[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851718#action_12851718 ] 

Zheng Shao commented on HIVE-1131:
----------------------------------

> Look at the DataContainer class. It has a partition in it, and the Dependency has a mapping from Partition to the dependencies. Can you explain your concerns about inefficiency in more detail?

I see. So the DataContainer captures the output partition information, but we don't have input partition information (BaseColumnInfo/TableAliasInfo). This is reasonable, since the input can span many partitions.

> For S6, the QueryPlan is actually the wrong place to store the LineageInfo. Because of the dynamic partitioning work that Ning is doing, I have to generate the partition-to-dependency mapping at run time, so I would rather store it in a runtime structure than in a compile-time structure. SessionState fits that bill, though I think we should have another structure called ExecutionCtx for this. Either way, I think we want to store this in a runtime structure.

+1 on the ExecutionCtx idea. SessionState is at the session level, and LineageInfo is at the query level. It will be great to put LineageInfo into ExecutionCtx.
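
The session-level versus query-level distinction could be sketched roughly as follows. SessionSketch and ExecutionCtxSketch are hypothetical stand-ins written for this example; they are not Hive's SessionState, nor the ExecutionCtx proposed in this thread, whose design is not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical query-level context: created per query, discarded afterwards.
final class ExecutionCtxSketch {
  private final Map<String, String> lineage = new LinkedHashMap<>();

  // Records, at run time, which source expression produced an output column.
  void recordLineage(String outputColumn, String sourceExpression) {
    lineage.put(outputColumn, sourceExpression);
  }

  Map<String, String> lineage() {
    return lineage;
  }
}

// Hypothetical session-level state: survives across queries.
final class SessionSketch {
  private final Map<String, String> settings = new LinkedHashMap<>();
  private int queriesRun = 0;

  void set(String name, String value) {
    settings.put(name, value);
  }

  String get(String name) {
    return settings.get(name);
  }

  // Each query gets a fresh context, so lineage never leaks between queries.
  ExecutionCtxSketch beginQuery() {
    queriesRun++;
    return new ExecutionCtxSketch();
  }

  int queriesRun() {
    return queriesRun;
  }

  public static void main(String[] args) {
    SessionSketch session = new SessionSketch();
    session.set("example.setting", "true");

    ExecutionCtxSketch first = session.beginQuery();
    first.recordLineage("dest.key", "src.key");

    ExecutionCtxSketch second = session.beginQuery();
    // The session setting persists; the lineage of the first query does not.
    System.out.println(session.get("example.setting") + " " + second.lineage().isEmpty());
  }
}
```

Keeping lineage in the per-query object means nothing has to be cleared by hand between queries, which is the practical benefit of separating it from session-wide state.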


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131_3.patch

This fixes all the review comments. Will post the patch with tests separately.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch


[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131_6.patch

With fixes to tests and with null dropped.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch
>
>
> We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:
> - auditing
> - dependency checking
> and many other applications.
> The proposal is to expose this through a bunch of classes to the pre execution hook interface to the clients and put in the necessary transformation logic in the optimizer to generate this information.



[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131_4.patch

This patch has all the tests updated as well.


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch
>
>
> We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:
> - auditing
> - dependency checking
> and many other applications.
> The proposal is to expose this through a bunch of classes to the pre execution hook interface to the clients and put in the necessary transformation logic in the optimizer to generate this information.



[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831303#action_12831303 ] 

Zheng Shao commented on HIVE-1131:
----------------------------------

1. LineageInfo and related classes (that are used in PreExecutionHook/PostExecutionHook) need to implement Serializable so that we can serialize out the whole execution plan (including the hooks).
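
A minimal sketch of what point 1 asks for. ColumnLineage below is a hypothetical, simplified stand-in and not the class from the patch, and plain Java object serialization is used only for illustration; whether the plan is actually written out this way is an assumption. The point is that every class reachable from the hook arguments implements java.io.Serializable, so a round trip preserves the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lineage record; not the class from the patch.
class ColumnLineage implements Serializable {
  private static final long serialVersionUID = 1L;

  final String targetColumn;
  final List<String> sourceColumns; // ArrayList is itself Serializable

  ColumnLineage(String targetColumn, List<String> sourceColumns) {
    this.targetColumn = targetColumn;
    this.sourceColumns = new ArrayList<String>(sourceColumns);
  }
}

class LineageSerializationSketch {
  // Writes the object to an in-memory buffer and reads it back.
  static ColumnLineage roundTrip(ColumnLineage original) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
        out.writeObject(original);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        return (ColumnLineage) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("lineage record is not serializable", e);
    }
  }
}
```

If any field type in such a class were not Serializable, writeObject would fail with a NotSerializableException, which is the failure mode this comment is guarding against.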


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch
>
>
> We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:
> - auditing
> - dependency checking
> and many other applications.
> The proposal is to expose this through a bunch of classes to the pre execution hook interface to the clients and put in the necessary transformation logic in the optimizer to generate this information.



[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851664#action_12851664 ] 

Zheng Shao commented on HIVE-1131:
----------------------------------

> S1. Can we make lineage partition-level instead of table-level?
I don't see this implemented in the new patch. After looking at the code more, I'd agree that this is too hard (and inefficient) to do, when the query has a range over a lot of partitions.

> S3. Use "{}" even for single statement in "if", "for" etc.
I cannot find any instances of these now.


Still have some questions:
> S2. We might want to define formally the concepts of these levels, especially how they are composited (What will be UDAF of UDF, or UDF of UDAF, like round(sum(col)), or sum(round(col))) 
LineageInfo.java: Can you add some comments on what DependencyType the nested dependencies like "round(sum(col))" or "sum(round(col))" have?

S6. The best place to store LineageInfo is probably in the QueryPlan instead of SessionState.  Otherwise the LineageInfo will be lost when we run a query that is compiled earlier. Thoughts?
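
To illustrate the concern in S6, here is a rough sketch with hypothetical stand-in classes (CompiledPlan and Session are not Hive's QueryPlan or SessionState, and the String-to-String lineage map is a simplification). It assumes the session keeps only the lineage of the most recent compilation, which is what makes the earlier query's lineage disappear unless the plan carries its own copy.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a compiled plan that owns its lineage.
class CompiledPlan {
  final String query;
  final Map<String, String> lineage; // target column -> source column

  CompiledPlan(String query, Map<String, String> lineage) {
    this.query = query;
    this.lineage =
        Collections.unmodifiableMap(new LinkedHashMap<String, String>(lineage));
  }
}

// Hypothetical stand-in for per-session state.
class Session {
  // Overwritten by every compilation, so it reflects only the latest query.
  Map<String, String> lastLineage = Collections.emptyMap();

  CompiledPlan compile(String query, String target, String source) {
    Map<String, String> lineage = new LinkedHashMap<String, String>();
    lineage.put(target, source);
    lastLineage = lineage;
    return new CompiledPlan(query, lineage);
  }
}
```

After compiling a second query in the same session, the first plan still knows its own lineage, while the session-level field no longer does.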


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch
>
>
> We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:
> - auditing
> - dependency checking
> and many other applications.
> The proposal is to expose this through a bunch of classes to the pre execution hook interface to the clients and put in the necessary transformation logic in the optimizer to generate this information.



[jira] Commented: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852200#action_12852200 ] 

Zheng Shao commented on HIVE-1131:
----------------------------------

The following tests failed, mostly because of the order of the output lines (see the diff in the example below). Can you take a look?
Also, it would be great to get rid of the "null" after EXPRESSION in the following example.


{code}
groupby11.q
groupby7_map_skew.q
input13.q
script_pipe.q
groupby9.q
multi_insert.q
union17.q

example:
    [junit] diff -a -I file: -I /tmp/ -I invalidscheme: -I lastUpdateTime -I lastAccessTime -I owner -I transient_lastDdlTime\
 -I java.lang.RuntimeException -I at org -I at sun -I at java -I at junit -I Caused by: -I [.][.][.] [0-9]* more /data/users/\
zshao/hadoop_hive_trunk/.ptest_1/build/ql/test/logs/clientpositive/groupby9.q.out /data/users/zshao/hadoop_hive_trunk/.ptest_\
1/ql/src/test/results/clientpositive/groupby9.q.out
    [junit] 238,239d237
    [junit] < POSTHOOK: Lineage: dest1.key EXPRESSION null[(src)src.FieldSchema(name:key, type:string, comment:default), ]
    [junit] < POSTHOOK: Lineage: dest1.value EXPRESSION null[(src)src.FieldSchema(name:value, type:string, comment:default), \
]
    [junit] 242a241,242
    [junit] > POSTHOOK: Lineage: dest1.key EXPRESSION null[(src)src.FieldSchema(name:key, type:string, comment:default), ]
    [junit] > POSTHOOK: Lineage: dest1.value EXPRESSION null[(src)src.FieldSchema(name:value, type:string, comment:default), \
]

{code}


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch
>
>
> We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:
> - auditing
> - dependency checking
> and many other applications.
> The proposal is to expose this through a bunch of classes to the pre execution hook interface to the clients and put in the necessary transformation logic in the optimizer to generate this information.



[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1131:
-----------------------------

      Resolution: Fixed
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Incompatible change])
          Status: Resolved  (was: Patch Available)

Committed. Thanks Ashish

> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch, HIVE-1131_6.patch, HIVE-1131_7.patch, HIVE-1131_8.patch, hive.1131.9.patch
>
>
> We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:
> - auditing
> - dependency checking
> and many other applications.
> The proposal is to expose this through a bunch of classes to the pre execution hook interface to the clients and put in the necessary transformation logic in the optimizer to generate this information.



[jira] Updated: (HIVE-1131) Add column lineage information to the pre execution hooks

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1131:
--------------------------------

    Attachment: HIVE-1131_5.patch

Added a more centralized function to decide the dependency type. Also reduced the number of dependency types to SIMPLE, EXPRESSION and SCRIPT: SIMPLE = a copy of the column, EXPRESSION = UDF, UDAF, UDTF or union all, SCRIPT = a user script is used.
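
As a rough illustration of what a centralized decision could look like, here is a sketch. The class name, method name and the combination rule (the more opaque of the two types wins) are assumptions for illustration and are not taken from the patch.

```java
// Hypothetical sketch; the names and the combination rule are assumptions.
enum DependencyType {
  SIMPLE,      // plain copy of a source column
  EXPRESSION,  // UDF, UDAF, UDTF or union all
  SCRIPT       // a user script is involved
}

class DependencyTypes {
  // Declaration order doubles as an "opacity" order, so the later constant wins.
  static DependencyType combine(DependencyType inner, DependencyType outer) {
    if (inner == null) {
      return outer;
    }
    if (outer == null) {
      return inner;
    }
    return inner.compareTo(outer) >= 0 ? inner : outer;
  }
}
```

Under this assumed rule, a nested expression such as round(sum(col)) or sum(round(col)) comes out as EXPRESSION regardless of nesting order, and anything that passes through a script stays SCRIPT.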

Also changed the HashMap to a LinkedHashMap.
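
For context on why the map type matters for test output: a LinkedHashMap iterates in insertion order, whereas HashMap makes no ordering guarantee. The sketch below uses a hypothetical helper and an illustrative line format, not the hook's actual output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the line format is illustrative only.
class LineageOrderSketch {
  // Produces one line per entry, in the map's iteration order.
  static List<String> render(Map<String, String> lineage) {
    List<String> lines = new ArrayList<String>();
    for (Map.Entry<String, String> entry : lineage.entrySet()) {
      lines.add("Lineage: " + entry.getKey() + " " + entry.getValue());
    }
    return lines;
  }

  // Entries are deliberately inserted in non-alphabetical order.
  static Map<String, String> sample() {
    Map<String, String> lineage = new LinkedHashMap<String, String>();
    lineage.put("dest1.value", "EXPRESSION");
    lineage.put("dest1.key", "EXPRESSION");
    return lineage;
  }
}
```

Because the rendered lines follow insertion order, repeated runs give the same sequence, which keeps line-by-line comparisons against expected output files stable.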


> Add column lineage information to the pre execution hooks
> ---------------------------------------------------------
>
>                 Key: HIVE-1131
>                 URL: https://issues.apache.org/jira/browse/HIVE-1131
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>         Attachments: HIVE-1131.patch, HIVE-1131_2.patch, HIVE-1131_3.patch, HIVE-1131_4.patch, HIVE-1131_5.patch
>
>
> We need a mechanism to pass the lineage information of the various columns of a table to a pre execution hook so that applications can use that for:
> - auditing
> - dependency checking
> and many other applications.
> The proposal is to expose this through a bunch of classes to the pre execution hook interface to the clients and put in the necessary transformation logic in the optimizer to generate this information.
