Posted to dev@atlas.apache.org by Suma Shivaprasad <su...@gmail.com> on 2016/06/20 03:51:24 UTC

Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/
-----------------------------------------------------------

Review request for atlas, Shwetha GS and Hemanth Yamijala.


Repository: atlas


Description
-------

1. Process qualified name = HiveOperation.name + sorted inputs + sorted outputs
2. HiveOperation.name doesn't provide identifiers for distinguishing INSERT, INSERT_OVERWRITE, UPDATE, DELETE etc. separately. Hence WriteEntity.WriteType is added as well, with the following behaviour:
a. If there are multiple outputs, the query type (WriteType) is added for each output.
b. If the query is of type INSERT [INTO/OVERWRITE] TABLE [PARTITION], WriteType is INSERT/INSERT_OVERWRITE.
c. If the query is of type INSERT OVERWRITE hdfs_path, WriteType is PATH_WRITE.
d. If the query is of type UPDATE/DELETE, the type added is UPDATE/DELETE. [Note: lineage is not available for this, since it is a single-table operation.]
3. When an input is a local dir or HDFS path, it is currently not added to the qualified name. The reason is that partition-based paths would cause a lot of processes to be created instead of updating the same process.
Pending:
Address Shwetha G S's suggestion to add HDFS paths to the process qualified name only for non-partition-based queries. This needs to be done per HiveOperation type:
1. If HiveOperation = LOAD, IMPORT, EXPORT: detect whether the current query context is dealing with partitions, and do not add the path if it is partition based.
2. If HiveOperation = INSERT OVERWRITE DFS_PATH/LOCAL_PATH: detect whether the query context is dealing with a partitioned table in inputs, and decide whether to add the path.
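
For illustration only, the naming scheme described above could be sketched roughly as follows. This is not the code under review: the class ProcessNameSketch, the Input and Output types, and both separator strings are hypothetical names chosen for the example. The sketch assumes that path inputs are skipped, that inputs and outputs are sorted by name, and that each output is prefixed with its write type.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; Input and Output are stand-ins, not Hive's ReadEntity/WriteEntity.
final class ProcessNameSketch {
    static final String SEP = ":";      // assumed separator between entries
    static final String IO_SEP = "->";  // assumed separator between inputs and outputs

    static final class Input {
        final String name;
        final boolean isPath; // true for a local dir or HDFS path
        Input(String name, boolean isPath) { this.name = name; this.isPath = isPath; }
    }

    static final class Output {
        final String name;
        final String writeType; // e.g. INSERT, INSERT_OVERWRITE, PATH_WRITE, UPDATE, DELETE
        Output(String name, String writeType) { this.name = name; this.writeType = writeType; }
    }

    static String qualifiedName(String operation, List<Input> inputs, List<Output> outputs) {
        StringBuilder buffer = new StringBuilder(operation);

        // Sort inputs so the same query always yields the same name; skip path inputs.
        TreeSet<String> sortedInputs = new TreeSet<>();
        for (Input in : inputs) {
            if (!in.isPath) {
                sortedInputs.add(in.name);
            }
        }
        for (String name : sortedInputs) {
            buffer.append(SEP).append(name);
        }

        // Sort outputs by name and keep the write type recorded for each one.
        TreeMap<String, String> sortedOutputs = new TreeMap<>();
        for (Output out : outputs) {
            sortedOutputs.put(out.name, out.writeType);
        }
        boolean first = true;
        for (Map.Entry<String, String> e : sortedOutputs.entrySet()) {
            // The input/output separator is written once, before the first output entry.
            buffer.append(first ? IO_SEP : SEP);
            buffer.append(e.getValue()).append(SEP).append(e.getKey());
            first = false;
        }
        return buffer.toString();
    }
}
```

Because both collections are sorted before being appended, the order in which the hook reports inputs and outputs does not change the resulting name in this sketch.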


Diffs
-----

  addons/hive-bridge/src/main/java/org/apache/atlas/hive/bridge/HiveMetaStoreBridge.java c956a32 
  addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java 23c82df 
  addons/hive-bridge/src/test/java/org/apache/atlas/hive/hook/HiveHookIT.java e7fbf71 
  webapp/src/main/java/org/apache/atlas/web/resources/EntityResource.java 0713d30 

Diff: https://reviews.apache.org/r/48939/diff/


Testing
-------

Existing tests modified to query with new qualified name. Need to add tests for INSERT INTO TABLE


Thanks,

Suma Shivaprasad


Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Hemanth Yamijala <yh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138561
-----------------------------------------------------------




addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 897)
<https://reviews.apache.org/r/48939/#comment203754>

    Should we consider a case-insensitive compare?


- Hemanth Yamijala




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Hemanth Yamijala <yh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138860
-----------------------------------------------------------


Ship it!





- Hemanth Yamijala




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Suma Shivaprasad <su...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/
-----------------------------------------------------------

(Updated June 20, 2016, 6:22 p.m.)


Review request for atlas, Shwetha GS and Hemanth Yamijala.


Changes
-------

Fixed test failures due to lower case change


Bugs: ATLAS-904
    https://issues.apache.org/jira/browse/ATLAS-904


Repository: atlas


Description
-------

1. Process qualified name = HiveOperation.name + sorted inputs + sorted outputs
2. HiveOperation.name doesn't provide identifiers for distinguishing INSERT, INSERT_OVERWRITE, UPDATE, DELETE etc. separately. Hence WriteEntity.WriteType is added as well, with the following behaviour:
a. If there are multiple outputs, the query type (WriteType) is added for each output.
b. If the query is of type INSERT [INTO/OVERWRITE] TABLE [PARTITION], WriteType is INSERT/INSERT_OVERWRITE.
c. If the query is of type INSERT OVERWRITE hdfs_path, WriteType is PATH_WRITE.
d. If the query is of type UPDATE/DELETE, the type added is UPDATE/DELETE. [Note: lineage is not available for this, since it is a single-table operation.]
3. When an input is a local dir or HDFS path, it is currently not added to the qualified name. The reason is that partition-based paths would cause a lot of processes to be created instead of updating the same process.
Pending:
Address Shwetha G S's suggestion to add HDFS paths to the process qualified name only for non-partition-based queries. This needs to be done per HiveOperation type:
1. If HiveOperation = LOAD, IMPORT, EXPORT: detect whether the current query context is dealing with partitions, and do not add the path if it is partition based.
2. If HiveOperation = INSERT OVERWRITE DFS_PATH/LOCAL_PATH: detect whether the query context is dealing with a partitioned table in inputs, and decide whether to add the path.


Diffs (updated)
-----

  addons/hive-bridge/src/main/java/org/apache/atlas/hive/bridge/HiveMetaStoreBridge.java c956a32 
  addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java 5d9950f 
  addons/hive-bridge/src/test/java/org/apache/atlas/hive/hook/HiveHookIT.java 5a175e7 
  webapp/src/main/java/org/apache/atlas/web/resources/EntityResource.java 0713d30 

Diff: https://reviews.apache.org/r/48939/diff/


Testing
-------

Existing tests modified to query with new qualified name. Need to add tests for INSERT INTO TABLE


Thanks,

Suma Shivaprasad


Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Suma Shivaprasad <su...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/
-----------------------------------------------------------

(Updated June 20, 2016, 5:27 p.m.)


Review request for atlas, Shwetha GS and Hemanth Yamijala.


Changes
-------

Thanks for reviewing Hemanth. Fixed review comments. Please reopen any issue which I have dropped if you feel it should be addressed or if you have any more questions.


Bugs: ATLAS-904
    https://issues.apache.org/jira/browse/ATLAS-904


Repository: atlas


Description
-------

1. Process qualified name = HiveOperation.name + sorted inputs + sorted outputs
2. HiveOperation.name doesn't provide identifiers for distinguishing INSERT, INSERT_OVERWRITE, UPDATE, DELETE etc. separately. Hence WriteEntity.WriteType is added as well, with the following behaviour:
a. If there are multiple outputs, the query type (WriteType) is added for each output.
b. If the query is of type INSERT [INTO/OVERWRITE] TABLE [PARTITION], WriteType is INSERT/INSERT_OVERWRITE.
c. If the query is of type INSERT OVERWRITE hdfs_path, WriteType is PATH_WRITE.
d. If the query is of type UPDATE/DELETE, the type added is UPDATE/DELETE. [Note: lineage is not available for this, since it is a single-table operation.]
3. When an input is a local dir or HDFS path, it is currently not added to the qualified name. The reason is that partition-based paths would cause a lot of processes to be created instead of updating the same process.
Pending:
Address Shwetha G S's suggestion to add HDFS paths to the process qualified name only for non-partition-based queries. This needs to be done per HiveOperation type:
1. If HiveOperation = LOAD, IMPORT, EXPORT: detect whether the current query context is dealing with partitions, and do not add the path if it is partition based.
2. If HiveOperation = INSERT OVERWRITE DFS_PATH/LOCAL_PATH: detect whether the query context is dealing with a partitioned table in inputs, and decide whether to add the path.


Diffs (updated)
-----

  addons/hive-bridge/src/main/java/org/apache/atlas/hive/bridge/HiveMetaStoreBridge.java c956a32 
  addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java 5d9950f 
  addons/hive-bridge/src/test/java/org/apache/atlas/hive/hook/HiveHookIT.java 5a175e7 
  webapp/src/main/java/org/apache/atlas/web/resources/EntityResource.java 0713d30 

Diff: https://reviews.apache.org/r/48939/diff/


Testing
-------

Existing tests modified to query with new qualified name. Need to add tests for INSERT INTO TABLE


Thanks,

Suma Shivaprasad


Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Suma Shivaprasad <su...@gmail.com>.

> On June 20, 2016, 9:26 a.m., Hemanth Yamijala wrote:
> > addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java, line 763
> > <https://reviews.apache.org/r/48939/diff/2/?file=1423788#file1423788line763>
> >
> >     Do we need a separator between the input set and output set?

This is already taken care of within the if checks: the separator is added before an output dataset entry is appended to the buffer.


- Suma


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138565
-----------------------------------------------------------




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Hemanth Yamijala <yh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138565
-----------------------------------------------------------




addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 732)
<https://reviews.apache.org/r/48939/#comment203760>

    Do we need a separator between the input set and output set?


- Hemanth Yamijala




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Hemanth Yamijala <yh...@gmail.com>.

> On June 20, 2016, 7:25 a.m., Hemanth Yamijala wrote:
> > addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java, line 625
> > <https://reviews.apache.org/r/48939/diff/2/?file=1423788#file1423788line625>
> >
> >     This may be a non-issue, but previously we were passing two independent sets for source & target datasets (as opposed to a single dataSetsProcessed parameter now, which is common between source & target). The impact is that if a dataset is present in both input and output (impossible, hence a non-issue?), it would get captured only once. Further, I see that dataSetsProcessed is not used in the calling function. Hence, consider making it local to this function?

Regarding the latter part of my comment about dataSetsProcessed not being used in the calling function - please ignore that, as I forgot about the outer loop in the calling function.


- Hemanth


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138534
-----------------------------------------------------------




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Suma Shivaprasad <su...@gmail.com>.

> On June 20, 2016, 7:25 a.m., Hemanth Yamijala wrote:
> > addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java, line 625
> > <https://reviews.apache.org/r/48939/diff/2/?file=1423788#file1423788line625>
> >
> >     This may be a non-issue, but previously we were passing two independent sets for source & target datasets (as opposed to a single dataSetsProcessed parameter now, which is common between source & target). The impact is that if a dataset is present in both input and output (impossible, hence a non-issue?), it would get captured only once. Further, I see that dataSetsProcessed is not used in the calling function. Hence, consider making it local to this function?
> 
> Hemanth Yamijala wrote:
>     Regarding the latter part of my comment about dataSetsProcessed not being used in the calling function - please ignore that, as I forgot about the outer loop in the calling function.

Yes, this should be a non-issue, since it is not possible for inputs and outputs to contain the same dataset.
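
To make the concern concrete, here is a rough sketch of how a single shared set behaves compared with two independent sets. The class SharedSetSketch and the method capture are hypothetical and are not taken from HiveHook; only the parameter name dataSetsProcessed comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the deduplication behaviour discussed above.
final class SharedSetSketch {
    /** Returns the datasets that were not already recorded in dataSetsProcessed. */
    static List<String> capture(List<String> datasets, Set<String> dataSetsProcessed) {
        List<String> captured = new ArrayList<>();
        for (String dataset : datasets) {
            // Set.add returns false when the element is already present.
            if (dataSetsProcessed.add(dataset)) {
                captured.add(dataset);
            }
        }
        return captured;
    }
}
```

If the same set is passed for inputs and then for outputs, a dataset that appears on both sides is captured for the inputs only. With two independent sets it would be captured on both sides.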


- Suma


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138534
-----------------------------------------------------------




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Hemanth Yamijala <yh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138534
-----------------------------------------------------------




addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 600)
<https://reviews.apache.org/r/48939/#comment203733>

    This may be a non-issue, but previously we were passing two independent sets for source & target datasets (as opposed to a single dataSetsProcessed parameter now, which is common between source & target). The impact is that if a dataset is present in both input and output (impossible, hence a non-issue?), it would get captured only once. Further, I see that dataSetsProcessed is not used in the calling function. Hence, consider making it local to this function?


- Hemanth Yamijala




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Suma Shivaprasad <su...@gmail.com>.

> On June 20, 2016, 9:05 a.m., Hemanth Yamijala wrote:
> > addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java, line 584
> > <https://reviews.apache.org/r/48939/diff/2/?file=1423788#file1423788line584>
> >
> >     Is it safe to rely on the equals / hashcode of Entity to serve as key? If you've analyzed this and feel it is fine, please do close the issue.

Yes. Entity uses its name for equals/hashCode, and that name is a qualified name:

private String computeName() {
    switch (typ) {
    case DATABASE:
      return "database:" + database.getName();
    case TABLE:
      return t.getDbName() + "@" + t.getTableName();
    case PARTITION:
      return t.getDbName() + "@" + t.getTableName() + "@" + p.getName();
    case DUMMYPARTITION:
      return p.getName();
    case FUNCTION:
      if (database != null) {
        return database.getName() + "." + stringObject;
      }
      return stringObject;
    default:
      return d.toString();
    }
}
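
A minimal sketch of why a name-based equals/hashCode makes such an object usable as a map key. NamedEntity and EntityKeyDemo are hypothetical stand-ins written for illustration, not Hive's Entity class; the only assumption carried over is that equality is defined by the computed name (here the db@table form of the TABLE case).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: equality and hash code depend only on the computed name.
final class NamedEntity {
    private final String name;

    NamedEntity(String dbName, String tableName) {
        // Same shape as the TABLE case: db@table
        this.name = dbName + "@" + tableName;
    }

    String getName() {
        return name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof NamedEntity)) {
            return false;
        }
        return name.equals(((NamedEntity) o).name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }
}

final class EntityKeyDemo {
    // Two distinct instances with the same name collapse into one map entry.
    static Map<NamedEntity, String> index() {
        Map<NamedEntity, String> byEntity = new HashMap<>();
        byEntity.put(new NamedEntity("default", "t1"), "first");
        byEntity.put(new NamedEntity("default", "t1"), "second"); // same key, value replaced
        byEntity.put(new NamedEntity("default", "t2"), "third");
        return byEntity;
    }
}
```

Because lookups only depend on the name, a freshly constructed NamedEntity("default", "t1") finds the entry stored under an earlier instance with the same name.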


- Suma


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138554
-----------------------------------------------------------




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Hemanth Yamijala <yh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138554
-----------------------------------------------------------




addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 565)
<https://reviews.apache.org/r/48939/#comment203752>

    Is it safe to rely on the equals / hashcode of Entity to serve as key? If you've analyzed this and feel it is fine, please do close the issue.



addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 607)
<https://reviews.apache.org/r/48939/#comment203747>

    Not an issue with this patch, but does the check for multiple paths need to be present here as well - i.e. do we need to populate & use dataSetsProcessed here?



addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 695)
<https://reviews.apache.org/r/48939/#comment203748>

    Shouldn't this be normalized to at least lowercase?
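
    The value at line 695 is not shown in this thread, so the following is only a generic sketch of case normalization. NameNormalization and normalize are hypothetical names, and Locale.ROOT is used so that the result does not depend on the JVM's default locale.

```java
import java.util.Locale;

final class NameNormalization {
    // Lowercases a name in a locale-independent way; null is passed through.
    static String normalize(String name) {
        return name == null ? null : name.toLowerCase(Locale.ROOT);
    }
}
```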



addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 703)
<https://reviews.apache.org/r/48939/#comment203749>

    The IDE flags this as an always-true condition. Shouldn't the || be &&? Also, the variable of interest is probably source, source.values, or similar.
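
    The condition at line 703 is not reproduced in this thread, so this is a generic illustration of the pattern rather than the patch's code. AlwaysTrueCondition and its methods are hypothetical: two "not equal" checks on the same variable joined with || can never be false, while && expresses "neither of these".

```java
final class AlwaysTrueCondition {
    // Always true: a single value cannot equal both "a" and "b" at once,
    // so at least one of the two negated checks holds.
    static boolean withOr(String type) {
        return !"a".equals(type) || !"b".equals(type);
    }

    // True only when the value is neither "a" nor "b".
    static boolean withAnd(String type) {
        return !"a".equals(type) && !"b".equals(type);
    }
}
```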



addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java (line 738)
<https://reviews.apache.org/r/48939/#comment203751>

    Consider making it a constant.


- Hemanth Yamijala




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Hemanth Yamijala <yh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/#review138562
-----------------------------------------------------------




addons/hive-bridge/src/main/java/org/apache/atlas/hive/bridge/HiveMetaStoreBridge.java (line 50)
<https://reviews.apache.org/r/48939/#comment203755>

    This seems unused actually.


- Hemanth Yamijala




Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Suma Shivaprasad <su...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/
-----------------------------------------------------------

(Updated June 20, 2016, 4 a.m.)


Review request for atlas, Shwetha GS and Hemanth Yamijala.


Bugs: ATLAS-904
    https://issues.apache.org/jira/browse/ATLAS-904


Repository: atlas


Description
-------

1. Process qualified name = HiveOperation.name + sorted inputs + sorted outputs
2. HiveOperation.name doesn't provide identifiers for distinguishing INSERT, INSERT_OVERWRITE, UPDATE, DELETE etc. separately. Hence WriteEntity.WriteType is added as well, with the following behaviour:
a. If there are multiple outputs, adds the query type (WriteType) for each output.
b. If the query is of type INSERT [INTO/OVERWRITE] TABLE [PARTITION], WriteType is INSERT/INSERT_OVERWRITE.
c. If the query is of type INSERT OVERWRITE hdfs_path, adds WriteType as PATH_WRITE.
d. If the query is of type UPDATE/DELETE, adds type as UPDATE/DELETE. [Note: lineage is not available for this since it is a single-table operation.]
3. When the input is a local dir or HDFS path, it is currently not added to the qualified name. The reason is that partition-based paths would cause a lot of processes to be created instead of updating the same process.
Pending:
Address Shwetha G S's suggestion to add HDFS paths to the process qualified name only for non-partition-based queries. This needs to be done per HiveOperation type:
1. If HiveOperation = LOAD, IMPORT, EXPORT: detect whether the current query context is dealing with partitions, and do not add the path if it is partition based.
2. If HiveOperation = INSERT OVERWRITE DFS_PATH/LOCAL_PATH: detect whether the query context has a partitioned table in its inputs, and decide whether the path needs to be added.
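
The naming scheme in points 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: ProcessQualifiedNameSketch, build, output, and the ":" and "->" separators are all hypothetical, and inputs and outputs are plain strings here rather than Hive ReadEntity/WriteEntity objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ProcessQualifiedNameSketch {
    // Hypothetical separators; the real patch may format names differently.
    private static final String SEP = ":";
    private static final String IO_SEP = "->";

    // Tags an output with its write type, e.g. "default@t:INSERT".
    static String output(String name, String writeType) {
        return name + SEP + writeType;
    }

    // operation name + sorted inputs + sorted outputs (each output tagged with its write type)
    static String build(String operationName, List<String> inputs, List<String> taggedOutputs) {
        List<String> sortedInputs = new ArrayList<>(inputs);
        List<String> sortedOutputs = new ArrayList<>(taggedOutputs);
        Collections.sort(sortedInputs);   // sorting makes the name independent of input order
        Collections.sort(sortedOutputs);

        StringBuilder name = new StringBuilder(operationName);
        for (String in : sortedInputs) {
            name.append(SEP).append(in);
        }
        name.append(IO_SEP);
        for (String out : sortedOutputs) {
            name.append(SEP).append(out);
        }
        return name.toString();
    }
}
```

With this shape, the same query run with its inputs listed in a different order maps to the same process name, while an INSERT and an INSERT_OVERWRITE into the same table map to different names.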


Diffs (updated)
-----

  addons/hive-bridge/src/main/java/org/apache/atlas/hive/bridge/HiveMetaStoreBridge.java c956a32 
  addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java 23c82df 
  addons/hive-bridge/src/test/java/org/apache/atlas/hive/hook/HiveHookIT.java e7fbf71 
  webapp/src/main/java/org/apache/atlas/web/resources/EntityResource.java 0713d30 

Diff: https://reviews.apache.org/r/48939/diff/


Testing
-------

Existing tests were modified to query with the new qualified name. Tests for INSERT INTO TABLE still need to be added.


Thanks,

Suma Shivaprasad


Re: Review Request 48939: ATLAS-904 Handle process qualified name per Hive Operation

Posted by Suma Shivaprasad <su...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48939/
-----------------------------------------------------------

(Updated June 20, 2016, 4 a.m.)


Review request for atlas, Shwetha GS and Hemanth Yamijala.


Bugs: ATLAS-904
    https://issues.apache.org/jira/browse/ATLAS-904


Repository: atlas


Description
-------

1. Process qualified name = HiveOperation.name + sorted inputs + sorted outputs
2. HiveOperation.name doesn't provide identifiers for distinguishing INSERT, INSERT_OVERWRITE, UPDATE, DELETE etc. separately. Hence WriteEntity.WriteType is added as well, with the following behaviour:
a. If there are multiple outputs, adds the query type (WriteType) for each output.
b. If the query is of type INSERT [INTO/OVERWRITE] TABLE [PARTITION], WriteType is INSERT/INSERT_OVERWRITE.
c. If the query is of type INSERT OVERWRITE hdfs_path, adds WriteType as PATH_WRITE.
d. If the query is of type UPDATE/DELETE, adds type as UPDATE/DELETE. [Note: lineage is not available for this since it is a single-table operation.]
3. When the input is a local dir or HDFS path, it is currently not added to the qualified name. The reason is that partition-based paths would cause a lot of processes to be created instead of updating the same process.
Pending:
Address Shwetha G S's suggestion to add HDFS paths to the process qualified name only for non-partition-based queries. This needs to be done per HiveOperation type:
1. If HiveOperation = LOAD, IMPORT, EXPORT: detect whether the current query context is dealing with partitions, and do not add the path if it is partition based.
2. If HiveOperation = INSERT OVERWRITE DFS_PATH/LOCAL_PATH: detect whether the query context has a partitioned table in its inputs, and decide whether the path needs to be added.


Diffs
-----

  addons/hive-bridge/src/main/java/org/apache/atlas/hive/bridge/HiveMetaStoreBridge.java c956a32 
  addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java 23c82df 
  addons/hive-bridge/src/test/java/org/apache/atlas/hive/hook/HiveHookIT.java e7fbf71 
  webapp/src/main/java/org/apache/atlas/web/resources/EntityResource.java 0713d30 

Diff: https://reviews.apache.org/r/48939/diff/


Testing
-------

Existing tests were modified to query with the new qualified name. Tests for INSERT INTO TABLE still need to be added.


Thanks,

Suma Shivaprasad