Posted to dev@atlas.apache.org by GitBox <gi...@apache.org> on 2020/03/05 19:41:22 UTC

[GitHub] [atlas] vladhlinsky opened a new pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

vladhlinsky opened a new pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91
 
 
   ## What changes were proposed in this pull request?
   
   Create a `spark_application` type to prevent `spark_process` from being updated for multiple operations. Currently, the Spark Atlas Connector uses `spark_process` as the top-level type for a Spark session, so the same entity is updated by every operation within that session.
   
   The following statements:
   ```
   spark.sql("create table table_1(col1 int,col2 string)");
   spark.sql("create table table_2 as select * from table_1");
   ```
   result in the following correct lineage:
   ```
   table_1 ------> spark_process1 -------> table_2
   ```
   but executing similar statements in the same Spark session:
   ```
   spark.sql("create table table_3(col1 int,col2 string)"); 
   spark.sql("create table table_4 as select * from table_3");
   ```
   results in the same `spark_process` being updated, so the lineage now connects all four tables.
   The proposal is to create a `spark_application` entity and associate all `spark_process` entities created within that session with it.
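
The merging effect can be illustrated with a small, self-contained model. This is only a sketch: the `Process` class, the `downstream_tables` helper, and the entity names are hypothetical and do not reflect the actual Spark Atlas Connector or Atlas code. It treats a process as a set of inputs and outputs and shows that reusing one process for both CTAS statements links the tables together, whereas one process per statement keeps the lineages apart.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical stand-in for a lineage process entity (not the Atlas API)."""
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def downstream_tables(table, processes):
    """Return every table reachable from `table` by following input -> output edges."""
    reached, frontier = set(), [table]
    while frontier:
        current = frontier.pop()
        for proc in processes:
            if current in proc.inputs:
                for out in proc.outputs - reached:
                    reached.add(out)
                    frontier.append(out)
    return reached


# Current behaviour: one process per session, updated by every statement.
shared = Process("spark_process_session")
for src, dst in [("table_1", "table_2"), ("table_3", "table_4")]:
    shared.inputs.add(src)
    shared.outputs.add(dst)

# Proposed behaviour: one process per statement, grouped under an application.
separate = [
    Process("spark_process_1", {"table_1"}, {"table_2"}),
    Process("spark_process_2", {"table_3"}, {"table_4"}),
]
application = {"name": "spark_application_1", "processes": [p.name for p in separate]}

print(sorted(downstream_tables("table_1", [shared])))   # ['table_2', 'table_4']
print(sorted(downstream_tables("table_1", separate)))   # ['table_2']
```

With the shared process, `table_1` appears to feed `table_4` even though no statement reads one into the other; with separate processes it only feeds `table_2`.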
   
   ## How was this patch tested?
   
   Manually, using a modified version of the Spark Atlas Connector:
   - Installed and started Atlas.
   - Executed the following statements using spark-shell:
   
   ```
   spark.sql("create table table_1_17(col1 int,col2 string)");
   spark.sql("create table table_2_17 as select * from table_1_17");
   spark.sql("create table table_3_17(col1 int,col2 string)");
   spark.sql("create table table_4_17 as select * from table_3_17");
   ```
   
   - Verified that all 4 entities are connected in Atlas lineage.
   - Updated `1100-spark_model.json` with the proposed changes.
   - Once again executed similar statements:
   
   ```
   spark.sql("create table table_1_37(col1 int,col2 string)");
   spark.sql("create table table_2_37 as select * from table_1_37");
   spark.sql("create table table_3_37(col1 int,col2 string)");
   spark.sql("create table table_4_37 as select * from table_3_37");
   ```
   
   - Verified that two `spark_process` entities are created,
   both of which have the same single `spark_application` entity as their `application`.
   Each of these processes has its own lineage.
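
The final check could also be expressed programmatically. The sketch below runs on hand-written dictionaries rather than on a live Atlas instance; the entity layout, the helper functions, and the names `process_a`, `process_b`, and `app_1` are assumptions made for illustration, and only the `application` attribute name comes from this pull request.

```python
def applications_of(processes):
    """Collect the distinct application names referenced by the given process entities."""
    return {p["application"] for p in processes}


def lineages_are_disjoint(processes):
    """True if no table appears in the lineage of more than one process."""
    seen = set()
    for p in processes:
        tables = set(p["inputs"]) | set(p["outputs"])
        if tables & seen:
            return False
        seen |= tables
    return True


# Hypothetical entities mirroring the second test run above.
processes = [
    {"name": "process_a", "application": "app_1",
     "inputs": ["table_1_37"], "outputs": ["table_2_37"]},
    {"name": "process_b", "application": "app_1",
     "inputs": ["table_3_37"], "outputs": ["table_4_37"]},
]

assert len(processes) == 2                       # two spark_process entities
assert applications_of(processes) == {"app_1"}   # one shared application
assert lineages_are_disjoint(processes)          # each process has its own lineage
```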

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [atlas] fpompermaier commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
fpompermaier commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#issuecomment-596976025
 
 
   > @fpompermaier thank you for your comment.
   > 
   > No, they are not swapped. A relationship definition specifies relationship attribute on a type. This means that `spark_application_processes` define `application` relationship attribute on the `spark_process` type and `processes` attribute on the `spark_application` type. [Please, refer screenshots for an example.](https://github.com/apache/atlas/pull/91#issuecomment-595413075)
   
   OK! Thanks for the clarification! I'm quite new to Atlas.


[GitHub] [atlas] sarathsubramanian commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
sarathsubramanian commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#issuecomment-595895748
 
 
   Changes look good. +1


[GitHub] [atlas] vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388793497
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -305,6 +305,34 @@
         }
       ]
     },
+    {
+      "name": "spark_application",
+      "superTypes": [
+        "Process"
+      ],
+      "serviceType": "spark",
+      "typeVersion": "1.0",
+      "attributeDefs": [
+        {
+          "name": "currUser",
+          "typeName": "string",
+          "isOptional": true,
+          "cardinality": "SINGLE",
+          "isUnique": false,
+          "isIndexable": false,
+          "searchWeight": 10
+        },
+        {
+          "name": "remoteUser",
+          "typeName": "string",
+          "isOptional": true,
+          "cardinality": "SINGLE",
+          "isUnique": false,
+          "isIndexable": false,
 
 Review comment:
   Changed.


[GitHub] [atlas] sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388528238
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -305,6 +305,34 @@
         }
       ]
     },
+    {
+      "name": "spark_application",
+      "superTypes": [
+        "Process"
+      ],
+      "serviceType": "spark",
+      "typeVersion": "1.0",
+      "attributeDefs": [
+        {
+          "name": "currUser",
+          "typeName": "string",
+          "isOptional": true,
+          "cardinality": "SINGLE",
+          "isUnique": false,
+          "isIndexable": false,
+          "searchWeight": 10
+        },
+        {
+          "name": "remoteUser",
+          "typeName": "string",
+          "isOptional": true,
+          "cardinality": "SINGLE",
+          "isUnique": false,
+          "isIndexable": false,
 
 Review comment:
   isIndexable => true


[GitHub] [atlas] vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388793433
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -305,6 +305,34 @@
         }
       ]
     },
+    {
+      "name": "spark_application",
+      "superTypes": [
+        "Process"
+      ],
+      "serviceType": "spark",
+      "typeVersion": "1.0",
+      "attributeDefs": [
+        {
+          "name": "currUser",
+          "typeName": "string",
+          "isOptional": true,
+          "cardinality": "SINGLE",
+          "isUnique": false,
+          "isIndexable": false,
 
 Review comment:
   Changed. Thanks!


[GitHub] [atlas] nixonrodrigues commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
nixonrodrigues commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#issuecomment-599382377
 
 
   @vladhlinsky ,
   
    This PR has conflicts. Can you please rebase onto master and update the PR?


[GitHub] [atlas] sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388527692
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -470,6 +498,24 @@
         "cardinality": "SINGLE"
       },
       "propagateTags": "NONE"
+    },
+    {
+      "name": "spark_application_process",
 
 Review comment:
   "spark_application_process" => "spark_application_processes"


[GitHub] [atlas] vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388546216
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -470,6 +498,24 @@
         "cardinality": "SINGLE"
       },
       "propagateTags": "NONE"
+    },
+    {
+      "name": "spark_application_process",
 
 Review comment:
   Renamed.


[GitHub] [atlas] vladhlinsky commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#issuecomment-595413162
 
 
   cc @HeartSaVioR @sarathsubramanian


[GitHub] [atlas] nixonrodrigues merged pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
nixonrodrigues merged pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91
 
 
   


[GitHub] [atlas] vladglinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladglinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388545586
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -305,6 +305,34 @@
         }
       ]
     },
+    {
+      "name": "spark_application",
+      "superTypes": [
+        "Process"
+      ],
+      "serviceType": "spark",
+      "typeVersion": "1.0",
+      "attributeDefs": [
+        {
+          "name": "currUser",
 
 Review comment:
   Thanks! Changed to "currentUser".


[GitHub] [atlas] vladhlinsky commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#issuecomment-595413075
 
 
   Attaching screenshots.
   - Installed and started Atlas.
   - Executed the following statements using spark-shell:
   
   ```
   spark.sql("create table table_1_17(col1 int,col2 string)");
   spark.sql("create table table_2_17 as select * from table_1_17");
   spark.sql("create table table_3_17(col1 int,col2 string)");
   spark.sql("create table table_4_17 as select * from table_3_17");
   ```
   - Verified that all 4 entities are connected in Atlas lineage.
   ![Screenshot from 2020-02-27 19-31-09](https://user-images.githubusercontent.com/61428392/76019361-42ab2900-5f2a-11ea-960b-192cb0d00638.png)
   
   - Updated `1100-spark_model.json` with the proposed changes.
   - Once again executed similar statements:
   
   ```
   spark.sql("create table table_1_37(col1 int,col2 string)");
   spark.sql("create table table_2_37 as select * from table_1_37");
   spark.sql("create table table_3_37(col1 int,col2 string)");
   spark.sql("create table table_4_37 as select * from table_3_37");
   ```
   
   - Verified that two `spark_process` entities are created,
   both of which have the same single `spark_application` entity as their `application`.
   Each of these processes has its own lineage.
   
   ![Screenshot from 2020-03-04 23-16-44](https://user-images.githubusercontent.com/61428392/76019494-78e8a880-5f2a-11ea-856d-7bb7b8415412.png)
   ![Screenshot from 2020-03-04 23-17-02](https://user-images.githubusercontent.com/61428392/76019557-93228680-5f2a-11ea-9ce4-6d89a87ce94e.png)
   ![Screenshot from 2020-03-04 23-17-10](https://user-images.githubusercontent.com/61428392/76019509-8140e380-5f2a-11ea-9610-53b54dbffd8f.png)
   ![Screenshot from 2020-03-04 23-17-34](https://user-images.githubusercontent.com/61428392/76019579-9cabee80-5f2a-11ea-9684-16a38029d611.png)
   ![Screenshot from 2020-03-04 23-19-56](https://user-images.githubusercontent.com/61428392/76019598-a46b9300-5f2a-11ea-9995-24732d740c08.png)
   


[GitHub] [atlas] vladhlinsky commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#issuecomment-596974526
 
 
   @fpompermaier thank you for your comment.
   
   No, they are not swapped. A relationship definition specifies a relationship attribute on a type. This means that `spark_application_processes` defines the `application` relationship attribute on the `spark_process` type and the `processes` attribute on the `spark_application` type. [Please refer to the screenshots for an example.](https://github.com/apache/atlas/pull/91#issuecomment-595413075)
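
The mapping described above can be sketched in a few lines. This is an illustration, not the contents of `1100-spark_model.json`: the `endDef1`/`endDef2` layout is an assumed shape following the usual Atlas convention that each end's `name` becomes an attribute on that end's `type`, and `relationship_attributes` is a hypothetical helper. Only the relationship name, the two type names, and the two attribute names come from this thread.

```python
# Assumed shape of the relationship definition; only the relationship name,
# the two type names, and the two attribute names come from this thread.
relationship_def = {
    "name": "spark_application_processes",
    "endDef1": {"type": "spark_application", "name": "processes"},
    "endDef2": {"type": "spark_process", "name": "application"},
}


def relationship_attributes(rel_def):
    """Map each end's type to the relationship attribute defined on that type."""
    return {
        end["type"]: end["name"]
        for end in (rel_def["endDef1"], rel_def["endDef2"])
    }


attrs = relationship_attributes(relationship_def)
print(attrs["spark_process"])      # application
print(attrs["spark_application"])  # processes
```

Read this way, a `spark_process` entity points to its `application`, and a `spark_application` entity lists its `processes`.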


[GitHub] [atlas] sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388528136
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -305,6 +305,34 @@
         }
       ]
     },
+    {
+      "name": "spark_application",
+      "superTypes": [
+        "Process"
+      ],
+      "serviceType": "spark",
+      "typeVersion": "1.0",
+      "attributeDefs": [
+        {
+          "name": "currUser",
+          "typeName": "string",
+          "isOptional": true,
+          "cardinality": "SINGLE",
+          "isUnique": false,
+          "isIndexable": false,
 
 Review comment:
   isIndexable => true


[GitHub] [atlas] sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
sarathsubramanian commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388527337
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -305,6 +305,34 @@
         }
       ]
     },
+    {
+      "name": "spark_application",
+      "superTypes": [
+        "Process"
+      ],
+      "serviceType": "spark",
+      "typeVersion": "1.0",
+      "attributeDefs": [
+        {
+          "name": "currUser",
 
 Review comment:
   consider renaming "currUser" to "currentUser"


[GitHub] [atlas] vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on a change in pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#discussion_r388546052
 
 

 ##########
 File path: addons/models/1000-Hadoop/1100-spark_model.json
 ##########
 @@ -305,6 +305,34 @@
         }
       ]
     },
+    {
+      "name": "spark_application",
+      "superTypes": [
+        "Process"
+      ],
+      "serviceType": "spark",
+      "typeVersion": "1.0",
+      "attributeDefs": [
+        {
+          "name": "currUser",
 
 Review comment:
   Thanks! Changed to "currentUser".



[GitHub] [atlas] fpompermaier commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations

Posted by GitBox <gi...@apache.org>.
fpompermaier commented on issue #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91#issuecomment-596095570
 
 
   Aren't the 2 names of "type": "spark_application", "name": "application" and "type": "spark_process", "name": "application" swapped?
