Posted to dev@atlas.apache.org by GitBox <gi...@apache.org> on 2020/02/27 11:14:58 UTC

[GitHub] [atlas] vladhlinsky opened a new pull request #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

vladhlinsky opened a new pull request #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88
 
 
   ## What changes were proposed in this pull request?
   
   Update the `spark_ml_model_ml_directory` and `spark_ml_pipeline_ml_directory` relationship definitions to use the `DataSet` type instead of its child type `spark_ml_directory`. This is required in order to integrate Spark Atlas Connector's ML event processor.
   Previously, Spark Atlas Connector used the `spark_ml_directory` model for the ML model directory, but this was changed in the scope of https://github.com/hortonworks-spark/spark-atlas-connector/issues/61 and https://github.com/hortonworks-spark/spark-atlas-connector/pull/62, so the ML model directory is now a `DataSet` entity (i.e. `hdfs_path`, `fs_path` or `aws_s3_object`).
   Thus, the relationship definitions must be updated; otherwise, an attempt to create the relationship leads to:
   ```
   org.apache.atlas.exception.AtlasBaseException: invalid relationshipDef: spark_ml_model_ml_directory: end type 1: spark_ml_directory, end type 2: spark_ml_model
   ```
   since `COMPOSITION` requires `spark_ml_directory` to be set.
   
   The proposed changes are safe for old clients since `DataSet` is the parent type of `spark_ml_directory`.
   
   ## How was this patch tested?
   
   Manually and with unit tests.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [atlas] vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88#issuecomment-593356727
 
 
   Closing this PR since there is no straightforward way to update the `spark_ml_model_ml_directory` and `spark_ml_pipeline_ml_directory` relationship definitions to use the `DataSet` type instead of its child type `spark_ml_directory`.
   
   Opened a new PR to create new relationship definitions: 
   - https://issues.apache.org/jira/browse/ATLAS-3646
   - https://github.com/apache/atlas/pull/89


[GitHub] [atlas] vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88#issuecomment-591938711
 
 
   cc @sarathsubramanian 


[GitHub] [atlas] vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88#issuecomment-592703591
 
 
   I created the following functions in order to test proposed changes without Spark Atlas Connector:
   ```
   function create_ml_directory(){
     NAME=$1
     # Milliseconds since the epoch (%N requires GNU date)
     TIMESTAMP=$(($(date +%s%N)/1000000))
     ML_DIR="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
     \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
     \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
     \"entities\":{\"entities\":[{\"typeName\":\"spark_ml_directory\",\"attributes\":
     {\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\",\"uri\":\"hdfs://\",\"directory\":\"/test\"},
     \"isIncomplete\":false,\"provenanceType\":0,\"version\":0,\"proxy\":false}]}}}"
     # Left unquoted on purpose: word splitting joins the lines into a single-line message
     echo $ML_DIR | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
   }
   
   function create_ml_model(){
     NAME=$1
     DIR_TYPE=$2
     DIR_NAME=$3
     TIMESTAMP=$(($(date +%s%N)/1000000))
     ML_MODEL="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
     \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
     \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
     \"entities\":{\"entities\":[{\"typeName\":\"spark_ml_model\",\"attributes\":
     {\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\"},\"isIncomplete\":false,\"provenanceType\":0,
     \"version\":0,\"relationshipAttributes\":{\"directory\":{\"typeName\":\"$DIR_TYPE\",
     \"uniqueAttributes\":{\"qualifiedName\":\"$DIR_NAME\"}}},\"proxy\":false}]}}}"
     echo $ML_MODEL | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
   }
   
   function create_ml_pipeline(){
     NAME=$1
     DIR_TYPE=$2
     DIR_NAME=$3
     TIMESTAMP=$(($(date +%s%N)/1000000))
     # Same envelope as the other functions, so the braces balance and TIMESTAMP is used
     ML_PIPELINE="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
     \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
     \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
     \"entities\":{\"entities\":[{\"typeName\":\"spark_ml_pipeline\",\"attributes\":
     {\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\"},\"isIncomplete\":false,\"provenanceType\":0,
     \"version\":0,\"relationshipAttributes\":{\"directory\":{\"typeName\":\"$DIR_TYPE\",
     \"uniqueAttributes\":{\"qualifiedName\":\"$DIR_NAME\"}}},\"proxy\":false}]}}}"
     echo $ML_PIPELINE | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
   }
   
   function create_hdfs_path(){
     NAME=$1
     TIMESTAMP=$(($(date +%s%N)/1000000))
     HDFS_PATH="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
     \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
     \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
     \"entities\":{\"entities\":[{\"typeName\":\"hdfs_path\",\"attributes\":{\"path\":\"$NAME\",
     \"qualifiedName\":\"$NAME\",\"clusterName\":\"test\",\"name\":\"$NAME\"},\"isIncomplete\":false,
     \"provenanceType\":0,\"version\":0,\"proxy\":false}]}}}"
     echo $HDFS_PATH | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
   }
   ```
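   
   Hand-escaped JSON like the above is easy to get wrong. As a rough sketch only (the helper names `build_hook_message` and `hdfs_entity` are hypothetical and were not part of the original test), the same envelope could be assembled with `printf` and single quotes, which avoids backslash escaping and always emits a single line:
   ```shell
   # Hypothetical helper (not from the original test): wrap one entity JSON
   # fragment in the notification envelope that the functions above build by hand.
   build_hook_message() {
     local entity_json=$1
     local timestamp
     # Milliseconds since the epoch, derived from whole seconds so it also
     # works where date does not support %N.
     timestamp=$(( $(date +%s) * 1000 ))
     # Single quotes keep the JSON literal; the two %s slots receive the
     # timestamp and the entity fragment.
     printf '{"version":{"version":"1.0.0","versionParts":[1]},"msgCompressionKind":"NONE","msgSplitIdx":1,"msgSplitCount":1,"msgSourceIP":"172.27.12.6","msgCreatedBy":"test","msgCreationTime":%s,"message":{"type":"ENTITY_CREATE_V2","user":"test","entities":{"entities":[%s]}}}\n' "$timestamp" "$entity_json"
   }

   # Hypothetical helper: an hdfs_path entity with the same attributes that
   # create_hdfs_path sends.
   hdfs_entity() {
     local name=$1
     printf '{"typeName":"hdfs_path","attributes":{"path":"%s","qualifiedName":"%s","clusterName":"test","name":"%s"},"isIncomplete":false,"provenanceType":0,"version":0,"proxy":false}' "$name" "$name" "$name"
   }
   ```
   The output of `build_hook_message "$(hdfs_entity path)"` can be piped to the same `kafka-console-producer.sh` command used in the functions above.
   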
   The cases below work fine for the new relationship defs that use the `directory` name:
   ```
   create_ml_directory mldir
   create_ml_model model_with_mldir spark_ml_directory mldir
   
   
   create_hdfs_path path
   create_ml_model model_with_path hdfs_path path
   
   
   create_ml_model model_with_mldir hdfs_path path
   
   create_ml_model model_with_path spark_ml_directory mldir
   
   
   create_ml_directory mldir2
   create_ml_pipeline pipeline_with_mldir spark_ml_directory mldir2
   ```
   but the next case fails as described in the previous comment: 
   ```
   create_hdfs_path path2
   create_ml_pipeline pipeline_with_path hdfs_path path2
   ```
   
   **I think the best way to resolve this is to create a new relationship with a different name:**
   ```
       {
         "name": "spark_ml_model_dataset",
         "serviceType": "spark",
         "typeVersion": "1.0",
         "relationshipCategory": "AGGREGATION",
         "endDef1": {
           "type": "spark_ml_model",
           "name": "dataset",
           "isContainer": true,
           "cardinality": "SINGLE"
         },
         "endDef2": {
           "type": "DataSet",
           "name": "model",
           "isContainer": false,
           "cardinality": "SINGLE"
         },
         "propagateTags": "NONE"
       },
    ...
   ```
   


[GitHub] [atlas] HeartSaVioR commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88#issuecomment-592218187
 
 
   It'd be nice to elaborate on how the manual test was done; in particular, verify the case where the 1.0 version of the Spark models is installed and the Spark models are then upgraded to 1.1.


[GitHub] [atlas] vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88#issuecomment-591929917
 
 
   cc @HeartSaVioR


[GitHub] [atlas] vladhlinsky closed pull request #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

Posted by GitBox <gi...@apache.org>.
vladhlinsky closed pull request #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88
 
 
   


[GitHub] [atlas] vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions

Posted by GitBox <gi...@apache.org>.
vladhlinsky commented on issue #88: ATLAS-3640 Update 'spark_ml_model_ml_directory' and 'spark_ml_pipeline_ml_directory' relationship definitions
URL: https://github.com/apache/atlas/pull/88#issuecomment-592699723
 
 
   Thanks, @HeartSaVioR!
   
   As it turned out, the proposed changes do not work correctly for an upgrade to Spark models 1.1.
   I tested the changes only for a new installation (with no existing entities in HBase).
   
   I guess `relationshipCategory` can be updated only via a patch. An attempt to upgrade with the proposed changes leads to:
   ```
   2020-02-28 09:57:11,783 ERROR - [main:] ~ graph rollback due to exception  (GraphTransactionInterceptor:167)
   org.apache.atlas.exception.AtlasBaseException: invalid  update for relationship spark_ml_model_ml_directory: new relationshipDef category AGGREGATION, existing relationshipDef category COMPOSITION
           at org.apache.atlas.repository.store.graph.v2.AtlasRelationshipDefStoreV2.preUpdateCheck(AtlasRelationshipDefStoreV2.java:432)
   ```
   It's possible to use the following patch to update this property:
   ```
   {
       "patches": [
           {
               "id":              "TYPEDEF_PATCH_1000_015_001",
               "description":     "Update relationshipCategory to AGGREGATION",
               "action":          "REMOVE_LEGACY_REF_ATTRIBUTES",
               "typeName":        "spark_ml_model_ml_directory",
               "applyToVersion":  "1.0",
               "updateToVersion": "1.1",
               "params": {
                   "relationshipCategory": "AGGREGATION"
               }
           },
           ...
           }
       ]
   }
   
   ```
   However, there is no way to update the `endDefs` types. I cannot find a patch action for this purpose, and an attempt to update them directly in the model file leads to:
   ```
   2020-02-28 12:14:05,151 INFO  - [main:] ~ GraphTransaction intercept for org.apache.atlas.repository.store.graph.v2.AtlasTypeDefGraphStoreV2.createUpdateTypesDef (GraphTransactionAdvisor$1:41)
   2020-02-28 12:14:05,213 ERROR - [main:] ~ graph rollback due to exception  (GraphTransactionInterceptor:167)
   org.apache.atlas.exception.AtlasBaseException: invalid update for relationshipDef spark_ml_model_ml_directory: new end2 AtlasRelationshipEndDef{type='DataSet', name==>'model', description==>'null', isContainer==>'false', cardinality==>'SINGLE', isLegacyAttribute==>'false'}, existing end2 AtlasRelationshipEndDef{type='spark_ml_directory', name==>'model', description==>'null', isContainer==>'false', cardinality==>'SINGLE', isLegacyAttribute==>'false'}
           at org.apache.atlas.repository.store.graph.v2.AtlasRelationshipDefStoreV2.preUpdateCheck(AtlasRelationshipDefStoreV2.java:457)
   
   ```
   
   Thus, it seems that the safest way to resolve this issue is **creating a new relationship**.
   I tried adding the following relationship defs, which use the same name `directory`:
   ```
       {
         "name": "spark_ml_model_dataset",
         "serviceType": "spark",
         "typeVersion": "1.0",
         "relationshipCategory": "AGGREGATION",
         "endDef1": {
           "type": "spark_ml_model",
           "name": "directory",
           "isContainer": true,
           "cardinality": "SINGLE"
         },
         "endDef2": {
           "type": "DataSet",
           "name": "model",
           "isContainer": false,
           "cardinality": "SINGLE"
         },
         "propagateTags": "NONE"
       },
       {
         "name": "spark_ml_pipeline_dataset",
         "serviceType": "spark",
         "typeVersion": "1.0",
         "relationshipCategory": "AGGREGATION",
         "endDef1": {
           "type": "spark_ml_pipeline",
           "name": "directory",
           "isContainer": true,
           "cardinality": "SINGLE"
         },
         "endDef2": {
           "type": "DataSet",
           "name": "pipeline",
           "isContainer": false,
           "cardinality": "SINGLE"
         },
         "propagateTags": "NONE"
       }
   ``` 
   
   and it works perfectly fine for `spark_ml_model` but fails for `spark_ml_pipeline` with the following error:
   ```
   2020-02-28 21:34:00,933 WARN  - [NotificationHookConsumer thread-0:] ~ Max retries exceeded for message {"version":{"version":"1.0.0","versionParts":[1]},"msgCompressionKind":"NONE","msgSplitIdx":1,"msgSplitCount":1,"msgCreationTime":1582918440918,"message":{"type":"ENTITY_CREATE_V2","user":"test","entities":{"entities":[{"typeName":"spark_ml_model","attributes":{"qualifiedName":"model_with_path8","name":"model_with_path8"},"guid":"-386799758271978","isIncomplete":false,"provenanceType":0,"version":0,"relationshipAttributes":{"directory":{"typeName":"hdfs_path","uniqueAttributes":{"qualifiedName":"path8"}}},"proxy":false}]}}} (NotificationHookConsumer$HookConsumer:793)
   org.apache.atlas.exception.AtlasBaseException: invalid relationshipDef: spark_ml_model_ml_directory: end type 1: spark_ml_directory, end type 2: spark_ml_model
   	at org.apache.atlas.repository.store.graph.v2.AtlasRelationshipStoreV2.validateRelationship(AtlasRelationshipStoreV2.java:657)
   
   ```
   
   Debugging shows that [AtlasEntityUtil.getRelationshipType](https://github.com/apache/atlas/blob/master/repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphMapper.java#L659) returns `null` for the `hdfs_path` (which is a child of `DataSet`) attribute, and this makes [entityType.getRelationshipAttribute](https://github.com/apache/atlas/blob/master/intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java#L459) return the first value of the HashMap.
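   
   The fallback described above can be modelled in a few lines of shell. This is a simplified sketch, not Atlas code: `pick_relationship` is a hypothetical name, and the candidate list stands in for the map of relationship definitions that share the `directory` attribute name. When no relationship type is given, the first candidate wins, so the result depends only on iteration order:
   ```shell
   # Simplified model (an assumption, not the Atlas implementation):
   # pick_relationship REQUESTED_TYPE CANDIDATE...
   # Prints the candidate equal to REQUESTED_TYPE; if there is none (for
   # example because REQUESTED_TYPE is empty), prints the first candidate.
   pick_relationship() {
     local requested=$1
     shift
     local candidate
     for candidate in "$@"; do
       if [ "$candidate" = "$requested" ]; then
         printf '%s\n' "$candidate"
         return 0
       fi
     done
     # No match: fall back to whichever candidate happens to come first.
     printf '%s\n' "${1-}"
   }
   ```
   For example, `pick_relationship "" spark_ml_pipeline_ml_directory spark_ml_pipeline_dataset` prints `spark_ml_pipeline_ml_directory`, the old definition, even though the new one is also registered under the same attribute name.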
   
   In the case of the `spark_ml_model` relation, this happens to be the right relationship, but in the case of `spark_ml_pipeline` it is the wrong one. See screenshots:
   ![Screenshot from 2020-02-28 21-53-21](https://user-images.githubusercontent.com/61428392/75582735-e56c2f00-5a74-11ea-9123-4fe3bf33881c.png)
   ![Screenshot from 2020-02-28 21-53-54](https://user-images.githubusercontent.com/61428392/75582748-eac97980-5a74-11ea-86de-a12633c0b0d6.png)
   
   
