Posted to dev@atlas.apache.org by "Barbara Eckman (Jira)" <ji...@apache.org> on 2020/08/18 16:50:00 UTC

[jira] [Assigned] (ATLAS-3570) Atlas typedefs for Machine Learning Models, Feature Sets, and Feature Engineering Engines

     [ https://issues.apache.org/jira/browse/ATLAS-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Barbara Eckman reassigned ATLAS-3570:
-------------------------------------

    Assignee:     (was: Barbara Eckman)

> Atlas typedefs for Machine Learning Models, Feature Sets, and Feature Engineering Engines
> -----------------------------------------------------------------------------------------
>
>                 Key: ATLAS-3570
>                 URL: https://issues.apache.org/jira/browse/ATLAS-3570
>             Project: Atlas
>          Issue Type: New Feature
>            Reporter: Barbara Eckman
>            Priority: Major
>         Attachments: MLModel_typedefs.tar
>
>
> Currently the base types in Atlas do not include Machine Learning (ML) Model tables. It would be nice to add typedefs for them, so they could be part of enterprise discovery and versioning.  
> ENTITIES COULD INCLUDE:
> MLModel (overview info), with attributes:
>  * uniqueId
>  * version
>  * businessUseCase
>  * modelFramework (eg scikit-learn)
>  * modelTypes (eg random forest regressor)
>  * modelClass (eg random forest (bagging + decision trees))
>  * isEnsemble boolean
>  * outcomeTypeDescription (eg single float)
>  * dataScienceOwnerEmail
>  * githubRepoURL where the model code is found
>  * modelDeploymentDate
>  * populationScored (eg in Comcast, residential or business customers)
>  * accuracyMeasures
> MLModelExecution, with attributes:
>  * exampleInputDatasetURL (URL where a sample input dataset can be found)
>  * outputTargetDatasetURLs
>  * opsOwnerEmail
>  * executionEndpointURL
>  * dockerContainerURL
>  * MLFlowPointerURL
>  * executionNotebookURL (eg Databricks, Jupyter)
> MLModelTraining, with attributes:
>  * hyperParameters
>  * trainingDatasetURLs
>  * trainingNotebookURL (eg Databricks, Jupyter)
> FeatureSet (a set of features prepared as input to an ML model), with attributes:
>  * version
>  * locationURL 
> FeatureEngineeringEngine (the engine that generates the feature set for an ML model), with attributes:
>  * version
>  * ownerEmail
>  * inputSourceURL
>  * processingEngineInfoURL (docs on the processing engine)
>  * githubRepoURL 
>  * outputTargetURL
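As a rough illustration of the entity list above, the sketch below builds a typedef document for MLModel in the general shape of Atlas's v2 typedef JSON (entityDefs containing attributeDefs). It is not the content of the attached MLModel_typedefs.tar. The helper name attr, the DataSet supertype, the attribute types (boolean, date), and the choice of which attributes are required or unique are all assumptions made for the example; only a subset of the listed attributes is shown.

```python
import json


def attr(name, type_name="string", optional=True, unique=False):
    """Build one attributeDef dict (hypothetical helper, not part of Atlas)."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": unique,
        "isIndexable": unique,
    }


ml_model_def = {
    "category": "ENTITY",
    "name": "MLModel",
    # Assumption: extending DataSet lets models act as process inputs/outputs.
    "superTypes": ["DataSet"],
    "attributeDefs": [
        attr("uniqueId", optional=False, unique=True),  # assumed required
        attr("version"),
        attr("businessUseCase"),
        attr("modelFramework"),
        attr("isEnsemble", "boolean"),        # assumed type
        attr("dataScienceOwnerEmail"),
        attr("githubRepoURL"),
        attr("modelDeploymentDate", "date"),  # assumed type
    ],
}

typedefs = {
    "enumDefs": [],
    "structDefs": [],
    "classificationDefs": [],
    "entityDefs": [ml_model_def],
    "relationshipDefs": [],
}

# Print the document that would be submitted to Atlas as a typedef payload.
print(json.dumps(typedefs, indent=2))
```

The other entities (MLModelExecution, MLModelTraining, FeatureSet, FeatureEngineeringEngine) could be written the same way, one dict per entity appended to entityDefs.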
> RELATIONSHIPS could include:
>  * model to execution
>  * model to training
>  * model execution to example input dataset (eg kafka topic)
>  * model execution to output target dataset (eg S3 prefix or object)
>  * model execution to input schema
>  * model execution to output schema
>  * model execution to input feature set objects
>  * training to input training dataset objects
>  * training to input training dataset schema
>  * feature engineering engine to output feature set object
>  * feature engineering engine to input source dataset (eg kafka topic)
>  * feature engineering engine to input source dataset's schema
>  * feature engineering engine to output target dataset (eg S3 object)
>  * feature set object to its schema
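The first two relationships in the list above might be expressed as relationshipDefs along the following lines. This is only a sketch: the relationship names, the end attribute names (executions, trainings, model), the ASSOCIATION category, and the cardinalities are assumptions chosen for illustration, and the helpers end_def and association are hypothetical.

```python
import json


def end_def(type_name, attr_name, cardinality):
    """One end of a relationship: the attribute that appears on type_name."""
    return {
        "type": type_name,
        "name": attr_name,
        "isContainer": False,  # neither end owns the other in an association
        "cardinality": cardinality,
    }


def association(name, end1, end2):
    """A relationshipDef in the ASSOCIATION category (assumed category)."""
    return {
        "category": "RELATIONSHIP",
        "name": name,
        "relationshipCategory": "ASSOCIATION",
        "propagateTags": "NONE",
        "endDef1": end1,
        "endDef2": end2,
    }


relationship_defs = [
    # model to execution: one model, possibly many executions (assumed)
    association(
        "mlmodel_executions",
        end_def("MLModel", "executions", "SET"),
        end_def("MLModelExecution", "model", "SINGLE"),
    ),
    # model to training: one model, possibly many training runs (assumed)
    association(
        "mlmodel_trainings",
        end_def("MLModel", "trainings", "SET"),
        end_def("MLModelTraining", "model", "SINGLE"),
    ),
]

print(json.dumps({"relationshipDefs": relationship_defs}, indent=2))
```

The remaining relationships (to datasets, schemas, and feature sets) would follow the same pattern, with the second end pointing at whatever type models the dataset or schema in a given deployment.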
> ENUMs could include:
>  * MLModel_type (eg logistic regression, random_forest_regression)
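A minimal sketch of the MLModel_type enum in Atlas's enumDef shape follows. Only the two example values from the description are included; their spelling as snake_case identifiers is an assumption, and the helper enum_def is hypothetical.

```python
import json


def enum_def(name, values):
    """Build an enumDef, numbering elements in the order given."""
    return {
        "category": "ENUM",
        "name": name,
        "elementDefs": [
            {"value": value, "ordinal": i} for i, value in enumerate(values)
        ],
    }


# Only the two examples named in the issue; a real list would be longer.
ml_model_type = enum_def(
    "MLModel_type",
    ["logistic_regression", "random_forest_regression"],
)

print(json.dumps({"enumDefs": [ml_model_type]}, indent=2))
```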
> PROCESSES related to MLModels could include:
>  * MLPipelineDependencyEdge (dependency between two models in the ML pipeline)
>  ** inputs and outputs are both MLModels
>  * MLModelEvolutionEdge (lineage between 2 versions of an ML model)
>  ** inputs and outputs are both MLModels
>  ** the only attribute is an array of strings describing the changes made from one version to the next. This could be made more structured as we discover how it is used.
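To make the evolution edge concrete, here is a sketch of a process typedef with a single array-of-strings attribute, plus one example entity that links two versions of a model. The attribute name changes, the Process supertype, the helper model_ref, and every qualifiedName and change string in the example are made up for illustration; none of them come from the attachment.

```python
import json

# Typedef: a process whose inputs and outputs are both MLModels.
evolution_edge_def = {
    "category": "ENTITY",
    "name": "MLModelEvolutionEdge",
    "superTypes": ["Process"],  # assumption: inherits inputs and outputs
    "attributeDefs": [
        {
            "name": "changes",  # hypothetical attribute name
            "typeName": "array<string>",
            "isOptional": True,
            "cardinality": "LIST",
            "isUnique": False,
            "isIndexable": False,
        }
    ],
}


def model_ref(qualified_name):
    """Reference an existing MLModel entity by its unique attribute."""
    return {
        "typeName": "MLModel",
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


# Example instance: lineage from version 1 to version 2 (made-up names).
edge = {
    "typeName": "MLModelEvolutionEdge",
    "attributes": {
        "qualifiedName": "example_model:v1->v2",
        "name": "example_model v1 to v2",
        "inputs": [model_ref("example_model@v1")],
        "outputs": [model_ref("example_model@v2")],
        "changes": ["hypothetical change description"],
    },
}

print(json.dumps({"entityDefs": [evolution_edge_def]}, indent=2))
print(json.dumps({"entity": edge}, indent=2))
```

MLPipelineDependencyEdge could be defined the same way without the changes attribute, with the upstream model as input and the downstream model as output.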
>  


