Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2022/01/25 05:51:23 UTC

[GitHub] [superset] Carla6-7 opened a new issue #18159: ## Preamble

Carla6-7 opened a new issue #18159:
URL: https://github.com/apache/superset/issues/18159


   ## Preamble
   
   This is the third part of my series of design documentation on refactoring Kedro to make deployment easier:
   
   - The first part is described in issue #770 and focuses on refactoring configuration to separate external and applicative configuration.
   - The second part is described in issue #904 and focuses on ``DataCatalog`` entries which have a compute/storage backend other than "python / in-memory operations" (including SQL, Spark...).
   - This third part focuses on the ability to modify the running logic at runtime, outside of the ``KedroSession``.
   
   ## Defining the feature: Modifying the running logic and distributing the modifier
   
   ### Current state of Kedro's extensibility
   
   There are currently several ways to extend Kedro natively, described hereafter:
   
   |What is extended|Example use cases|Kedro object|Registration|Popularity|
   |---|---|---|---|---|
   |Pipeline execution at runtime|- change catalog entries on the fly (cache data, change git branch...) <br /> - log data remotely (mlflow, neptune, dolt, store kedro-viz static files...)|Hooks (Pipeline, node)|- via an entrypoint <br /> - OR manual declaration in settings.py|High: [a quick GitHub search](https://github.com/search?p=1&q=before_pipeline_run&type=Code) shows that many users use hooks to add custom logic at runtime|
   |CLI command|- create a configuration file <br /> - profile a catalog entry <br /> - convert a kedro pipeline to an orchestrator <br /> - visualize the pipeline in a web browser…|plugin click commands|- via an entrypoint|Medium: Seems to be a more advanced use, mainly for plugin developers|
   |Data sources connection|- Create a dataset which can connect to a new data source unsupported by kedro (GBQ, HDF, sklearn pipelines, Databricks, Stata, redis… are the most recent ones)|AbstractDataSet|As a module, which can be imported by its path in the DataCatalog|High: [a quick search in Kedro's past issues](https://github.com/quantumblacklabs/kedro/issues?q=is%3Aissue+dataset) shows that it is a very common request from users who need to connect to specific data sources|
   
   ### Use cases not covered by previous mechanisms
   
   However, I've encountered a number of use cases where people want to extend the **running logic** (= how to run the pipeline) rather than the execution logic (= how the pipeline behaves during runtime, which is achieved by hooks). Some examples include:
   1. Running the entire pipeline several times, e.g. with different sets of parameters for hyperparameter tuning (https://github.com/quantumblacklabs/kedro/issues/282#issuecomment-768111744, https://github.com/quantumblacklabs/kedro/discussions/948, https://github.com/Galileo-Galilei/kedro-mlflow/issues/246)
   2. Preparing a conda environment in a different process before running the pipeline to ensure environment consistency (this is very similar to what "mlflow projects" do)
   3. Performing "CI-like checks" (lint...) before running the pipeline, especially when you launch a very long pipeline (this is very similar to what "mlflow projects" do)
   4. Forcing a commit of unstaged changes to ensure reproducibility (this is very similar to what "mlflow projects" do)
   5. Exposing the pipeline as an API once it has finished running (this could be a convenient way to serve a Kedro pipeline)
   6. If we offer the community the ability to distribute such changes, I'm pretty sure other use cases will arise 😃 
   
   These are **real life use-cases which cannot be achieved by hooks because we want to perform operations outside of a ``KedroSession``**.
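
To make the distinction concrete, here is a minimal Python sketch of use case 1. It is not Kedro code: `run_once` is a hypothetical stand-in for "open a session, run the pipeline, close the session", and `run_many` is an assumed name for the outer loop. The only point is that the loop lives outside any single session, which is why a hook cannot express it.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_many(
    run_once: Callable[[Dict[str, Any]], Any],
    param_grid: Iterable[Dict[str, Any]],
) -> List[Any]:
    """Call ``run_once`` once per parameter set and collect the results.

    Each call is expected to create and close its own session, so this
    function is "running logic": it decides how often the pipeline runs.
    """
    return [run_once(params) for params in param_grid]


def fake_session_run(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for one full pipeline run with extra params."""
    return {"params": params, "score": params["alpha"] * 2}


results = run_many(fake_session_run, [{"alpha": 0.5}, {"alpha": 1.0}])
for result in results:
    print(result)
```

In a real project, `fake_session_run` would be replaced by whatever code currently sits inside the project's `run` command.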
   
   ### Current workaround pros and cons analysis 
   
   I can currently think of two ways to achieve the previous use cases in Kedro:
   - override the `cli.py:run` command at the project level (or in a plugin) with custom logic
   - create a custom runner inheriting from ``AbstractRunner`` which contains the execution logic, and manually inject it in your ``cli.py`` at the project level.
   
   These solutions have serious issues:
   - **lack of composability**: if you want to compose two pieces of logic, you cannot just import the ``run`` from another project or plugin; you have to recode everything at the project level. At least the ``runner`` solution makes it possible to compose logic through inheritance, but it is not easy to maintain.
   - **difficulty of distribution**: if you create a run command in a plugin, you can ``pip install`` it and benefit from the new logic; however, you have to give up the possibility of extending your own CLI at the project level. Even worse, plugin import order can lead to inconsistent behaviour if several plugins implement a run command.
   - **difficulty of maintenance**: since it is hard to know which ``run`` command is running when the command is overridden concurrently, it can obscure many runtime errors.
   - **lack of flexibility**: you can have only a single running logic in your project, while you often need to switch between kedro's default ``run`` command and the custom one (e.g. you want to run your pipeline normally most of the time while developing, and occasionally use another logic such as one of those described above).
   
   The best workflow I could come up with to implement such "running logic" changes is the following:
   - Create a custom ``AbstractRunner``
   - Modify the ``cli.py`` on a per project basis to use my custom runner
   - Create several very similar commands (run, run_serve, run_pre_conda...) with duplicated code to run the session, each one with a different running logic, so I can pick the one I want when running the pipeline.
   
   So I can at least reuse my custom ``runner`` in other projects by importing it and modifying the other project's ``cli.py``, which is not very convenient.
   
   ## Potential solutions:
   
   ### A short term solution: Injecting the ``runner`` class at runtime 
   
   Kedro currently seems to have all the important "elementary bricks" to create custom running logic and choose it at runtime: the ``run`` command and the ``AbstractRunner`` class.
   
   The main drawback is that we can't easily distribute this logic to other users. I suggest modifying the default `run` command so that the runner can be flexibly specified at runtime by its path, with a logic similar to custom ``DataSet`` classes in the ``DataCatalog``.
   
   https://github.com/quantumblacklabs/kedro/blob/c2c984a260132cdb9c434099485eae05707ad116/kedro/framework/cli/project.py#L351-L392
   
   ```diff
   def run(
       tag,
       env,
       parallel,
       runner,
       is_async,
       node_names,
       to_nodes,
       from_nodes,
       from_inputs,
       to_outputs,
       load_version,
       pipeline,
       config,
       params,
   ):
       """Run the pipeline."""
       if parallel and runner:
           raise KedroCliError(
               "Both --parallel and --runner options cannot be used together. "
               "Please use either --parallel or --runner."
           )
       runner = runner or "SequentialRunner"
       if parallel:
           runner = "ParallelRunner"
   
   -   runner_class = load_obj(runner, "kedro.runner")
   +   runner_prefix = "kedro.runner" if runner in {"SequentialRunner", "ParallelRunner", "ThreadRunner"} else ""
   +   # eventually "import settings" and load runner configuration from a config file to enable parameterization?
   +   runner_class = load_obj(runner, runner_prefix)
   
       tag = _get_values_as_tuple(tag) if tag else tag
       node_names = _get_values_as_tuple(node_names) if node_names else node_names
   
   	
       with KedroSession.create(env=env, extra_params=params) as session:
           session.run(
               tags=tag,
               runner=runner_class(is_async=is_async),
               node_names=node_names,
               from_nodes=from_nodes,
               to_nodes=to_nodes,
               from_inputs=from_inputs,
               to_outputs=to_outputs,
               load_versions=load_version,
               pipeline_name=pipeline,
           )
   ```
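
The name resolution in the diff above can be sketched on its own. This is a rough illustration under stated assumptions, not Kedro's implementation: `resolve_class` is a hypothetical helper (it is not `load_obj`), and standard-library classes stand in for runners so that the snippet runs without Kedro installed. Short names from a known set are looked up in a default module, and anything else must be a full dotted path.

```python
import importlib
from typing import Iterable


def resolve_class(name: str, builtin_names: Iterable[str], builtin_module: str) -> type:
    """Resolve a short built-in name or a full dotted path to a class."""
    # Short names are only accepted for the known built-ins.
    path = f"{builtin_module}.{name}" if name in set(builtin_names) else name
    module_name, sep, class_name = path.rpartition(".")
    if not sep:
        # A bare unknown name has no module to import from.
        raise ValueError(f"{name!r} is neither a built-in name nor a dotted path")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-ins: "collections" plays the role of the default runner module.
BUILTINS = {"OrderedDict", "Counter"}

short = resolve_class("Counter", BUILTINS, "collections")
full = resolve_class("fractions.Fraction", BUILTINS, "collections")
print(short, full)
```

One design choice in this sketch is to reject a bare unknown name explicitly, so that a typo in a runner name gives a clear error instead of an import failure.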
   
   **Advantages for kedro users:** 
   - This would make it possible to **use the same** command to inject my running logic at runtime, e.g.:
   ```bash
   kedro run --pipeline=my-pipeline # normal use
   kedro run --pipeline=my-pipeline --runner=kedro_mlflow.runner.MlflowRunner # use mlflow projects to create a conda env, clean git history, perform checks before running
   kedro run --pipeline=my-pipeline --runner=ServiceRunner # Serve my pipeline after running
   ```
   - it would make the **transition to production very easy if you want to have different logic** for e.g. serving the model or processing a batch.
   - this implementation is **completely backward-compatible** with kedro's running logic and completely straightforward to add to the codebase.
   - the logic is **very easy to distribute**: anyone can use my custom runner just with its module path.
   
   ### Towards more flexibility: configure runners in a configuration file
   
   The previous solution does not make it possible to inject additional parameters into the runner. Parameter handling currently "feels" poorly managed (there are "if" conditions inside the run command to check whether a parameter can be used with the given runner or not...). A solution could be to have a ``runner.yml`` file behaving in a catalog-like way to enable parametrization. It would also make it possible to use the same runner with different parameters. Such a file could look like this:
   
   ```
   #runner.yml
   
   my_parallel_runner_async:
       type: ParallelRunner
       is_async: True
   
   my_service_runner:
       type: nice_plugin.runner.ServiceRunner
       host: 127.0.0.1
       port: 5000
   
   my_service_runner2:
       type: nice_plugin.runner.ServiceRunner
       host: 127.0.0.1
       port: 5001
   ```
   
   And the ``run`` command could resolve a name in this ``RunnerCatalog`` and use it in the following fashion: 
   
   ```bash
   kedro run --pipeline=my_pipeline --runner=my_service_runner2
   ```
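
A minimal sketch of how such a lookup could work, assuming the YAML file has already been parsed into a plain dict. Everything here is hypothetical: `build_runner`, `import_by_path` and the local `ServiceRunner` dataclass are illustration names, not Kedro or plugin APIs. The `type` key selects the class and the remaining keys are passed as constructor arguments, in the same spirit as catalog entries.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ServiceRunner:
    """Hypothetical stand-in for a plugin runner; not a Kedro class."""

    host: str = "127.0.0.1"
    port: int = 5000


def import_by_path(path: str) -> type:
    """Default resolver: import a class from its full dotted path."""
    module_name, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


def build_runner(
    name: str,
    config: Dict[str, Dict[str, Any]],
    resolve: Callable[[str], type] = import_by_path,
) -> Any:
    """Instantiate the runner registered under ``name`` in ``config``."""
    if name not in config:
        raise KeyError(f"No runner named {name!r} in the runner configuration")
    entry = dict(config[name])  # copy so the configuration is not mutated
    runner_class = resolve(entry.pop("type"))
    return runner_class(**entry)


# What a parsed runner.yml could look like (two of the entries shown above).
RUNNERS = {
    "my_service_runner": {
        "type": "nice_plugin.runner.ServiceRunner",
        "host": "127.0.0.1",
        "port": 5000,
    },
    "my_service_runner2": {
        "type": "nice_plugin.runner.ServiceRunner",
        "host": "127.0.0.1",
        "port": 5001,
    },
}

# "nice_plugin" is not installed here, so map its path to the local stand-in.
LOCAL_TYPES = {"nice_plugin.runner.ServiceRunner": ServiceRunner}
runner = build_runner("my_service_runner2", RUNNERS, resolve=LOCAL_TYPES.__getitem__)
print(runner)
```

With a real plugin installed, the `resolve` argument could be left at its default so that `type` is imported by its dotted path.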
   
   __Originally posted by @Galileo-Galilei in https://github.com/kedro-org/kedro/issues/1041__


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org

[GitHub] [superset] geido commented on issue #18159: ## Preamble

Posted by GitBox <gi...@apache.org>.

geido commented on issue #18159:
URL: https://github.com/apache/superset/issues/18159#issuecomment-1021147081


   Hello @Carla6-7 I feel like this issue does not belong to Superset. I'll close it for now.


[GitHub] [superset] geido closed issue #18159: ## Preamble

Posted by GitBox <gi...@apache.org>.

geido closed issue #18159:
URL: https://github.com/apache/superset/issues/18159


   

