Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/03/25 15:49:49 UTC

[GitHub] [airflow] jrwalk opened a new issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization

jrwalk opened a new issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization
URL: https://github.com/apache/airflow/issues/7870
 
 
   **Description**
   
   Replace `dill` with `cloudpickle` as the optional serialization library in `PythonVirtualenvOperator`.  This should be a mostly drop-in replacement.
   
   **Use case / motivation**
   
   Currently, the `PythonVirtualenvOperator` optionally uses `dill` in place of stock `pickle` to serialize advanced types.  However, most major distributed compute frameworks have shifted to `cloudpickle`.  Using `dill` in Airflow can therefore introduce a redundant dependency when calling out to other distributed compute (e.g., farming compute-heavy tasks out to a remote `dask` cluster), and can interfere with how those tools serialize tasks.
   
   Since both `dill` and `cloudpickle` are largely drop-in replacements for `pickle`, the migration should be fairly minor.
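
As a rough illustration of the drop-in claim: `pickle`, `dill`, and `cloudpickle` all expose `dumps` and `loads`, so a call site mostly needs the module swapped. This is a sketch, not Airflow code. The helper name `roundtrip` and the sample payload are made up for the example, and `dill` / `cloudpickle` are only exercised if they happen to be installed.

```python
import importlib
import pickle


def roundtrip(serializer, obj):
    """Serialize and deserialize obj with any module exposing dumps/loads."""
    return serializer.loads(serializer.dumps(obj))


# Made-up payload, standing in for arguments passed to a task.
payload = {"task_id": "example", "args": [1, 2, 3]}

# Stock pickle always handles plain data.
assert roundtrip(pickle, payload) == payload

# dill and cloudpickle are optional here: skip whichever is not installed.
for name in ("dill", "cloudpickle"):
    try:
        module = importlib.import_module(name)
    except ImportError:
        continue
    assert roundtrip(module, payload) == payload
```

The differences between the libraries show up with harder objects (lambdas, closures, interactively defined classes), not with plain data like this.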
   
   **Related Issues**
   
   https://github.com/kubeflow/pipelines/issues/1387
   
   https://github.com/dask/distributed/issues/3606
   
   https://github.com/RaRe-Technologies/gensim/issues/558#issuecomment-217445542
   
   https://github.com/uqfoundation/multiprocess/issues/22#issuecomment-243120410

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] ashb commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization
URL: https://github.com/apache/airflow/issues/7870#issuecomment-604724038
 
 
   Is it worth making the pickle module a config setting, or just always using cloudpickle instead?
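
One possible shape for the config-setting option, as a sketch only: no such setting exists in Airflow as far as this thread shows, and `ALLOWED_PICKLERS` and `resolve_pickler` are hypothetical names. The idea is to map a configured string to a module through an allow-list so that arbitrary module names cannot be imported.

```python
import importlib

# Hypothetical allow-list: the package names are real, the setting is not.
ALLOWED_PICKLERS = ("pickle", "dill", "cloudpickle")


def resolve_pickler(name="pickle"):
    """Return the serializer module named by a (hypothetical) config value."""
    if name not in ALLOWED_PICKLERS:
        raise ValueError(
            f"unsupported pickler {name!r}; expected one of {ALLOWED_PICKLERS}"
        )
    # Raises ImportError if the configured package is not installed.
    return importlib.import_module(name)


pickler = resolve_pickler("pickle")
data = pickler.dumps({"x": 1})
print(pickler.loads(data))
```

Always using `cloudpickle` would avoid this indirection entirely, at the cost of making it a hard dependency.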


[GitHub] [airflow] jrwalk commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization

Posted by GitBox <gi...@apache.org>.
jrwalk commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization
URL: https://github.com/apache/airflow/issues/7870#issuecomment-605301945
 
 
   Either way, it would probably also require reworking some of the tests, since (for example) `cloudpickle` could conceivably serialize `lambda` functions passed as `python_callable` inputs.
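
A small sketch of the behavior difference behind that concern. It does not reproduce any actual Airflow test. `can_serialize` and `double` are made-up names, and the `cloudpickle` branch only runs if the package is installed.

```python
import pickle


def can_serialize(serializer, obj):
    """Return True if serializer.dumps(obj) succeeds."""
    try:
        serializer.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


double = lambda x: x * 2  # stand-in for a `python_callable` argument

# Stock pickle stores functions by reference (module plus qualified name),
# and the name `<lambda>` cannot be looked up that way.
assert not can_serialize(pickle, double)

try:
    import cloudpickle  # optional; only exercised when installed
except ImportError:
    cloudpickle = None

if cloudpickle is not None:
    # cloudpickle stores the function by value, so the lambda round-trips.
    restored = pickle.loads(cloudpickle.dumps(double))
    assert restored(21) == 42
```

Any test that expects a `lambda` to be rejected would need revisiting under a serializer that accepts it.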


[GitHub] [airflow] jrwalk commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization

Posted by GitBox <gi...@apache.org>.
jrwalk commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization
URL: https://github.com/apache/airflow/issues/7870#issuecomment-605065110
 
 
   So `dill` is optional to use (set by a flag in the `PythonVirtualenvOperator`) but is always imported and treated as a required dependency for Airflow, right?  If that pattern is kept, it should be super easy to cut over to `cloudpickle`.
   
   I'd generally advocate for allowing fewer dependencies though, so it could be nice to make `cloudpickle` a proper extra dependency and guard the import with checks.  That would let people keep minimal environments if they choose/need to.  I've also run into situations where `dill` wasn't being used (no virtualenv operators in my DAG) but, because it was imported, it was still getting captured by serializers like `cloudpickle`, since they scoop up a fair amount of runtime environment information.
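
A minimal sketch of what "checks around the import" could look like, assuming the import is deferred until the flag is actually set. `dump_callable_args` and `use_cloudpickle` are hypothetical names, not the operator's real API.

```python
import pickle


def dump_callable_args(args, use_cloudpickle=False):
    """Serialize args, importing cloudpickle only if the caller asks for it."""
    if not use_cloudpickle:
        return pickle.dumps(args)
    try:
        import cloudpickle  # deferred so default environments never load it
    except ImportError as exc:
        raise RuntimeError(
            "use_cloudpickle=True requires the optional 'cloudpickle' package"
        ) from exc
    return cloudpickle.dumps(args)


# Default path: only the standard library is touched.
blob = dump_callable_args({"n": 3})
assert pickle.loads(blob) == {"n": 3}
```

Because the import sits inside the function, a DAG with no virtualenv operators never loads the optional module through this code path.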


[GitHub] [airflow] boring-cyborg[bot] commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #7870: migrate from `dill` to `cloudpickle` for advanced serialization
URL: https://github.com/apache/airflow/issues/7870#issuecomment-603918719
 
 
   Thanks for opening your first issue here! Be sure to follow the issue template!
   
