
Posted to issues@arrow.apache.org by "rjzamora (via GitHub)" <gi...@apache.org> on 2023/04/04 13:36:03 UTC

[GitHub] [arrow] rjzamora opened a new issue, #34884: Dataset PartitioningFactory cannot be serialized in Python

rjzamora opened a new issue, #34884:
URL: https://github.com/apache/arrow/issues/34884

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I would like to serialize a dictionary of `pyarrow.dataset.dataset` keyword arguments (for parallel processing in Dask). However, this is not possible when one of those arguments is a `Partitioning`/`PartitioningFactory` object, because those objects cannot be pickled.
   
   **Reproducer**:
   
   ```python
   In [1]: import pyarrow.dataset as ds
      ...: import pickle
      ...: 
      ...: partitioning = ds.partitioning(flavor="hive")
      ...: pickle.dumps(partitioning)
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   Cell In[1], line 5
         2 import pickle
         4 partitioning = ds.partitioning(flavor="hive")
   ----> 5 pickle.dumps(partitioning)
   
   File stringsource:2, in pyarrow._dataset.PartitioningFactory.__reduce_cython__()
   
   TypeError: self.factory,self.wrapped cannot be converted to a Python object for pickling
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche closed issue #34884: [Python] Dataset PartitioningFactory cannot be serialized in Python

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche closed issue #34884: [Python] Dataset PartitioningFactory cannot be serialized in Python
URL: https://github.com/apache/arrow/issues/34884



[GitHub] [arrow] rjzamora commented on issue #34884: [Python] Dataset PartitioningFactory cannot be serialized in Python

Posted by "rjzamora (via GitHub)" <gi...@apache.org>.

rjzamora commented on issue #34884:
URL: https://github.com/apache/arrow/issues/34884#issuecomment-1628921132

   Thanks @jorisvandenbossche !



[GitHub] [arrow] rjzamora commented on issue #34884: [Python] Dataset PartitioningFactory cannot be serialized in Python

Posted by "rjzamora (via GitHub)" <gi...@apache.org>.

rjzamora commented on issue #34884:
URL: https://github.com/apache/arrow/issues/34884#issuecomment-1501957023

   Thanks @westonpace ! The current workaround in Dask is indeed to allow the user to specify a dictionary like `{"flavor": "hive", "schema": ...}`. This works fine, but the Dask UX would certainly be cleaner if the user could pass in something like an initialized `HivePartitioning` object.
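A minimal sketch of the dictionary workaround described in this comment, assuming the keyword arguments are passed straight to `pyarrow.dataset.partitioning`. The helper name `make_partitioning` and the variable `spec` are hypothetical and are not part of Dask or pyarrow.

```python
import pickle


def make_partitioning(spec):
    """Rebuild a partitioning object from a plain dict of keyword arguments.

    Hypothetical helper: `spec` holds arguments accepted by
    pyarrow.dataset.partitioning, e.g. {"flavor": "hive"}.
    """
    # Imported lazily, so only the process that rebuilds the object needs pyarrow.
    import pyarrow.dataset as ds

    return ds.partitioning(**spec)


# A plain dict pickles without trouble, unlike the PartitioningFactory itself.
spec = {"flavor": "hive"}
payload = pickle.dumps(spec)

# On the receiving side (for example a worker process), restore the dict
# and only then construct the partitioning object.
restored = pickle.loads(payload)
# partitioning = make_partitioning(restored)
```

The point of the design is that only plain Python data crosses the process boundary; the pyarrow object is created after unpickling.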



[GitHub] [arrow] westonpace commented on issue #34884: [Python] Dataset PartitioningFactory cannot be serialized in Python

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34884:
URL: https://github.com/apache/arrow/issues/34884#issuecomment-1501936748

   All the partitioning objects we have today can boil down to a schema (which can be saved as an empty parquet/arrow file) and a string denoting the type (e.g. "directory" or "hive" or "filename"). I'm not sure if this is helpful or not since I suspect the goal is automatic serialization.

