Posted to issues@arrow.apache.org by "Krisztian Szucs (JIRA)" <ji...@apache.org> on 2019/04/16 12:45:00 UTC
[jira] [Updated] (ARROW-5144) [Python] ParquetDataset and ParquetPiece not serializable
[ https://issues.apache.org/jira/browse/ARROW-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs updated ARROW-5144:
-----------------------------------
Summary: [Python] ParquetDataset and ParquetPiece not serializable (was: [Python] ParquetDataset and CloudParquetPiece not serializable)
> [Python] ParquetDataset and ParquetPiece not serializable
> ---------------------------------------------------------
>
> Key: ARROW-5144
> URL: https://issues.apache.org/jira/browse/ARROW-5144
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Environment: osx python36/conda cloudpickle 0.8.1
> arrow-cpp 0.13.0 py36ha71616b_0 conda-forge
> pyarrow 0.13.0 py36hb37e6aa_0 conda-forge
> Reporter: Martin Durant
> Assignee: Krisztian Szucs
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Since 0.13.0, ParquetDataset and ParquetPiece instances are no longer serializable, which means that dask.distributed cannot pass them between processes in order to load Parquet files in parallel.
> Example:
> ```
> >>> import cloudpickle
> >>> import pyarrow.parquet as pq
> >>> pf = pq.ParquetDataset('nation.impala.parquet')
> >>> cloudpickle.dumps(pf)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dumps(obj, protocol)
> 893 try:
> 894 cp = CloudPickler(file, protocol=protocol)
> --> 895 cp.dump(obj)
> 896 return file.getvalue()
> 897 finally:
> ~/anaconda/envs/py36/lib/python3.6/site-packages/cloudpickle/cloudpickle.py in dump(self, obj)
> 266 self.inject_addons()
> 267 try:
> --> 268 return Pickler.dump(self, obj)
> 269 except RuntimeError as e:
> 270 if 'recursion' in e.args[0]:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in dump(self, obj)
> 407 if self.proto >= 4:
> 408 self.framer.start_framing()
> --> 409 self.save(obj)
> 410 self.write(STOP)
> 411 self.framer.end_framing()
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
> 519
> 520 # Save the reduce() output and finally memoize the object
> --> 521 self.save_reduce(obj=obj, *rv)
> 522
> 523 def persistent_id(self, obj):
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
> 632
> 633 if state is not None:
> --> 634 save(state)
> 635 write(BUILD)
> 636
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
> 474 f = self.dispatch.get(t)
> 475 if f is not None:
> --> 476 f(self, obj) # Call unbound method with explicit self
> 477 return
> 478
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save_dict(self, obj)
> 819
> 820 self.memoize(obj)
> --> 821 self._batch_setitems(obj.items())
> 822
> 823 dispatch[dict] = save_dict
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in _batch_setitems(self, items)
> 845 for k, v in tmp:
> 846 save(k)
> --> 847 save(v)
> 848 write(SETITEMS)
> 849 elif n:
> ~/anaconda/envs/py36/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
> 494 reduce = getattr(obj, "__reduce_ex__", None)
> 495 if reduce is not None:
> --> 496 rv = reduce(self.proto)
> 497 else:
> 498 reduce = getattr(obj, "__reduce__", None)
> ~/anaconda/envs/py36/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-darwin.so in pyarrow._parquet.ParquetSchema.__reduce_cython__()
> TypeError: no default __reduce__ due to non-trivial __cinit__
> ```
> The indicated schema instance is also referenced by the ParquetDatasetPiece objects.
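The error comes from Cython's auto-generated `__reduce_cython__`, which refuses to pickle extension types that have a non-trivial `__cinit__`. The usual remedy is to implement `__reduce__` so the object is rebuilt from plain, picklable constructor arguments instead of raw C-level state. A minimal sketch of that pattern in pure Python (the `Schema` class below is a toy stand-in, not pyarrow's actual implementation):

```python
import pickle

class Schema:
    # Stand-in for a Cython extension type whose real state lives in C++;
    # pickle cannot serialize such an object field-by-field.
    def __init__(self, names):
        self.names = tuple(names)

    def __reduce__(self):
        # Tell pickle how to rebuild the object: call Schema(names)
        # with plain-Python arguments on the receiving side.
        return (Schema, (self.names,))

schema = Schema(["n_nationkey", "n_name"])
restored = pickle.loads(pickle.dumps(schema))
```

With `__reduce__` defined, the round-trip succeeds and `restored.names` equals the original names.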
> ref: https://github.com/dask/distributed/issues/2597
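Until pickling support lands in pyarrow, one workaround on the dask side is to ship only the dataset's file paths between processes and reconstruct the reader on each worker. A sketch, assuming every worker can see the files; `read_piece` is a hypothetical helper, not a pyarrow API:

```python
import pickle

def read_piece(path):
    # On a real worker this would call pq.read_table(path).
    # Here we only demonstrate that plain paths pickle cleanly.
    return f"table-from-{path}"

# Ship paths, not ParquetDataset/ParquetPiece objects.
piece_paths = ["nation.impala.parquet"]

payload = pickle.dumps(piece_paths)  # plain strings always serialize
tables = [read_piece(p) for p in pickle.loads(payload)]
```

This trades a little per-worker metadata parsing for not having to serialize the dataset object at all.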
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)