Posted to issues@arrow.apache.org by "plamb-viso (via GitHub)" <gi...@apache.org> on 2023/03/04 20:21:26 UTC

[GitHub] [arrow] plamb-viso opened a new issue, #34455: ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

plamb-viso opened a new issue, #34455:
URL: https://github.com/apache/arrow/issues/34455

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm using Huggingface Datasets, which uses pyarrow under the covers, to encode a dataset. The mapper function lets you split the encoding across multiple processes, each of which gets a chunk of the dataset. At the end of encoding, the results are flattened and written to disk using pyarrow.
   
   When run over multiple processes, the dataset encodes correctly but then hangs indefinitely once it reaches the flattening step. When run on a single process, mapping completes, and once `save_to_disk` is called another flattening pass happens inside pyarrow. The following exception is thrown:
   
   ```
   File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1348, in save_to_disk
       dataset = self.flatten_indices(num_proc=num_proc) if self._indices is not None else self
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
       out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
     File "/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
       out = func(dataset, *args, **kwargs)
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3541, in flatten_indices
       return self.map(
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
       out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
       out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2953, in map
       for rank, done, content in Dataset._map_single(**dataset_kwargs):
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3346, in _map_single
       writer.write_batch(batch)
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py", line 555, in write_batch
       self.write_table(pa_table, writer_batch_size)
     File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py", line 567, in write_table
       pa_table = pa_table.combine_chunks()
     File "pyarrow/table.pxi", line 3241, in pyarrow.lib.Table.combine_chunks
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
   pyarrow.lib.ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>
   ```
   My immediate interpretation of the error is that `Array2D` simply hasn't been implemented for the way I'm attempting to use it. The problem is that I've used it the same way many times in the past.
   
   To give a feel for how I'm using it, the only two columns I have with the `Array2D` datatype are:
   ```python
   'seg_data': Array2D(dtype="float32", shape=(1024, 4)),
   'visual_seg_data': Array2D(dtype="int64", shape=(196, 4)),
   ```
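
   For context, here is a minimal sketch of the overall flow I'm describing (the mapper body and path below are illustrative stand-ins, not my real code, so treat it as a sketch rather than a guaranteed repro):

   ```python
   from datasets import Array2D, Dataset, Features, Value

   features = Features({
       "text": Value("string"),
       "seg_data": Array2D(dtype="float32", shape=(1024, 4)),
       "visual_seg_data": Array2D(dtype="int64", shape=(196, 4)),
   })

   def encode(batch):
       # Placeholder encoder; the real one computes these arrays.
       n = len(batch["text"])
       batch["seg_data"] = [[[0.0] * 4 for _ in range(1024)] for _ in range(n)]
       batch["visual_seg_data"] = [[[0] * 4 for _ in range(196)] for _ in range(n)]
       return batch

   ds = Dataset.from_dict({"text": ["a", "b"]})
   ds = ds.map(encode, batched=True, features=features)
   ds.save_to_disk("/tmp/encoded")  # the step where the exception surfaces for me
   ```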
   
   I posted this on the Datasets forum as well, but thought I might try here.
   
   ### Component(s)
   
   Python


[GitHub] [arrow] off99555 commented on issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "off99555 (via GitHub)" <gi...@apache.org>.
off99555 commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1512832822

   I also have this error when calling `Dataset.push_to_hub()` with a dataset that has more than one shard.
   ```
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:5311, in Dataset.push_to_hub(self, repo_id, split, private, token, branch, max_shard_size, num_shards, embed_external_files)
      5306 if max_shard_size is not None and num_shards is not None:
      5307     raise ValueError(
      5308         "Failed to push_to_hub: please specify either max_shard_size or num_shards, but not both."
      5309     )
   -> 5311 repo_id, split, uploaded_size, dataset_nbytes, repo_files, deleted_size = self._push_parquet_shards_to_hub(
      5312     repo_id=repo_id,
      5313     split=split,
      5314     private=private,
      5315     token=token,
      5316     branch=branch,
      5317     max_shard_size=max_shard_size,
      5318     num_shards=num_shards,
      5319     embed_external_files=embed_external_files,
      5320 )
      5321 organization, dataset_name = repo_id.split("/")
      5322 info_to_dump = self.info.copy()
   
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:5194, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, max_shard_size, num_shards, embed_external_files)
      5192 uploaded_size = 0
      5193 shards_path_in_repo = []
   -> 5194 for index, shard in logging.tqdm(
      5195     enumerate(itertools.chain([first_shard], shards_iter)),
      5196     desc="Pushing dataset shards to the dataset hub",
      5197     total=num_shards,
      5198     disable=not logging.is_progress_bar_enabled(),
      5199 ):
      5200     shard_path_in_repo = path_in_repo(index, shard)
      5201     # Upload a shard only if it doesn't already exist in the repository
   
   File /usr/local/lib/python3.10/dist-packages/tqdm/notebook.py:254, in tqdm_notebook.__iter__(self)
       252 try:
       253     it = super(tqdm_notebook, self).__iter__()
   --> 254     for obj in it:
       255         # return super(tqdm...) will not catch exception
       256         yield obj
       257 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt
   
   File /usr/local/lib/python3.10/dist-packages/tqdm/std.py:1178, in tqdm.__iter__(self)
      1175 time = self._time
      1177 try:
   -> 1178     for obj in iterable:
      1179         yield obj
      1180         # Update and possibly print the progressbar.
      1181         # Note: does not call self.update(1) for speed optimisation.
   
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:5169, in Dataset._push_parquet_shards_to_hub.<locals>.shards_with_embedded_external_files(shards)
      5167 format = shard.format
      5168 shard = shard.with_format("arrow")
   -> 5169 shard = shard.map(
      5170     embed_table_storage,
      5171     batched=True,
      5172     batch_size=1000,
      5173     keep_in_memory=True,
      5174 )
      5175 shard = shard.with_format(**format)
      5176 yield shard
   
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:563, in transmit_tasks.<locals>.wrapper(*args, **kwargs)
       561     self: "Dataset" = kwargs.pop("self")
       562 # apply actual function
   --> 563 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
       564 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
       565 for dataset in datasets:
       566     # Remove task templates if a column mapping of the template is no longer valid
   
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:528, in transmit_format.<locals>.wrapper(*args, **kwargs)
       521 self_format = {
       522     "type": self._format_type,
       523     "format_kwargs": self._format_kwargs,
       524     "columns": self._format_columns,
       525     "output_all_columns": self._output_all_columns,
       526 }
       527 # apply actual function
   --> 528 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
       529 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
       530 # re-apply format to the output
   
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:3004, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
      2996 if transformed_dataset is None:
      2997     with logging.tqdm(
      2998         disable=not logging.is_progress_bar_enabled(),
      2999         unit=" examples",
      (...)
      3002         desc=desc or "Map",
      3003     ) as pbar:
   -> 3004         for rank, done, content in Dataset._map_single(**dataset_kwargs):
      3005             if done:
      3006                 shards_done += 1
   
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:3395, in Dataset._map_single(shard, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset)
      3393     stack.enter_context(writer)
      3394 if isinstance(batch, pa.Table):
   -> 3395     writer.write_table(batch)
      3396 else:
      3397     writer.write_batch(batch)
   
   File /usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py:567, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
       565 if self.pa_writer is None:
       566     self._build_writer(inferred_schema=pa_table.schema)
   --> 567 pa_table = pa_table.combine_chunks()
       568 pa_table = table_cast(pa_table, self._schema)
       569 if self.embed_local_files:
   
   File /usr/local/lib/python3.10/dist-packages/pyarrow/table.pxi:3315, in pyarrow.lib.Table.combine_chunks()
   
   File /usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
   
   File /usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
   
   ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>
   ```
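
   For what it's worth, the trigger on my side is essentially just the following (the repo id is a placeholder):

   ```python
   # ds is a datasets.Dataset with extension-typed (Array2D/Array3D) columns.
   ds.push_to_hub("my-user/my-dataset", num_shards=2)  # more than one shard is what matters
   ```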


[GitHub] [arrow] plamb-viso commented on issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "plamb-viso (via GitHub)" <gi...@apache.org>.
plamb-viso commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1455201244

   I was just able to get past this exception locally by downgrading from datasets==2.10.0 to datasets==2.9.0.


[GitHub] [arrow] assignUser commented on issue #34455: ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1454893452

   Hey thanks for the report!
   
   When did this error start happening? We recently released 11.0.0; did you upgrade? Which version is this error happening with?
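
   If you're not sure, this prints the versions in play:

   ```python
   import datasets
   import pyarrow

   print("pyarrow:", pyarrow.__version__)
   print("datasets:", datasets.__version__)
   ```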


[GitHub] [arrow] guillermojp commented on issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "guillermojp (via GitHub)" <gi...@apache.org>.
guillermojp commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1517286346

   Just for completeness, I've also faced this issue when writing PyArrow tables directly to a file. So, if `tables` is a list of `pa.Table`, each of this (example) form:
   
   ```python
   {'var_0': 0,
    'var_1': 33,
    'var_2': 0,
    'var_3': [20256, 3798],
    'image': array([[[  0,  58], # Array3D, in this case of shape (224, 224, 3)
           [  0,  57],
           [  0,  52],
           ...,
           [  0,  22],
           [  0, 245],
           [  0, 156]],
           ...,
           [  0,   0],
           [  0,   0],
           [  0,   0]]], dtype=uint8)}
   ```
   
   and with the following schema:
   
   ```python
   var_0: int32
   var_1: int32
   var_2: int32
   var_3: fixed_size_list<item: int32>[2]
     child 0, item: int32
   image: extension<arrow.py_extension_type<Array3DExtensionType>>
   ```
   
   I have tried to use the `ArrowWriter` as follows:
   
    ```python
   from datasets import Array3D, Dataset
   from datasets.arrow_writer import ArrowWriter
   from datasets.features.features import Array3DExtensionType
   
   with ArrowWriter(schema=schema, path=out_path) as writer:
       for table in tables:
           writer.write_row(table)
       writer.finalize()
   ```
   
   And it throws exactly the same error. Strangely, if I use "pydicts" as inputs and `writer.write` instead of `writer.write_row`, the error goes away (setting aside, of course, the inefficiency of converting a `pa.Table` to a pydict and so on; this is not a comparison of computational cost):
   
   ```python
   from datasets import Array3D, Dataset
   from datasets.arrow_writer import ArrowWriter
   from datasets.features.features import Array3DExtensionType
   
   with ArrowWriter(schema=schema, path=out_path) as writer:
       for table in tables:
           table_dict = table.to_pydict()
           writer.write(table_dict)
       writer.finalize()
   ```
   
   Beware: by default it is the `writer_batch_size` argument to the `ArrowWriter` constructor (or to the `.write`/`.write_row` methods) that triggers this issue, since `writer_batch_size` defaults to... 1000, was it? Setting `writer_batch_size = 1` would """solve""" the issue, but of course the resulting arrow file would be ungodly RAM-heavy to load into memory.
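
   That is, something along these lines (same hypothetical `schema`/`out_path`/`tables` as above) avoids the error, at the cost of memory:

   ```python
   from datasets.arrow_writer import ArrowWriter

   # Flushing after every example means combine_chunks never has to
   # concatenate extension-typed chunks, but the file is very heavy to load.
   with ArrowWriter(schema=schema, path=out_path, writer_batch_size=1) as writer:
       for table in tables:
           writer.write(table.to_pydict())
       writer.finalize()
   ```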


[GitHub] [arrow] westonpace closed issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>
URL: https://github.com/apache/arrow/issues/34455


[GitHub] [arrow] plamb-viso commented on issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "plamb-viso (via GitHub)" <gi...@apache.org>.
plamb-viso commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1455122772

   I am using datasets==2.10.0, which requires a pyarrow version greater than 6.0.0; I am currently using pyarrow 8.0.0. I was able to reproduce this error locally, so I should have more information soon.


[GitHub] [arrow] westonpace commented on issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1523785945

   Duplicate of #31868 


[GitHub] [arrow] plamb-viso commented on issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "plamb-viso (via GitHub)" <gi...@apache.org>.
plamb-viso commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1455199238

   Ah, this error is thrown deep enough inside the pxi files that I can't print out values to figure out what is going on.


[GitHub] [arrow] westonpace commented on issue #34455: [Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>>

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1523785261

   The call to `combine_chunks` was introduced in https://github.com/huggingface/datasets/pull/5542, which explains why reverting to an older version of `datasets` fixes the issue.
   
   `combine_chunks` relies on array concatenation. Support for concatenating extension-type arrays was added in https://github.com/apache/arrow/pull/14463, which will be part of 12.0.0.
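
   For reference, the limitation is reproducible in pyarrow alone, without `datasets` (the extension type below is a hypothetical stand-in for `Array2DExtensionType`):

   ```python
   import pyarrow as pa

   # Hypothetical extension type wrapping a list storage type.
   class DummyType(pa.PyExtensionType):
       def __init__(self):
           pa.PyExtensionType.__init__(self, pa.list_(pa.int64()))

       def __reduce__(self):  # required so the type round-trips through pickle
           return DummyType, ()

   storage = pa.array([[1, 2], [3, 4]], pa.list_(pa.int64()))
   arr = pa.ExtensionArray.from_storage(DummyType(), storage)

   # A column with two chunks; combining them requires concatenating
   # extension arrays, which is only implemented in 12.0.0 and later.
   table = pa.Table.from_arrays([pa.chunked_array([arr, arr])], names=["col"])
   table.combine_chunks()  # ArrowNotImplementedError before 12.0.0
   ```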
   
   So your options are:
   
    * Keep `datasets` pinned
    * Upgrade to Arrow 12.0.0 once it releases
    * File a bug with `datasets` and ask them to stop calling `combine_chunks`
   
   

